Muharram Mansoorizadeh
PhD Candidate, Computer Eng.
Muharram Mansoorizadeh, BuAli Sina University
Key research interests: Multimodal Emotion Recognition
-
Multimodal information fusion application to human emotion recognition from face and speech
- Abstract: Multimedia content is composed of several streams that carry information in audio, video, or textual channels. Classifying and clustering multimedia content requires extracting and combining information from these streams. The streams constituting a piece of multimedia content naturally differ in scale, dynamics, and temporal patterns. These differences make it difficult to combine the information sources using classic combination techniques. We propose an asynchronous feature-level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The target space can be used for clustering or classification of the multimedia content. As a representative application, we used the proposed approach to recognize basic affective states from speech prosody and facial expressions. Experimental results on two audiovisual emotion databases with 42 and 12 subjects show that the proposed system performs significantly better than the unimodal face-based and speech-based systems, as well as synchronous feature-level and decision-level fusion approaches. Keywords: Multimodal feature extraction - Multimodal information fusion - Human computer interaction - Multimodal emotion recognition
-
Face and Facial Feature Tracking
- Here are some initial results of facial feature tracking.
-
Speech Emotion Recognition: Comparison of Speech Segmentation Approaches
- Recognition of emotional states carried in speech is of great interest in modern human computer interaction. To reliably detect the aroused emotion, a sufficiently long continuous speech segment is required. This research analyzes different approaches to segmenting speech signals. The Berlin emotional speech database is used to generate the data set. Time-frame-based and voiced-segment-based approaches are applied and compared. The experimental results show that accurate emotion recognition is obtained when the speech segments are longer than one second or are composed of 10 to 12 voiced segments. Based on the findings of this research, voiced-segment-based segmentation generates more accurate results than the other methods.
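The two segmentation strategies compared above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the short-time energy threshold is an assumed stand-in for whatever voicing detector was actually applied.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) sample indices, end exclusive


def fixed_frames(n_samples: int, sample_rate: int, seconds: float) -> List[Span]:
    """Time-frame-based segmentation: consecutive, non-overlapping frames."""
    step = max(1, int(sample_rate * seconds))
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]


def voiced_segments(signal: List[float], window: int, threshold: float) -> List[Span]:
    """Voiced segmentation, approximated by short-time energy (an assumption)."""
    segments: List[Span] = []
    start = None
    for i in range(0, len(signal), window):
        chunk = signal[i:i + window]
        energy = sum(x * x for x in chunk) / len(chunk)
        if energy > threshold:
            if start is None:
                start = i          # a voiced run begins here
        elif start is not None:
            segments.append((start, i))  # the voiced run ended at this window
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments


def group_segments(segments: List[Span], group_size: int) -> List[List[Span]]:
    """Bundle consecutive voiced segments into units of `group_size` segments."""
    return [segments[i:i + group_size] for i in range(0, len(segments), group_size)]
```

With a `group_size` of 10 to 12, `group_segments` would produce analysis units of the size the results above report as sufficient; `fixed_frames` with `seconds` set above 1.0 gives the time-frame counterpart.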
-
Facial Feature Localization(includes demo)
- I've developed some MATLAB(R) scripts for facial feature localization. The included clip is a demo of the scripts in action. I'm looking for guidelines on how to package and publish the scripts so that they can be used by others.
-
The Database
-
Sample clip from the audio-visual emotion database
- The subject acts like a man enjoying the blue sky.
-
Audio-Visual Emotion Database in Persian Language
- I'm creating an audio-visual emotion database in the Persian language. Currently, the recording phase is complete and the audio/speech data is archived as digital streams. Clipping and annotation have just begun.
-
Hybrid feature and decision level fusion of face and speech information for bimodal emotion recognition
- A hybrid feature and decision level information fusion architecture is proposed for human emotion recognition from facial expression and speech prosody. An active buffer stores the most recent information extracted from face and speech. This buffer allows fusion of asynchronous information by keeping track of individual modality updates. The contents of the buffer are fused at feature level if their respective update times are close to each other. Based on the classifiers' reliability, a decision level fusion block combines the results of the unimodal speech-based and face-based systems and the feature level fusion based classifier. Experimental results on a database of 12 people show that the proposed fusion architecture performs better than unimodal classification, pure feature level fusion, and decision level fusion.
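One way to picture the decision level fusion block is as a reliability-weighted average of the class scores from the three classifiers. The sketch below assumes that combination rule; the abstract does not state the actual rule, and the function name, score format, and reliability values are all hypothetical.

```python
from typing import Dict, List


def fuse_decisions(scores: List[Dict[str, float]],
                   reliabilities: List[float]) -> Dict[str, float]:
    """Combine per-classifier class scores, weighting each by its reliability.

    A weighted average is an assumed combination rule, used for illustration.
    A classifier whose modality is missing can be given a reliability of 0.
    """
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("at least one classifier needs a positive reliability")
    labels = set().union(*scores)  # every emotion label seen by any classifier
    fused = {label: 0.0 for label in labels}
    for classifier_scores, weight in zip(scores, reliabilities):
        for label in fused:
            fused[label] += weight * classifier_scores.get(label, 0.0)
    return {label: value / total for label, value in fused.items()}


# Hypothetical scores from the face, speech, and fused-feature classifiers.
face = {"anger": 0.6, "fear": 0.4}
speech = {"anger": 0.2, "fear": 0.8}
fused_features = {"anger": 0.3, "fear": 0.7}
result = fuse_decisions([face, speech, fused_features], [0.5, 0.7, 0.8])
decision = max(result, key=result.get)
```

Setting a reliability to 0 drops that classifier from the vote, which is one simple way to keep producing a decision when a modality is unavailable.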
-
Human Emotion Recognition Using Facial Expression and Speech Features Fusion [My PhD thesis; in Persian]:
- Recently, emotion recognition has been a major research topic in the area of human computer interaction (HCI). Emotion is expressed via facial movements, speech prosody and text, body and hand gestures, and various biological signals such as heart rate. Available work on emotion recognition can be divided into three main groups. The first group contains unimodal face-based or speech-based recognition systems. The second group combines unimodal recognition systems at the decision level. Approaches in the third group fuse emotion-related features from the underlying modalities and perform classification using the mixed features. Decision level fusion ignores possible relationships between features from different modalities. For instance, anger and fear have similar facial cues but their vocal patterns are different. It is desirable that a classifier using both vocal and facial cues would correctly distinguish these emotions. On the other hand, psychological studies show that emotional cues in face and speech are not strictly aligned. For example, raising the inner brows, as a facial cue of anger, could be seen shortly before or after the increase in speech tone, as the vocal cue of anger. This asynchrony makes feature fusion difficult. The model proposed in this thesis applies information fusion at both the feature and decision levels. Features from speech and face are extracted as time series. An intermediate active buffer stores the last feature measurement to synchronize interrelated features from face and speech. At any time instant, if a measurement of the feature is available, the buffer is updated with it. If the feature is absent, the buffer content is used as the current feature value, but only up to some preset time. For a longer absence of the feature, the buffer is updated with an emotionally neutral value, such as the mean or median of the feature values.
With these active buffers, features from face and speech that are related to the same emotional event have a better chance of overlapping temporally and hence of being fused together. For the final recognition of emotions, three types of classifiers are combined at the decision level: the unimodal speech-based and face-based classifiers are combined with a third classifier operating on fused features from face and speech. This combination layer makes the proposed model robust, so that the recognition process can go on when one of the modalities is absent. Several experiments were conducted with the proposed model using two audiovisual emotion databases, namely eNterface'05 (English, 42 subjects) and TMU-EMODB (Persian, 12 subjects). The results show that asynchronous fusion at the feature and decision levels yields similar recognition rates: 40% and 65% for the first and second databases, respectively. However, feature fusion after synchronization raises these rates to up to 70% and 75%. The results also show that the hybrid fusion in the proposed model is competitive with the best results expected from combining the base classifiers.
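The active buffer rule described in this abstract (update on a new measurement, hold the last value for a preset time, then fall back to a neutral value) can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis implementation: the class and function names, the feature names, and the 0.5 second hold time are hypothetical.

```python
from typing import Dict, Optional


class ActiveBuffer:
    """Holds the latest measurement of one face or speech feature."""

    def __init__(self, neutral: float, max_hold: float) -> None:
        self.neutral = neutral    # emotionally neutral value, e.g. feature mean or median
        self.max_hold = max_hold  # preset time (seconds) a stale value may be reused
        self.value = neutral
        self.last_update: Optional[float] = None

    def read(self, now: float, measurement: Optional[float] = None) -> float:
        """Return the feature value to use at time `now`."""
        if measurement is not None:
            # A fresh measurement always overwrites the buffer.
            self.value = measurement
            self.last_update = now
        elif self.last_update is None or now - self.last_update > self.max_hold:
            # Feature absent for too long: fall back to the neutral value.
            self.value = self.neutral
        # Otherwise the feature is only briefly absent and the stored value is reused.
        return self.value


def fused_vector(buffers: Dict[str, ActiveBuffer], now: float,
                 measurements: Dict[str, float]) -> Dict[str, float]:
    """Read every buffer at `now` to form one hybrid face-and-speech feature vector."""
    return {name: buf.read(now, measurements.get(name)) for name, buf in buffers.items()}


# Hypothetical features: one facial cue and one vocal cue, held for 0.5 s.
buffers = {"brow_raise": ActiveBuffer(0.0, 0.5), "pitch": ActiveBuffer(120.0, 0.5)}
v0 = fused_vector(buffers, 0.0, {"brow_raise": 1.0})  # only the facial cue arrives
v1 = fused_vector(buffers, 0.3, {"pitch": 180.0})     # vocal cue arrives 0.3 s later
v2 = fused_vector(buffers, 0.9, {})                   # both cues long absent
```

At time 0.3 the buffered brow measurement is still within its hold time, so it lands in the same vector as the new pitch measurement even though the two cues were not observed simultaneously.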


W3C Workshop on Emotion Markup Language
