
Human Emotion Recognition Using Facial Expression and Speech Features Fusion [My PhD Thesis; In Persian]

Recently, emotion recognition has become a major research topic in the area of human-computer interaction (HCI). Emotion is expressed through facial movements, speech prosody, text, body and hand gestures, and various biological signals such as heart rate. Existing work on emotion recognition can be divided into three main groups. The first group contains unimodal, face-based or speech-based recognition systems. The second group combines unimodal recognition systems at the decision level. Approaches in the third group fuse emotion-related features from the underlying modalities and perform classification on the mixed features.

Decision-level fusion ignores possible relationships between features from different modalities. For instance, anger and fear have similar facial cues but different vocal patterns; a classifier using both vocal and facial cues should be able to distinguish these emotions correctly. On the other hand, psychological studies show that emotional cues in the face and speech are not strictly aligned. For example, raising the inner brows, as a facial cue of anger, may occur shortly before or after the increase in speech tone that is the vocal cue of anger. This asynchrony makes feature-level fusion difficult.

The model proposed in this thesis applies information fusion at both the feature and decision levels. Features from speech and face are extracted as time series. As a middle ground, an active buffer stores the last measurement of each feature in order to synchronize interrelated features from face and speech. At any time instant, if a measurement of the feature is available, the buffer is updated with it. If the feature is absent, the buffer content is used as the current feature value, up to a preset time limit. If the feature is absent for longer, the buffer is reset to an emotionally neutral value, such as the mean or median of the feature values. With these active buffers, features from face and speech that relate to the same emotional event have a greater chance of overlapping temporally and therefore of being fused together.

For final recognition, three types of classifiers are combined at the decision level: unimodal speech-based and face-based classifiers are combined with a third classifier that operates on the fused face and speech features. This combination layer makes the proposed model robust, so that recognition can proceed even when one of the modalities is absent.

Several experiments were conducted with the proposed model on two audiovisual emotion databases, eNTERFACE'05 (English, 42 subjects) and TMU-EMODB (Persian, 12 subjects). The results show that asynchronous fusion at the feature and decision levels yields similar recognition rates, about 40% and 65% on the first and second databases, respectively, whereas feature fusion after synchronization raises these rates to 70% and 75%. The results also show that the hybrid fusion in the proposed model is competitive with the best results expected from combining the base classifiers.
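The active buffer described above can be illustrated with a short sketch. The following Python snippet is only an illustration of the synchronization idea summarized in the abstract; the class and parameter names (ActiveBuffer, neutral_value, max_hold) are assumptions for clarity, not the thesis implementation.

```python
# Illustrative sketch of the "active buffer" idea: hold the last observed value
# of a feature so sparse and dense feature streams can be aligned before fusion.
# Names and parameters are hypothetical, not taken from the thesis code.

class ActiveBuffer:
    """Buffers the most recent measurement of one feature."""

    def __init__(self, neutral_value, max_hold):
        self.neutral_value = neutral_value  # e.g. mean or median of the feature
        self.max_hold = max_hold            # frames a stale value may be reused
        self.value = neutral_value
        self.age = 0                        # frames since the last real measurement

    def update(self, measurement=None):
        """Return the feature value to use at the current time instant."""
        if measurement is not None:
            # A fresh measurement is available: store and return it.
            self.value = measurement
            self.age = 0
        else:
            # Feature is absent: reuse the buffered value for a limited time,
            # then fall back to the emotionally neutral value.
            self.age += 1
            if self.age > self.max_hold:
                self.value = self.neutral_value
        return self.value


# Example: aligning a sparsely observed facial cue with a dense prosodic stream.
face_buffer = ActiveBuffer(neutral_value=0.0, max_hold=5)
aligned = [face_buffer.update(m) for m in [0.8, None, None, 0.6, None]]
# aligned == [0.8, 0.8, 0.8, 0.6, 0.6]
```

In this sketch, synchronized values from the face and speech buffers would then be concatenated at each time instant to form the fused feature vector used by the feature-level classifier.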

Muharram Mansoorizadeh
BuAli Sina University
Key research interests:

Multimodal Emotion Recognition

mansoorm_phd_thesis.pdf — PDF document, 6908 KB
