A toolbox for comparing a decoder (e.g. a machine classifier or a single human labeller) to a group of human labellers.
Traditionally, a classifier is evaluated by its recognition rate, which tells you how many items are classified correctly.
This assumes that a reference label is available, i.e. that the labellers can agree on a single label with acceptable inter-labeller consistency. This is not always possible, e.g. when working with realistic, weak emotions.
Often, similar classes are confused systematically, while other classes are not confused at all.
Does the decoder make the same "errors" as the reference labellers?
Our tool answers this question with an entropy-based measure that works not only for numerical labels but also for categorical ones.
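To illustrate the general idea of an entropy-based comparison, the sketch below measures how much the entropy of the human label distribution for an item changes when the decoder's label is added to it. This is only a minimal illustration, not the exact measure implemented in the toolbox or defined in the paper below; all function names and the toy data are my own.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_increase(human_labels, decoder_label):
    """Entropy of the human label distribution after adding the decoder's
    label, minus the entropy of the human labels alone.  A small or
    negative increase means the decoder confuses the same classes as the
    human labellers; a large increase means it makes atypical errors."""
    return entropy(human_labels + [decoder_label]) - entropy(human_labels)

# Toy example: four human labellers, mostly agreeing on "anger".
humans = ["anger", "anger", "anger", "neutral"]
print(entropy_increase(humans, "anger"))  # agrees with majority: entropy drops
print(entropy_increase(humans, "joy"))    # atypical error: entropy rises
```

In this view, a decoder that picks "anger" is rewarded even though one labeller disagreed, while a decoder that picks "joy" is penalised more heavily because none of the humans made that confusion.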
S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann: "Of All Things the Measure is Man" - Automatic Classification of Emotions and Inter-Labeller Consistency, Proc. ICASSP 2005, pp. 317-320