Are you mad at me?

Dec. 3, 2017, by Oriol Martínez

Accurately inferring the affective state of a subject is a very challenging task, even for humans. How many times have you asked a friend "are you mad at me?" and received an answer different from what you expected? In the KRISTINA project we are teaching an automatic system to perform that task. Is that possible?

Machine learning (ML) is a hot topic within Artificial Intelligence (AI) that tries to teach machines using samples instead of explicitly programming them. For instance, if we want to teach a machine to infer whether a person is happy from their face, we must provide the machine with tons of face images of happy people (e.g. smiling), tons of face images of people that are not happy and, finally, their corresponding labels: happy, not happy.
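The idea can be illustrated with a toy sketch. Here the "feature" is a hypothetical smile-intensity score extracted from each image (a simplification invented for illustration; real systems learn from the pixels themselves):

```python
def train_threshold(features, labels):
    """Learn a threshold separating 'happy' from 'not happy' training samples."""
    happy = [f for f, l in zip(features, labels) if l == "happy"]
    not_happy = [f for f, l in zip(features, labels) if l == "not happy"]
    # Place the decision boundary midway between the two classes
    return (min(happy) + max(not_happy)) / 2

def predict(threshold, feature):
    """Classify a new face from its smile-intensity score."""
    return "happy" if feature > threshold else "not happy"

# Tiny training set: smile scores with expert-provided labels
smile_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
labels = ["happy", "happy", "happy", "not happy", "not happy", "not happy"]

t = train_threshold(smile_scores, labels)
print(predict(t, 0.85))  # happy
```

The machine is never told what "happy" means; it only generalizes from the labelled samples it was shown.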

To provide those labels, we need a "human expert" who must indicate which images contain happy faces and which do not. At the end of the ML process, the automatic system should be able to decide whether a person is happy by observing their face, hopefully with an accuracy similar to that of the expert. Given that the whole spectrum of human emotions is complex (much more than happy or not happy) and that in many cases affective states are shown in a subtle way, what happens if the expert is wrong?

A common solution to minimize human errors is to use a team of experts instead of only one. Then, given a face image, we could provide the machine with the knowledge from different experts in the form of a consensus label derived from all their annotations. An intuitive way to obtain the consensus label is to apply "majority voting" and consider the most voted label as ground truth. However, since some people are better at perceiving emotions than others, majority voting has been found not to be the best solution, especially for tasks more complex than assigning a binary label: happy, not happy.
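Majority voting itself is straightforward; a minimal sketch (with made-up annotations) could look like this:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among a team of annotators."""
    return Counter(labels).most_common(1)[0][0]

# Three hypothetical experts label the same face image
annotations = ["happy", "happy", "not happy"]
print(majority_vote(annotations))  # happy
```

Note that this treats every expert's vote as equally reliable, which is exactly the assumption that breaks down when perception skills differ.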

In KRISTINA we use the Circumplex model of affect to describe emotions [1]. This model identifies Valence and Arousal (V-A) as the underlying dimensions of human affect. Valence refers to how pleasant or unpleasant an emotion is, while Arousal indicates how calm or excited a subject is. Each dimension is represented on a continuous range from -1 to 1. For instance, in the case of Valence, negative values imply unpleasant emotions and positive values pleasant ones. If experts want to label a face image in terms of Valence, they must choose a continuous value within the range [-1, 1]. While the task may not seem difficult, maintaining a consistent criterion across different face images is far from easy. Moreover, we cannot use the majority voting approach because the labels are now continuous values. The easiest solution is to average the labels from multiple experts and use the mean as the ground truth. However, is this the best solution?
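The averaging baseline is equally simple to state (the Valence values below are invented for illustration):

```python
def average_consensus(labels):
    """Naive consensus for continuous annotations: the plain mean."""
    return sum(labels) / len(labels)

# Hypothetical Valence labels in [-1, 1] from five experts for one frame
valence = [0.6, 0.8, 0.5, 0.7, 0.9]
print(average_consensus(valence))  # about 0.7
```

Like majority voting, the mean gives every expert the same weight and ignores any systematic differences in how each one perceives emotions.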

At the Cognitive Media Technologies (CMTech) group we have developed a method that estimates the consensus from several V-A labels and outperforms the simple averaging or majority voting strategies [2]. Our approach simplifies the labeling task by discretizing each dimension of the V-A space into 5 labels: low, mid-low, neutral, mid-high and high. This discretization facilitates the self-consistency of the labels produced by an expert during the whole annotation process. Moreover, we consider that each expert may perceive emotions subjectively. For instance, one expert could be labelling mid-high Valence when all the others are labelling the same faces as high Valence. Our method takes such relations into account and uses them to reach a better consensus. As you can see in the Figure, we consider that each annotator has their own subjective perception model in which the ordinality between labels is maintained and whose variability can be described by a certain bias (deviation) and scale.
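The bias-and-scale intuition can be illustrated with a toy continuous simulation. This is only a simplified linear sketch with synthetic data, not the dynamic subjective ordinal model of [2]: each annotator is assumed to report scale × true value + bias, and we fit those two parameters per annotator before averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" Valence for 50 face images, and three hypothetical annotators
# who each perceive it through their own bias (offset) and scale.
true_valence = rng.uniform(-1, 1, size=50)
scales = np.array([1.0, 0.7, 1.2])
biases = np.array([0.0, 0.2, -0.1])
noise = rng.normal(0, 0.05, size=(3, 50))
labels = scales[:, None] * true_valence + biases[:, None] + noise

# Use the plain average as an initial consensus estimate...
consensus = labels.mean(axis=0)

# ...then fit each annotator's (scale, bias) against it by least squares.
A = np.column_stack([consensus, np.ones_like(consensus)])
fits = [np.linalg.lstsq(A, y, rcond=None)[0] for y in labels]

# Invert each annotator's model before averaging: a bias/scale-aware consensus.
adjusted = np.mean([(y - b) / s for (s, b), y in zip(fits, labels)], axis=0)
print(np.corrcoef(adjusted, true_valence)[0, 1])
```

Even this crude version recovers a consensus that tracks the hidden truth closely; the actual method additionally handles the ordinal, discretized labels and their dynamics over time.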


Providing ML systems with better labelled data will improve their prediction skills. However, the work of labelling tons and tons of data will continue for every problem we want to solve. For the KRISTINA project, a team of 14 annotators has annotated more than 5 hours of video using the methodology explained above. As a result, the KRISTINA avatar has started to understand the emotions of the people talking to it.


[1] J. A. Russell. “A circumplex model of affect”. Journal of Personality and Social Psychology, 1980.
[2] A. Ruiz, O. Martinez, X. Binefa and F.M. Sukno. “Fusion of Valence and Arousal Annotations through Dynamic Subjective Ordinal Modelling”. 12th IEEE International Conference on Face and Gesture Recognition, Washington, DC, USA, 2017.
