Building the Emotion Recognition Engine

It is extraordinary how technological evolution leaves its mark on the evolution of mankind. At the same time, it is equally extraordinary how deep learning has evolved.

With the rapid development of deep learning, emotion recognition has received more and more attention from researchers. Computers can approach their full potential for intelligence only when they can handle human emotions, and emotion recognition is central to that goal.

Our Emotion Recognition Engine

RepsMate’s emotion recognition engine is meant to improve the interactions between reps and customers in call centers. Its development drew on recent science about emotions and how we learn to express and recognize them. Developing the solution involved both unsupervised and supervised machine learning approaches, simulating the way humans learn emotions, a complex and intricate aspect of human interaction, while harnessing the power of AI to process huge amounts of data.

The first stage involved an unsupervised exploration, where models were trained on 2000 hours of call center conversations to identify variables relevant to emotion recognition – both from text & voice.

These models can synthesize contextual information into vectors of fixed length (e.g., 728 variables for textual information, 512 variables for speech), which are then used as a foundation for the emotion recognition model. This mirrors how we humans, in the first years of life, explore the environment and detect relevant cues for “reading” others’ emotions. A higher pitch and a more rapid pace could mean positive excitement or high irritability; a louder voice could indicate anger or just the desire to be heard. In the same way, the models detected the variables relevant to the task.
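As a rough illustration of what "fixed length" means here, the sketch below averages a variable number of per-token or per-window vectors into one vector per modality and then joins the two. The article does not say how the vectors are actually produced or combined, so mean pooling, concatenation, and the names `mean_pool` and `fuse` are assumptions made for the example; the toy widths stand in for the 728 and 512 variables mentioned above.

```python
from statistics import fmean


def mean_pool(frames):
    """Average equal-width frame vectors into a single vector of that width."""
    if not frames:
        raise ValueError("need at least one frame")
    width = len(frames[0])
    if any(len(frame) != width for frame in frames):
        raise ValueError("all frames must have the same width")
    # zip(*frames) walks the frames column by column.
    return [fmean(column) for column in zip(*frames)]


def fuse(text_vector, speech_vector):
    """Concatenate the two modality vectors (an assumed fusion strategy)."""
    return list(text_vector) + list(speech_vector)


# Toy data: 2 text tokens of width 3, 3 audio windows of width 2.
text_frames = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]
speech_frames = [[1.0, 1.0], [3.0, 5.0], [5.0, 9.0]]

text_vector = mean_pool(text_frames)      # width 3, whatever the token count
speech_vector = mean_pool(speech_frames)  # width 2, whatever the window count
features = fuse(text_vector, speech_vector)
print(len(features), features)
```

However long the conversation segment is, the output width depends only on the width of the frames, which is what lets a downstream model consume it.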

In the next stage, the approach used in supervised learning was inspired by good practices in the psychological assessment of human behavior.

Trained human assessors (coders) systematically observed recorded call center conversations and were instructed to identify predetermined relevant emotions. These emotionally labeled conversation sequences were used as input for training the model.

Initially, we determined a list of emotions relevant to improving call center interactions (customer satisfaction, prevention of agent burnout and attrition, etc.) and defined them in an understandable, operational manner. The initial list drew on two sources. The first was a scientific literature review, which combined the affective-circumplex model with cognitive appraisal theories of discrete emotions and relied on a meta-analysis** that integrates the answers of over 40,000 respondents regarding the relationship between customer emotions and consumer behavior (satisfaction, buying behavior and word-of-mouth). The second was a survey we ran with over 50 experienced customer support & sales representatives, which gathered data about the frequency and relevance of the emotions that show up in sales & customer support conversations.

We classified emotions into positive and negative and decided to include somewhat different emotions for customers (grateful, delighted, satisfied, angry, irritated, and unbelieving) and for agents (proud, joyful, disappointed, contemptuous, bored, and worried/unsure). These emotions matter to managers because they help them notice what is needed to prevent attrition, burnout, or other counter-productive behaviors.
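The label set above can be written down as a small data structure, which makes it easy to reject labels that fall outside the agreed list. This is only a sketch: the positive/negative split follows the ordinary meaning of each word rather than an explicit assignment in the article, and `LabeledSegment`, `valence_of`, and the timestamp fields are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Emotion labels from the article, grouped by speaker role.
# The valence grouping is our reading of the list, not a published mapping.
EMOTIONS = {
    "customer": {
        "positive": ("grateful", "delighted", "satisfied"),
        "negative": ("angry", "irritated", "unbelieving"),
    },
    "agent": {
        "positive": ("proud", "joyful"),
        "negative": ("disappointed", "contemptuous", "bored", "worried/unsure"),
    },
}


def valence_of(speaker, emotion):
    """Return 'positive' or 'negative' for a label in the speaker's list."""
    if speaker not in EMOTIONS:
        raise ValueError(f"unknown speaker role: {speaker!r}")
    for valence, names in EMOTIONS[speaker].items():
        if emotion in names:
            return valence
    raise ValueError(f"{emotion!r} is not in the {speaker} label set")


@dataclass(frozen=True)
class LabeledSegment:
    """One coded stretch of a conversation (hypothetical record layout)."""
    speaker: str
    start_s: float
    end_s: float
    emotion: str

    def __post_init__(self):
        valence_of(self.speaker, self.emotion)  # raises on an unknown label
        if self.end_s <= self.start_s:
            raise ValueError("segment must end after it starts")

    @property
    def valence(self):
        return valence_of(self.speaker, self.emotion)


segment = LabeledSegment("customer", 12.0, 15.5, "irritated")
print(segment.emotion, segment.valence)
```

Keeping separate lists per role reflects the decision described above to track somewhat different emotions for customers and agents.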

Further on, a team of psychologists was trained to accurately detect the emotions in call center conversations and label the segments of conversation with the identified emotion. After the first hours of labelled conversations, we calibrated the list.

For example, “curious” as a customer emotion was in certain contexts a positive emotion inviting the rep to provide more detailed information, while in other contexts “curious” was hiding irony, requiring the rep to change approach.

This multimodal, “natural” way of detecting emotions – both from text & voice – has been applied to over 100 hours of call center conversations, creating a database of emotionally labelled sequences of conversations that have been used to train a model. 

In the third stage, we integrated the two approaches described above.

An emotion recognition training step starts with the synthesized contextual information of the conversation, projects it into a high-dimensional vector space (which opens up multiple pathways to link the contextual features to the labeled emotional features), applies positional encodings (to give the data a time-series behavior, so that the model learns that the context elements happened in a certain order), applies “self-attention” layers (to discover inter-dependence between elements within the context), and tries to fit the labeled emotions associated with the given context. The error between the emotions predicted by the model in this step and the labeled emotions is then used to calibrate the weights for a more accurate prediction in the next training step, and so on.
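The forward half of such a training step can be sketched in plain Python. This is a toy under stated assumptions, not RepsMate's model: the sizes, the single attention head, sinusoidal positional encodings, mean pooling, and the cross-entropy error are all choices made for the example, and the weight update that would follow from the error is left out.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def positional_encoding(position, width):
    """Sinusoidal encoding: sine on even indices, cosine on odd ones."""
    encoding = []
    for i in range(width):
        angle = position / 10000 ** ((i - i % 2) / width)
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def self_attention(steps, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the context steps."""
    queries = [matvec(w_q, s) for s in steps]
    keys = [matvec(w_k, s) for s in steps]
    values = [matvec(w_v, s) for s in steps]
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended


def forward(context, params):
    """Project, add positions, attend, pool, and score each emotion."""
    width = len(params["proj"])
    steps = [matvec(params["proj"], x) for x in context]
    steps = [[v + p for v, p in zip(s, positional_encoding(t, width))]
             for t, s in enumerate(steps)]
    attended = self_attention(steps, params["w_q"], params["w_k"], params["w_v"])
    pooled = [sum(column) / len(attended) for column in zip(*attended)]
    return softmax(matvec(params["out"], pooled))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
IN_DIM, WIDTH, N_EMOTIONS = 6, 8, 3  # toy sizes, not the real ones
params = {
    "proj": random_matrix(WIDTH, IN_DIM, rng),
    "w_q": random_matrix(WIDTH, WIDTH, rng),
    "w_k": random_matrix(WIDTH, WIDTH, rng),
    "w_v": random_matrix(WIDTH, WIDTH, rng),
    "out": random_matrix(N_EMOTIONS, WIDTH, rng),
}
context = [[rng.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(4)]

probs = forward(context, params)
labeled_emotion = 1                      # index of the coder's label
loss = -math.log(probs[labeled_emotion])  # cross-entropy error for this step
print(probs, loss)
```

In a real training loop, the gradient of `loss` with respect to every matrix in `params` would be used to nudge the weights before the next step; a deep learning framework would normally handle that part.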

The accuracy of emotion recognition in this first version of the model looks promising. For the multimodal approach, which combines input from both text and voice/speech, F1 scores range from 0.751 to 0.83 after controlling for the distribution of emotions within the sample. F1 is the harmonic mean of precision and recall, so these figures are best read as scores of roughly 75% to 83% rather than as a plain share of correctly detected emotions. In other words, the current version of RepsMate detects positive emotions expressed in conversations with scores ranging from 75% for customers’ emotions to over 79% for reps’ positive emotions.
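For readers less familiar with the F1 measure quoted above, the sketch below computes it for one emotion from a list of true and predicted labels. The data is invented for the example, and `f1_score` is a hand-written helper rather than a call into any particular library.

```python
def f1_score(true_labels, predicted_labels, positive):
    """F1 for one label: the harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both 0
    precision = tp / (tp + fp)  # how many "angry" predictions were right
    recall = tp / (tp + fn)     # how many truly "angry" segments were found
    return 2 * precision * recall / (precision + recall)


# Invented, imbalanced sample: 2 angry segments among 10.
true_labels = ["angry"] * 2 + ["satisfied"] * 8
predicted = ["angry"] * 4 + ["satisfied"] * 6  # 2 hits, 2 false alarms

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
angry_f1 = f1_score(true_labels, predicted, "angry")
print(accuracy, angry_f1)
```

On this sample the plain hit rate is 0.8 while the F1 for "angry" is about 0.67, which shows why F1 is the more cautious figure when some emotions are rare.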

The negative emotions expressed in conversations are identified with scores of over 82% for reps and 78% for customers. Just like in real life, RepsMate is more vigilant and accurate in detecting negative emotions (irritated, angry, skeptical, disappointed), as they are more relevant for identifying (dangerous) situations that require immediate action. Being able to quickly identify the critical incidents where reps expressed disappointment (dezamagit), insecurity (nesigur), confusion (confuz), or irritation (iritat) provides valuable input for coaching & development meetings, and is much more cost-effective than randomly sampling conversations for such learning discussions.

The higher performance in emotion detection for reps is due to the characteristics of the data set: during the conversations the reps speak more and more clearly (their side of the recording has better audio quality), and always within the same set-up. As the database grows, we expect the accuracy of the predictions to grow with it.

To increase the accuracy of emotion recognition, the next step will involve providing context to the emotionally labelled sequences of conversations, and training the model to make inferences about underlying appraisals, such as valence, arousal, dominance, fairness, or norm appropriateness. 

At RepsMate, we ground our work in extensive research, providing our clients with evidence-based practical solutions that facilitate communication between reps & customers and make interactions count.

*Cowen, A., Elfenbein, H.A., & Laukka, P. (2018). Mapping 24 emotions conveyed by brief human vocalization. American Psychologist. DOI: 10.1037/amp0000399

**Kranzbuhler, A.M., Zerres, A., Kleijnen, M.H.P., & Verlegh, P.W.J. (2018). Beyond valence: A meta-analysis of discrete emotions in firm-customer encounters. Journal of the Academy of Marketing Science.

Make interactions count!


