Using AI To Recognize Emotions In Language

In communication, words matter, but how we say them can speak volumes. People often pick up on emotions from the sound of another person's voice, even when those feelings are never stated explicitly.

Researchers at the Max Planck Institute for Human Development wanted to see if machines could do the same. They tested three machine learning models to see how well each could detect different emotions from short voice recordings alone.

Recognizing emotions

“Here we show that machine learning can be used to recognize emotions from audio clips as short as 1.5 seconds,” the authors explain. “Our models achieved an accuracy similar to humans when categorizing meaningless sentences with emotional coloring spoken by actors.”

The researchers drew nonsensical sentences from two datasets, one Canadian and one German. This allowed them to test whether machine learning models could recognize emotions regardless of language, culture, or what the sentences actually meant. Each sentence was trimmed to just 1.5 seconds, roughly the time humans need to recognize an emotion in speech, and short enough that a single clip is unlikely to contain more than one emotion.
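As a rough illustration of that preprocessing step (the authors' actual pipeline is not published here), fixed 1.5-second clips might be prepared along the following lines; the librosa library and a 16 kHz sampling rate are assumptions made for this sketch.

```python
# Illustrative sketch only (not the authors' pipeline): trimming each
# recording to a fixed 1.5-second clip. librosa and a 16 kHz sampling
# rate are assumptions made for this example.
import librosa
import numpy as np

CLIP_SECONDS = 1.5
SAMPLE_RATE = 16000

def make_clip(path: str) -> np.ndarray:
    """Load a recording and keep only its first 1.5 seconds."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    n_samples = int(CLIP_SECONDS * SAMPLE_RATE)
    clip = audio[:n_samples]
    # Pad shorter recordings with silence so every clip has equal length.
    if len(clip) < n_samples:
        clip = np.pad(clip, (0, n_samples - len(clip)))
    return clip
```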

They looked at six emotions: happiness, anger, sadness, fear, disgust, and neutral. Using this data, the researchers trained three types of machine learning models: Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and a hybrid model (C-DNN). DNNs work by analyzing different aspects of sound, like pitch or volume, to figure out the emotions behind them.
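To make that idea concrete, here is a hedged sketch of a feature-based classifier in Python using librosa and PyTorch. The choice of features (pitch, loudness, MFCCs) and the layer sizes are illustrative assumptions, not the configuration the researchers used.

```python
# Hypothetical sketch of a feature-based DNN classifier; feature choices
# and layer sizes are assumptions for illustration.
import librosa
import numpy as np
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "anger", "sadness", "fear", "disgust", "neutral"]

def acoustic_features(clip: np.ndarray, sr: int = 16000) -> torch.Tensor:
    """Summarize a clip as a small vector of pitch, loudness, and timbre statistics."""
    f0 = librosa.yin(clip, fmin=65, fmax=400, sr=sr)        # pitch track
    rms = librosa.feature.rms(y=clip)[0]                     # loudness track
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)    # timbre
    feats = np.concatenate([
        [f0.mean(), f0.std(), rms.mean(), rms.std()],
        mfcc.mean(axis=1),
        mfcc.std(axis=1),
    ])
    return torch.tensor(feats, dtype=torch.float32)

# A small fully connected network mapping the 30-value feature vector to 6 classes.
dnn = nn.Sequential(
    nn.Linear(30, 64),
    nn.ReLU(),
    nn.Linear(64, len(EMOTIONS)),
)

# Usage with a dummy 1.5-second clip of noise:
clip = np.random.randn(int(1.5 * 16000)).astype(np.float32)
logits = dnn(acoustic_features(clip))
predicted = EMOTIONS[int(logits.argmax())]
```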

CNNs instead look for patterns in a visual representation of the sound, its spectrogram, somewhat like recognizing emotions from the rhythm and texture of someone's voice. The hybrid model (C-DNN) combines both techniques, using the acoustic features and the spectrogram together to predict emotions. Finally, the researchers put all three models to the test on both datasets to see how well they worked.
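Again purely as an illustration, a spectrogram-based CNN and a hybrid model in the spirit of a C-DNN might be sketched as follows in PyTorch; the architecture is assumed for the example rather than taken from the study, and the 30-value feature vector refers back to the previous sketch.

```python
# Hypothetical sketch: a CNN over mel spectrograms, plus a hybrid model that
# also consumes the acoustic feature vector from the previous sketch.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Treats the mel spectrogram of a 1.5-second clip as a one-channel image."""
    def __init__(self, n_classes: int = 6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (batch, 32, 1, 1)
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, spec):                         # spec: (batch, 1, mels, frames)
        return self.head(self.conv(spec).flatten(1))

class HybridCDNN(nn.Module):
    """Concatenates the CNN's spectrogram embedding with the feature vector."""
    def __init__(self, n_features: int = 30, n_classes: int = 6):
        super().__init__()
        self.cnn = SpectrogramCNN(n_classes)
        self.head = nn.Linear(32 + n_features, n_classes)

    def forward(self, spec, features):
        embedding = self.cnn.conv(spec).flatten(1)   # reuse the convolutional stack
        return self.head(torch.cat([embedding, features], dim=1))

# Usage with dummy inputs: one spectrogram "image" and one feature vector.
spec = torch.randn(1, 1, 64, 47)                     # (batch, channel, mels, frames)
feats = torch.randn(1, 30)
logits = HybridCDNN()(spec, feats)                   # -> tensor of shape (1, 6)
```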

“We found that DNNs and C-DNNs achieve a better accuracy than only using spectrograms in CNNs,” the researchers explain. “Regardless of model, emotion classification was correct with a higher probability than can be achieved through guessing and was comparable to the accuracy of humans.”

Robust performance

“We wanted to set our models in a realistic context and used human prediction skills as a benchmark,” they continue. “Had the models outperformed humans, it could mean that there might be patterns that are not recognizable by us.”

The similarity in performance between untrained humans and the models suggests that both rely on similar recognition patterns in the voice. It also indicates that building systems which interpret emotional signals swiftly and respond intuitively is feasible across a wide range of contexts.

This breakthrough holds promise for scalable, cost-effective applications in critical domains like therapy and interpersonal communication technology, where grasping emotional context is paramount.

However, the researchers acknowledged certain constraints in their study. They noted that sentences spoken by actors might not capture the complete range of authentic, spontaneous emotions. Furthermore, they emphasized the need for future exploration into audio segments of varying lengths to determine the optimal duration for accurate emotion recognition.
