Harold Rodriguez - Research Home

Detecting Emotion in the Voice: A Model and Software Implementation

The proposed model uses a 3-dimensional space to map the emotions inherent in fluctuations of the human voice, while the software implementation analyzes these fluctuations.

Background

Voice recognition has advanced considerably in the past few years, largely due to increased computing power. It is not uncommon today to achieve 95% accuracy in word recognition after voice calibration. But word recognition and emotion detection are not the same thing. Whether one says, “Tell him to get down here,” or screams, “Tell him to get down here!” into the microphone, the speech engine will calmly transcribe the same words either way (assuming the microphone still works).

Ultimately, one would like to unify both kinds of recognition, so that “Ah,” and “Ah!” are distinct. While the word recognition camp seems to have made great strides in past decades, the emotion recognition camp is struggling to keep up.

Current Detection Methods

Today’s greatest need for emotion detection software comes from the telecommunications industry. For example, the manager of a large call center could run software on the telephone lines to analyze customer calls. This would spare the manager from reviewing every call; he could instead focus on those flagged as “angry”. He might even consider re-evaluating an employee with a high number of “first happy, then angry” customers.

Current detection methods are only good at detecting four basic emotions: Anger, Sadness, Happiness, and Neutrality. Even in such a limited scope, the successful detection rate is near 75%, hardly close to the 95% for word recognition.

Symbolic Mapping

These methods usually rely on a relation between arousal and valence in the voice. Arousal is described as “high” or “low”, while valence is described as “positive” or “negative”. For instance, ‘anger’ would have high arousal and negative valence.

These parameters have some drawbacks. For one, they do not translate directly into anything computational: one cannot read “valence” off a waveform. Pitch and volume, however, can be used as rough computational stand-ins for the two parameters.
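As an illustration of how pitch and volume might be computed from a waveform, here is a minimal sketch. It is not the method used in this research: it assumes volume is taken as root-mean-square amplitude and pitch is estimated from the autocorrelation peak, and the names rms_volume and estimate_pitch, along with the 75-500 Hz search range, are hypothetical choices made for this example.

```python
import math


def rms_volume(samples):
    """Root-mean-square amplitude, used here as a stand-in for volume."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak.

    fmin and fmax bound the search to a plausible speech range; both are
    assumptions of this sketch. Returns 0.0 if no pitch can be estimated.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    if max_lag <= min_lag:
        return 0.0
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Usage: a synthetic 200 Hz tone sampled at 8000 Hz for 0.1 seconds.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
print(estimate_pitch(tone, rate), rms_volume(tone))
```

Real speech is far less regular than a pure tone, so a practical implementation would likely analyze short overlapping frames and smooth the results.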

3-Dimensional Mapping

This research ambitiously proposes including several low-percentage emotions (those detected with less than roughly 40% accuracy) by using a vector in R³ to analyze a voice. The voice is modeled with three calculable quantities: pitch, volume, and idle percentage. Pitch and volume are determined through waveform analysis, and idle percentage is defined as the ratio of the duration of the inaudible segments to the duration of the audible segments in the recording buffer.
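The idle percentage defined above can be sketched as follows. This is only an illustration under stated assumptions: the buffer is split into fixed-length frames, and a frame counts as inaudible when its root-mean-square level falls below a threshold. The function name idle_ratio, the 20 ms frame length, and the 0.02 threshold are hypothetical and do not come from the original implementation.

```python
import math


def idle_ratio(samples, sample_rate, frame_ms=20, threshold=0.02):
    """Ratio of inaudible duration to audible duration in a buffer.

    A frame is treated as inaudible when its RMS level is below `threshold`
    (an assumed cutoff for samples scaled to the range -1.0..1.0).
    Returns math.inf when the buffer contains no audible frames.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    idle = audible = 0  # durations, counted in samples
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = math.sqrt(sum(s * s for s in frame) / len(frame))
        if level < threshold:
            idle += len(frame)
        else:
            audible += len(frame)
    if audible == 0:
        return math.inf
    return idle / audible


# Usage: 0.2 s of tone followed by 0.1 s of silence, sampled at 8000 Hz.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(1600)]
silence = [0.0] * 800
print(idle_ratio(tone + silence, rate))
```

Because both durations are counted in samples at the same rate, their ratio equals the ratio of the durations in seconds. Together with a pitch and a volume estimate, this value would form the third coordinate of the vector.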

Using these three parameters, we can attempt to detect Anger, Sadness, Happiness, Neutrality, Laughing, Boredom, and Defensiveness.

View the accompanying paper [PDF]
