Poster presentation instructions

1. Poster presentation participants need to submit a presentation title and abstract using the provided MS Word template. Please make sure to follow the instructions and the example given in the template.

2. During the KMSO event, participants are expected to deliver a ~10-15 minute presentation in front of individual poster judges, followed by Q&A.

Emotion Recognition from Speech Audio through Convolutional Neural Networks

Elvin Ko1, Yanchen Liu2, Xiaofan (Fred) Jiang2

1AEDT, Bergen County Academies

2Department of Electrical Engineering, Columbia University

We study the problem of recognizing human emotion from speech audio through machine learning approaches. Specifically, we explore how deep learning models can distinguish emotional characteristics in human speech directly from raw audio signals, rather than analyzing textual representations extracted from the speech through natural language processing. Emotion recognition from raw speech audio is advantageous because a person's emotion can often be captured from (and sometimes only from) acoustic characteristics such as tone, pitch, and speed. This emotion detection is intended for use in a mental status exam, which more broadly analyzes multiple aspects of speech and video to build a mental health profile for a user. In our study, we use the IEMOCAP dataset [1], which contains 10,039 speech audio samples labeled with 7 emotions, and employ a convolutional neural network (CNN) with 4 one-dimensional convolutional layers and two fully connected layers, implemented in Keras/TensorFlow. Mel-frequency cepstral coefficients (MFCCs) of the audio samples are used as inputs to the model for training and testing. When classifying all 7 emotions, the model achieves a test accuracy of up to 46%. In a binary classification between anger and the neutral state, the model achieves an accuracy of 92%, while a ternary classification of sadness, anger, and the neutral state achieves an accuracy of 73%. We posit that the relatively low 7-class accuracy is, at least partially, accounted for by two factors. First, the dataset is unbalanced, with over 10 times as many samples for happiness and anger as for surprise, so the model struggles to predict emotions with less training data (the confusion matrix confirms this). Second, given the size of the dataset relative to the size of the model, the model frequently overfits. To address the unbalanced data, we perform data augmentation: the original audio of the emotions with less data (surprise and happiness) is time-shifted to generate additional samples. To address overfitting, we add dropout and pooling layers after each convolutional layer to improve the generalizability of the model. Using these methods, we achieve an accuracy of 54% when classifying all 7 emotions [2]. Future work includes combining multiple datasets to create a more robust and balanced training set that accounts for the limitations encountered in our development.
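To make the pipeline described in the abstract concrete, a minimal sketch of the feature extraction, time-shift augmentation, and 1-D CNN is given below. This is an illustration under stated assumptions, not the authors' exact code: the hyperparameters (40 MFCC coefficients, 300 frames per clip, filter counts, kernel sizes, dropout rate) and the helper names extract_mfcc, time_shift, and build_model are hypothetical; only the overall structure (MFCC inputs, four 1-D convolutional layers with pooling and dropout, two fully connected layers in Keras/TensorFlow) follows the abstract.

    # Sketch of the described pipeline; hyperparameters are illustrative assumptions.
    import numpy as np
    import librosa
    from tensorflow.keras import layers, models

    N_MFCC = 40      # assumed number of MFCC coefficients
    N_FRAMES = 300   # assumed fixed number of time frames per clip
    N_CLASSES = 7    # seven emotion labels in IEMOCAP

    def extract_mfcc(path, sr=16000):
        """Load an audio file and return a fixed-size (N_FRAMES, N_MFCC) MFCC matrix."""
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (frames, n_mfcc)
        # Pad or truncate along the time axis so every sample has the same shape.
        if mfcc.shape[0] < N_FRAMES:
            mfcc = np.pad(mfcc, ((0, N_FRAMES - mfcc.shape[0]), (0, 0)))
        return mfcc[:N_FRAMES]

    def time_shift(y, max_shift=4000):
        """Augmentation for under-represented emotions: circularly shift the raw
        waveform by a random number of samples before feature extraction."""
        shift = np.random.randint(-max_shift, max_shift)
        return np.roll(y, shift)

    def build_model():
        model = models.Sequential([
            layers.Input(shape=(N_FRAMES, N_MFCC)),
            # Four 1-D convolutional blocks, each followed by pooling and dropout
            # to reduce overfitting on the relatively small dataset.
            layers.Conv1D(64, 5, activation="relu", padding="same"),
            layers.MaxPooling1D(2),
            layers.Dropout(0.3),
            layers.Conv1D(64, 5, activation="relu", padding="same"),
            layers.MaxPooling1D(2),
            layers.Dropout(0.3),
            layers.Conv1D(128, 3, activation="relu", padding="same"),
            layers.MaxPooling1D(2),
            layers.Dropout(0.3),
            layers.Conv1D(128, 3, activation="relu", padding="same"),
            layers.GlobalMaxPooling1D(),
            # Two fully connected layers; the last outputs class probabilities.
            layers.Dense(64, activation="relu"),
            layers.Dense(N_CLASSES, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

For the binary (anger vs. neutral) and ternary (sadness, anger, neutral) experiments reported above, only N_CLASSES and the label set would change; the rest of the sketch stays the same.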

References

[1] IEMOCAP Database, https://sail.usc.edu/iemocap/

[2] We also experimented with deeper architectures such as ResNet-18, but given the size of the model relative to the small dataset, they struggled even more with overfitting, reaching an accuracy of 49%.