Speech Emotion Recognition
Guided by:
Mrs. R.K. Patole
111707049-Pragya Sharma
141807005-Kanchan Itankar
141807008-Saniya Shaikh
141807009-Triveni Vyavahare
Aim
Speech Emotion
Recognition using
Machine Learning
[1] Speech Emotion Recognition using Neural Network and MLP Classifier (Jerry
Joy, Aparna Kannan, Shreya Ram, S. Rama)
● MLP Classifier
● 5 features extracted: MFCC, Contrast, Mel Spectrogram, Chroma
and Tonnetz
● Accuracy 70.28%
[2] Voice Emotion Recognition using CNN and Decision Tree (Navya Damodar,
Vani H Y, Anusuya M A.)
● Decision Tree, CNN
● MFCCs extracted
● Accuracy 72% CNN, 63% Decision Tree
Literature Review
● To build a model that recognizes emotion from speech using the librosa and
sklearn libraries and the RAVDESS dataset.
● To present a classification model of emotion elicited by speech, based on an
MLP (multilayer perceptron) neural network and acoustic features such as the
Mel Frequency Cepstral Coefficients (MFCC). The model is trained to
classify eight different emotions (calm, happy, fearful, disgust, angry, neutral,
surprised, sad).
Objective
Applications
Business Marketing
Suicide prevention
Voice Assistant
● As human beings, speech is among the most natural ways we express ourselves. We depend
on it so much that we recognize its importance when resorting to other forms of
communication, such as emails and text messages, where we often use emojis to express the
emotions associated with the message. Because emotions play a vital role in communication,
detecting and analyzing them is of vital importance in today’s digital world of remote
communication.
● Emotion detection is a challenging task because emotions are subjective; there is no
common consensus on how to measure them. We define a Speech Emotion Recognition
system as a collection of methodologies that process and classify speech signals to detect
the emotions embedded in them.
Motivation
● Human-machine interaction is widely used nowadays in many applications, and speech is one
of its main media. One of the main challenges in human-machine interaction is the detection of
emotion from speech.
● Emotion can play an important role in decision making. Emotion can also be detected from
various physiological signals. If emotion can be recognized properly from speech, a system can
act accordingly. Emotion is identified by extracting features, or different characteristics, from
the speech, and the system must be trained on a large speech database to be accurate.
● The emotional speech dataset RAVDESS is selected, emotion-specific features are extracted
from its recordings, and finally an MLP classification model is used to recognize the emotions.
Introduction
System Block Diagram
Methodology
Preprocessing → Feature Extraction → Classification
1.Preprocessing
Removal of unwanted noise from the speech signal.
➢ Silence removal
➢ Background noise removal
➢ Windowing
➢ Normalization
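A minimal numpy sketch of two of these steps, peak normalization and silence removal by an amplitude threshold (the threshold value here is an illustrative assumption; a library routine such as librosa.effects.trim does the equivalent with a dB threshold):

```python
import numpy as np

def trim_silence(y, threshold=0.01):
    """Peak-normalise a signal, then trim leading/trailing silence.

    `threshold` is a fraction of the peak amplitude, chosen for illustration.
    """
    y = y / (np.max(np.abs(y)) + 1e-12)           # normalisation to [-1, 1]
    voiced = np.where(np.abs(y) > threshold)[0]   # samples above threshold
    if voiced.size == 0:
        return y                                  # all silence: nothing to trim
    return y[voiced[0]:voiced[-1] + 1]            # silence removal
```

Windowing happens later, at feature-extraction time, when the signal is cut into short overlapping frames.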
2.Feature Extraction
● Extract features from the audio file.
● These features capture how we speak:
➢ Pitch
➢ Loudness
➢ Rhythm, etc.
Dataset
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset.
● [3] The RAVDESS dataset contains recordings of 24 actors (12 male and 12
female), numbered 01 to 24, speaking with a North American accent.
● All emotional expressions are uttered at two levels of intensity, normal and strong,
except for the ‘neutral’ emotion, which is produced only at normal intensity. The
portion of RAVDESS that we use therefore contains 60 trials for each of the 24
actors, 1440 files in total.
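RAVDESS encodes these properties in each filename as seven two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor), so labels can be recovered with a small parser; the emotion code table below follows the dataset's documentation:

```python
# Emotion codes as documented for the RAVDESS dataset.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess(filename):
    """Extract emotion and actor metadata from a RAVDESS filename
    such as '03-01-06-01-02-01-12.wav'."""
    parts = filename.replace(".wav", "").split("-")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered are female.
        "gender": "female" if int(actor) % 2 == 0 else "male",
    }
```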
[1] Training process workflow
[1] Testing process workflow
3.Classification
● Map the extracted features to the corresponding emotions
Multilayer Perceptron
Multi-Layer Perceptron Classifier
● A multilayer perceptron (MLP) is a class of feedforward
artificial neural network (ANN).
● An MLP consists of at least three layers of nodes: an input
layer, one or more hidden layers, and an output layer.
● MLPs are suitable for classification problems, where each
input is assigned a class or label.
Building the MLP Classifier involves the following steps:
1. Initialisation of the MLP Classifier.
2. Training the neural network.
3. Prediction.
4. Accuracy calculation.
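The four steps above can be sketched with scikit-learn's MLPClassifier. The synthetic features, hidden-layer size, and iteration count below are placeholders for illustration, not the deck's actual configuration:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 40))     # stand-in for 40 MFCC-style features per clip
y = rng.integers(0, 8, size=240)   # 8 emotion labels (0..7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 1. Initialisation of the MLP Classifier
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=0)
# 2. Training the neural network
clf.fit(X_train, y_train)
# 3. Prediction
pred = clf.predict(X_test)
# 4. Accuracy calculation
acc = accuracy_score(y_test, pred)
```

With real RAVDESS features in place of the random array, the same four calls reproduce the training pipeline described above.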
Multi-Layer Perceptron Classifier
Fig:- Multi-Layer Perceptron Classifier
Feature Extraction
From the audio data we extract three key features, namely:
● MFCC (Mel Frequency Cepstral Coefficients)
● Mel Spectrogram
● Chroma
MFCC (Mel Frequency Cepstral Coefficients)
MFCCs are obtained by taking the discrete cosine transform of the log mel-scale spectrum of
each frame; they compactly describe the spectral envelope and are among the most widely
used features in speech processing.
Mel Spectrogram
A Fast Fourier Transform is computed over overlapping windowed segments of the signal,
yielding the spectrogram. A mel spectrogram is simply a spectrogram whose frequency axis
is mapped onto the mel scale.
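A numpy sketch of both ideas: an FFT over overlapping Hann-windowed frames gives the magnitude spectrogram, and the standard formula maps frequency in Hz onto the mel scale (the frame and hop sizes here are illustrative choices):

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def spectrogram(y, n_fft=512, hop=128):
    """Magnitude spectrogram from overlapping Hann-windowed frames."""
    window = np.hanning(n_fft)
    frames = np.array([y[i:i + n_fft] * window
                       for i in range(0, len(y) - n_fft + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time_frames)
```

A mel spectrogram is then obtained by pooling the spectrogram's frequency bins through a bank of triangular filters spaced evenly on the mel axis.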
Chroma
A chroma vector is typically a 12-element feature vector indicating how much of the
signal’s energy falls into each pitch class of the standard chromatic scale.
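A toy numpy illustration of that folding: each spectral component is assigned to one of the 12 pitch classes by its distance in semitones from A4 = 440 Hz (a real chroma implementation, e.g. librosa.feature.chroma_stft, performs this over the full STFT):

```python
import numpy as np

def chroma_from_peaks(freqs_hz, magnitudes):
    """Fold spectral components into a 12-bin pitch-class vector (C=0 ... B=11)."""
    chroma = np.zeros(12)
    for f, m in zip(freqs_hz, magnitudes):
        if f <= 0:
            continue
        midi = 69 + 12 * np.log2(f / 440.0)   # MIDI note number (A4 = 69)
        chroma[int(round(midi)) % 12] += m    # octaves collapse to one class
    return chroma
```

Notice that 440 Hz and 880 Hz (A4 and A5) land in the same bin: chroma is octave-invariant by construction.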
Fig:- MFCC and Chroma feature plots
Accuracy
Classification Matrix
Confusion Matrix
1.angry
2.calm
3.disgust
4.fearful
5.happy
6.neutral
7.sad
8.surprised
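A confusion matrix over these eight labels can be computed with scikit-learn; the ground-truth and predicted values below are fabricated toy data to show the mechanics, not the model's actual output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["angry", "calm", "disgust", "fearful",
          "happy", "neutral", "sad", "surprised"]

# Toy values illustrating the calm/neutral and happy/surprised confusions.
y_true = ["calm", "calm", "happy", "neutral",
          "sad", "surprised", "angry", "fearful"]
y_pred = ["calm", "neutral", "surprised", "calm",
          "sad", "happy", "angry", "fearful"]

# Rows are true labels, columns predictions; the diagonal counts correct cases.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)
```

Off-diagonal mass concentrated in the calm/neutral and happy/surprised cells is exactly the pattern described in the conclusions below.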
● The proposed model achieved an accuracy of 66.67%.
● Calm was the best identified emotion.
● The model gets confused between similar emotions, such as calm/neutral and happy/surprised.
● We tested the model on our own voice file for the sentence “Dogs are sitting by the door” and it
identified the emotion correctly.
Conclusion
Future Work
● The system could take into consideration multiple speakers from different geographic locations
speaking with different accents.
● Though a standard feed-forward MLP is a powerful tool for classification problems, CNN and
RNN models could be trained on larger datasets with more computational power, and the
results compared.
● Studies show that people with autism have difficulty expressing their emotions explicitly;
combining image- and speech-based emotion processing in real time could be of great assistance.
References
[1] Jerry Joy, Aparna Kannan, Shreya Ram, S. Rama. Speech Emotion Recognition using Neural
Network and MLP Classifier. IJESC, April 2020.
[2] Navya Damodar, Vani H Y, Anusuya M A. Voice Emotion Recognition using CNN and
Decision Tree. International Journal of Innovative Technology and Exploring Engineering
(IJITEE), October 2019.
[3] RAVDESS Dataset: https://zenodo.org/record/1188976#.X5r20ogzZPZ
[4] MLP/CNN/RNN Classification:
https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/
[5] MFCC: https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd
