Welcome to the fascinating world of speech emotion recognition! In this blog, we will guide you through the process of building and training a Speech Emotion Recognition (SER) system. This tool can recognize human emotions from speech, which is not only a technological marvel but also serves various industries from product recommendations to affective computing. Let’s dive in!
1. Introduction
The Speech Emotion Recognition system trains machine learning and deep learning models to detect emotions in human speech. The recognized emotions are neutral, calm, happy, sad, angry, fear, disgust, pleasant surprise, and boredom.
2. Requirements
- Python 3.6+
- Python packages (pinned versions from requirements.txt):
  - tensorflow
  - librosa==0.6.3
  - numpy
  - pandas
  - soundfile==0.9.0
  - wave
  - scikit-learn==0.24.2
  - tqdm==4.28.1
  - matplotlib==2.2.3
  - pyaudio==0.2.11
- [ffmpeg](https://ffmpeg.org) (optional) – used to manipulate audio files when necessary.
Install these libraries with:

```
pip3 install -r requirements.txt
```
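After installing, a quick sanity check (a minimal snippet, nothing repo-specific) confirms the core packages import and shows their versions:

```python
# Verify that the core audio and ML packages are importable.
import librosa
import soundfile
import sklearn
import tensorflow

print('librosa', librosa.__version__)
print('soundfile', soundfile.__version__)
print('scikit-learn', sklearn.__version__)
print('tensorflow', tensorflow.__version__)
```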
3. Collecting the Dataset
The repository draws on four datasets, stored in the data folder:
- RAVDESS: the Ryerson Audio-Visual Database of Emotional Speech and Song (24 professional actors; covers neutral, calm, happy, sad, angry, fearful, disgust, and surprised)
- TESS: the Toronto Emotional Speech Set (two actresses; includes the "pleasant surprise" class)
- EMO-DB: the Berlin Database of Emotional Speech (German recordings; includes the "boredom" class)
- Custom: a small set of custom recordings included with the repository
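As a taste of what working with these datasets looks like, here is a small sketch that maps RAVDESS filename codes to emotion labels. RAVDESS encodes the emotion in the third field of each filename; note that mapping code 08 to "pleasant surprise" follows this repository's grouping, since RAVDESS itself labels that class "surprised":

```python
# RAVDESS files are named like "03-01-06-01-02-01-12.wav";
# the third dash-separated field is the emotion code.
RAVDESS_EMOTIONS = {
    '01': 'neutral', '02': 'calm', '03': 'happy', '04': 'sad',
    '05': 'angry', '06': 'fear', '07': 'disgust', '08': 'pleasant surprise',
}

def label_from_ravdess(filename):
    code = filename.split('-')[2]
    return RAVDESS_EMOTIONS.get(code, 'unknown')

print(label_from_ravdess('03-01-06-01-02-01-12.wav'))  # -> fear
```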
4. Feature Extraction
Feature extraction is akin to taking the fingerprints of emotions from speech. Just as fingerprints uniquely identify a person, audio features that capture emotional cues let the system recognize which emotion is being expressed. This repository extracts features with the librosa library (a code sketch follows the list), such as:
- MFCC (Mel-frequency cepstral coefficients)
- Chromagram
- Mel spectrogram
- Spectral contrast
- Tonnetz (tonal centroid features)
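To make the list concrete, here is a minimal sketch of how such features can be computed with librosa; the repository has its own extraction helper, which may differ in details:

```python
import numpy as np
import librosa

def extract_features(path):
    # Load the audio at its native sample rate.
    X, sample_rate = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(X))
    # Average each feature over time so every file yields a fixed-length vector.
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T, axis=0)
    return np.hstack([mfccs, chroma, mel, contrast, tonnetz])
```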
5. Building the Model
The excitement of building models lies in trial and error, much like baking a cake: you have to get the ingredients (features) just right for it to rise and taste good (accurate predictions).
Example 1: Using 3 Emotions
Here’s a simple way to build and train a model for recognizing three emotions (sad, neutral, happy):
```python
from emotion_recognition import EmotionRecognizer
from sklearn.svm import SVC

# Any scikit-learn estimator can be plugged in; here, a support vector classifier.
my_model = SVC()

# Restrict training to three emotions; balance=True evens out samples per class.
rec = EmotionRecognizer(model=my_model, emotions=['sad', 'neutral', 'happy'], balance=True, verbose=0)
rec.train()

print('Test score:', rec.test_score())
print('Train score:', rec.train_score())
```
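Once trained, the recognizer can classify a new recording. The path below is purely hypothetical; point it at any 16 kHz mono WAV file:

```python
# Predict the emotion of a single audio file (hypothetical path).
print('Prediction:', rec.predict('data/validation/example.wav'))
```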
Determining the Best Model
To find the best model, you can load the pre-trained estimators and check their accuracy:
```python
# Evaluate the bundled pre-trained estimators and keep the best performer.
rec.determine_best_model()

print(rec.model.__class__.__name__, 'is the best')
print('Test score:', rec.test_score())
```
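It is also worth inspecting where the winning model gets confused. Assuming your copy of the repository exposes the confusion-matrix helper, the call looks like this:

```python
# Rows are true emotions, columns are predictions (as percentages).
print(rec.confusion_matrix(percentage=True, labeled=True))
```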
6. Troubleshooting
If you encounter issues, consider the following:
- Ensure all required packages are installed.
- Verify that your audio files are in the expected format (16 kHz sample rate, mono channel); a conversion sketch follows this list.
- Check that ffmpeg is installed and added to your PATH.
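If a file is in the wrong format, you can resample it to 16 kHz mono with the libraries already installed; a minimal sketch with hypothetical file names:

```python
import librosa
import soundfile as sf

# Load at 16 kHz, downmixing to mono, then write the converted copy.
y, sr = librosa.load('input.wav', sr=16000, mono=True)
sf.write('input_16k_mono.wav', y, sr)
```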
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
7. Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Now you have the tools to embark on your journey into Speech Emotion Recognition! Happy coding!

