In this blog post, we’ll explore how to leverage the TencentGameMate/chinese-hubert-base pre-trained model to recognize emotions from speech using the CASIA dataset. With the ability to decode six distinct emotions in Mandarin speech (anger, fear, happiness, neutrality, sadness, and surprise), we can unlock the power of AI in understanding emotional context from vocal input!
Prerequisites
- Python 3.x installed on your machine.
- Libraries: librosa, torch, and transformers, which you can install with pip:
pip install librosa torch transformers
Understanding The Code Through Analogy
Think of the emotion recognition process as a seasoned chef preparing a gourmet dish. The chef (the model) uses fresh ingredients (the audio data) and a well-defined recipe (the coding logic) to deliver the taste (emotion). Just as the chef finely chops various vegetables and spices to achieve the perfect flavor, our model analyzes audio samples using intricate calculations to identify emotions.
Here’s how the code flows like a recipe:
- Ingredients Preparation: We load the speech data and process it to ensure it’s in the right format. Just like preparing fresh vegetables before cooking.
- Mixing and Cooking: The model, via a series of layers and functions (like the cooking steps), processes the audio data to extract features and emotions from it.
- Tasting: Finally, we predict which emotion is present in the audio, akin to a chef tasting the dish to ensure it’s made to perfection.
Setting Up the Model
Here’s how to set up the model for speech emotion recognition:
import os
import random
import librosa
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoConfig, Wav2Vec2FeatureExtractor, HubertPreTrainedModel, HubertModel
model_name_or_path = "xmj2002/hubert-base-ch-speech-emotion-recognition"
duration = 6
sample_rate = 16000
config = AutoConfig.from_pretrained(model_name_or_path)
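The prediction code later in this post instantiates a HubertForSpeechClassification class, which is not part of the transformers library itself; it is defined in the model’s repository on the Hugging Face Hub. As a rough orientation, a minimal sketch of such a wrapper could look like the following (the mean-pooling choice, head layout, and dropout attribute are our assumptions, not the repository’s exact code):

```python
import torch
import torch.nn as nn
from transformers import HubertModel, HubertPreTrainedModel


class HubertClassificationHead(nn.Module):
    # A simple classifier head over a pooled HuBERT representation.
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(getattr(config, "final_dropout", 0.1))
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.dropout(x)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)


class HubertForSpeechClassification(HubertPreTrainedModel):
    # Wraps the HuBERT backbone with the classification head above.
    def __init__(self, config):
        super().__init__(config)
        self.hubert = HubertModel(config)
        self.classifier = HubertClassificationHead(config)
        self.init_weights()

    def forward(self, input_values):
        outputs = self.hubert(input_values)
        # Mean-pool the frame-level hidden states over time, then classify.
        pooled = torch.mean(outputs[0], dim=1)
        return self.classifier(pooled)
```

Because the real class ships with the checkpoint, treat this sketch only as a mental model of what `from_pretrained` loads.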
def id2class(id):
    if id == 0:
        return "anger"
    elif id == 1:
        return "fear"
    elif id == 2:
        return "happy"
    elif id == 3:
        return "neutral"
    elif id == 4:
        return "sad"
    else:
        return "surprise"
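If you prefer, the chain of elif branches can be collapsed into a dictionary lookup. This is an equivalent alternative we suggest, not the original script’s code:

```python
# Mapping from class index to emotion label, mirroring id2class above.
ID2CLASS = {
    0: "anger",
    1: "fear",
    2: "happy",
    3: "neutral",
    4: "sad",
}


def id_to_class(idx: int) -> str:
    # Any index outside 0-4 falls back to "surprise", matching the else branch.
    return ID2CLASS.get(idx, "surprise")
```

The dictionary version is easier to keep in sync if you ever retrain with a different label set.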
def predict(path, processor, model):
    speech, sr = librosa.load(path=path, sr=sample_rate)
    speech = processor(speech, padding=True, truncation=True, max_length=duration * sr,
                       return_tensors="pt", sampling_rate=sr).input_values
    with torch.no_grad():
        logit = model(speech)
    score = F.softmax(logit, dim=1).detach().cpu().numpy()[0]
    id = torch.argmax(logit).cpu().numpy()
    print(f"File path: {path} | Predicted Emotion: {id2class(id)} | Confidence Score: {score[id]}")
Training the Model
To fine-tune the model on our CASIA dataset, you would configure various training parameters:
# Training settings (as reported for the fine-tuned checkpoint)
dataset_split_ratio = "60:20:20"  # train : validation : test
seed = 34
batch_size = 36
learning_rate = 2e-4
dropout = 0.1

# Use a lower learning rate for the pre-trained HuBERT backbone than for the
# freshly initialized classifier head, so the parameter groups must be built
# before the optimizer is created.
parameters = []
for name, param in model.named_parameters():
    if "hubert" in name:
        parameters.append({"params": param, "lr": 0.2 * learning_rate})
    else:
        parameters.append({"params": param, "lr": learning_rate})

optimizer = torch.optim.AdamW(parameters, lr=learning_rate, betas=(0.93, 0.98), weight_decay=0.2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.3)
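With the optimizer configured, fine-tuning proceeds as a standard supervised loop. The helper below is a minimal sketch under our own assumptions: the CASIA data loader is not shown, and the function name and epoch structure are illustrative rather than the original training script:

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, criterion, device="cpu"):
    # Standard supervised loop: forward pass, cross-entropy loss, backward, step.
    model.train()
    total_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        logits = model(inputs.to(device))
        loss = criterion(logits, labels.to(device))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Return the mean batch loss so epochs can be compared.
    return total_loss / max(len(loader), 1)
```

After each epoch you would call scheduler.step() and evaluate on the validation split to decide when to stop.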
Testing the Model
The setup allows you to predict the emotion of a random audio sample from the test_data directory by running:
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
# HubertForSpeechClassification is the custom wrapper class defined in the model repository.
model = HubertForSpeechClassification.from_pretrained(model_name_or_path, config=config)
model.eval()
file_paths = [os.path.join("test_data", path) for path in os.listdir("test_data")]
path = random.sample(file_paths, 1)[0]
predict(path, processor, model)
Troubleshooting Tips
If you encounter any issues during setup or execution, check the following:
- Library Import Errors: Ensure all necessary packages are installed and up-to-date.
- Audio File Issues: Verify that the audio files in the test_data directory are correctly formatted and accessible.
- Model Loading Failures: Confirm that the model path is typed correctly and that you have an internet connection to download the model.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Implementing speech emotion recognition using the Hubert model can add significant value to various applications ranging from customer service to mental health assessments. As AI continues to evolve, the ability to comprehend and react to human emotions will pave the way for more empathetic technology!
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

