In the realm of Automatic Speech Recognition (ASR), one of the most exciting developments is the Whisper Small model fine-tuned for the Persian language. This blog will walk you through how to create, train, and evaluate this model using the Mozilla Foundation’s Common Voice dataset. Let’s dive right in!
What is Whisper Small Persian?
The Whisper Small Persian model is an adaptation of the openai/whisper-small model, enhanced through training on the mozilla-foundation/common_voice_11_0 fa dataset.
This model targets efficient and accurate speech recognition in Persian, achieving a Word Error Rate (WER) of around 35.51 on the evaluation set.
Training the Whisper Small Persian Model
When setting up a model like Whisper Small Persian, several hyperparameters and steps are essential to ensure successful training. Here’s a simplified approach:
- Learning Rate: Set to 1e-05; this controls the size of each optimization step the model takes.
- Batch Sizes: Use a train_batch_size of 32 and an eval_batch_size of 16. These determine how many samples the model processes at once.
- Optimizer: Use the Adam optimizer for stable convergence, with the betas and epsilon values from the training configuration.
- Training Steps: Train for 1000 steps in total.
- Mixed Precision Training: Use Native AMP to speed up training and lower memory usage.
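The hyperparameters above can be gathered into a single configuration before launching a run. The sketch below uses a plain dictionary whose keys mirror (but need not match) the fields of Hugging Face’s Seq2SeqTrainingArguments; the Adam betas and epsilon shown are the library’s usual defaults, assumed here since the exact values are not stated:

```python
# Hypothetical training configuration mirroring the hyperparameters above.
# The adam_betas and adam_epsilon values are assumed defaults, not from the source.
training_config = {
    "learning_rate": 1e-05,
    "train_batch_size": 32,
    "eval_batch_size": 16,
    "optimizer": "Adam",
    "adam_betas": (0.9, 0.999),   # assumed default
    "adam_epsilon": 1e-08,        # assumed default
    "max_steps": 1000,
    "mixed_precision": "native_amp",
}

# Quick sanity checks before launching a run
assert 0 < training_config["learning_rate"] < 1e-2
assert training_config["train_batch_size"] > 0
```

Keeping the configuration in one place like this makes it easy to log alongside your results, so every WER you report can be traced back to the exact settings that produced it.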
Step-by-Step Training Process
Here’s an overview of the training process with specific metrics to track:
# Training the model
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

# Load the base model and tokenizer, configured for Persian transcription
model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-small')
tokenizer = WhisperTokenizer.from_pretrained(
    'openai/whisper-small', language='Persian', task='transcribe'
)

# Set training parameters
learning_rate = 1e-05
train_batch_size = 32
eval_batch_size = 16

# Train model (pseudocode: train_model and evaluate_model are placeholder helpers)
for epoch in range(1, num_epochs + 1):
    train_loss = train_model(model, tokenizer, train_data)
    eval_loss, wer = evaluate_model(model, tokenizer, eval_data)
    print(f'Epoch: {epoch}, Train Loss: {train_loss}, Eval Loss: {eval_loss}, WER: {wer}')
Think of training the Whisper Small Persian model like teaching a child to recognize speech. You start with specific lessons (data samples), presented in small groups (batches). Over repeated passes (epochs), the child makes fewer and fewer mistakes repeating back what was said (a falling WER), eventually developing a refined ability to discern spoken Persian.
Evaluating the Model
Once training is complete, you evaluate the model’s performance using key metrics:
- Loss: Indicates how well the model fits the evaluation data; in this case, we aim for a validation loss of around 0.4278.
- Word Error Rate (WER): Lower values indicate better performance. Here, the current WER is approximately 35.51.
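WER is the word-level edit distance between the reference transcript and the model’s hypothesis, divided by the number of reference words and usually expressed as a percentage. A minimal pure-Python implementation is sketched below; in practice you would typically use an established library such as jiwer or evaluate:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length, as a percentage."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j              # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the dog sat"))  # one substitution out of three words
```

A WER of 35.51 therefore means that, on average, roughly 36 word-level edits are needed per 100 reference words to turn the model’s output into the correct transcript.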
Troubleshooting
If you encounter issues during training or evaluation, here are some tips:
- Ensure your dataset is properly formatted and compatible with the model’s requirements.
- Validate if your training parameters (learning rates, batch sizes) are within acceptable ranges.
- Monitor GPU usage and memory; adjust batch sizes if you face out-of-memory errors.
- If the WER is higher than expected, consider re-evaluating your training data quality or increasing training duration.
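For the out-of-memory tip in particular, a common pattern is to halve the batch size and retry until a training step fits. Here is a framework-agnostic sketch; run_step is a hypothetical callable that raises MemoryError when the batch does not fit (with PyTorch on GPU, you would catch torch.cuda.OutOfMemoryError instead):

```python
def fit_batch_size(run_step, batch_size: int, min_batch_size: int = 1) -> int:
    """Halve batch_size until run_step(batch_size) succeeds; return the size that fit."""
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2   # back off and retry with a smaller batch
    raise RuntimeError("Even the minimum batch size does not fit in memory.")

# Simulated example: pretend anything above 8 samples overflows memory
def fake_step(bs):
    if bs > 8:
        raise MemoryError
fit_batch_size(fake_step, 32)  # returns 8
```

If you have to shrink the batch size this way, consider compensating with gradient accumulation so the effective batch size (and therefore the training dynamics) stays close to the original configuration.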
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

