Welcome to the world of Automatic Speech Recognition (ASR)! This guide walks through fine-tuning the Whisper Small model, as done by Gianluca Ruberto, to transcribe Italian audio. Fine-tuning adapts a pre-trained model so that it becomes more accurate on a specific dataset. Follow along as we break it down step by step!
What is Whisper Small?
The Whisper Small model discussed here is a fine-tuned version of openai/whisper-small, adapted specifically for converting Italian speech into text. Trained on the Common Voice 11.0 dataset, it reaches a word error rate (WER) of approximately 22.1% on the evaluation set. Let’s dive into how you can fine-tune this model yourself.
Getting Started
- Prerequisites: Ensure you have the required libraries installed, namely Transformers, PyTorch, Datasets, and Tokenizers, in the following versions:
  - Transformers 4.26.0.dev0
  - PyTorch 1.12.1+cu113
  - Datasets 2.7.1
  - Tokenizers 0.13.2
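A quick way to compare an environment against the versions listed above is to query the installed package metadata. The following is a minimal sketch using only the standard library; the helper name `check_versions` is hypothetical, and the dictionary keys are the PyPI distribution names (PyTorch is published as `torch`).

```python
from importlib import metadata

# Versions listed in the prerequisites. Keys are PyPI distribution names.
EXPECTED_VERSIONS = {
    "transformers": "4.26.0.dev0",
    "torch": "1.12.1+cu113",
    "datasets": "2.7.1",
    "tokenizers": "0.13.2",
}


def check_versions(expected=EXPECTED_VERSIONS):
    """Return {package: (installed_or_None, expected, matches)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        report[package] = (installed, wanted, installed == wanted)
    return report


if __name__ == "__main__":
    for package, (installed, wanted, ok) in check_versions().items():
        status = "ok" if ok else "differs"
        print(f"{package}: installed={installed} expected={wanted} ({status})")
```

A mismatch is not necessarily a problem, since newer releases often work as well, but it is a good first thing to check when results differ from the ones reported here.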
Training Data and Evaluation
Training uses the first 10% of the training and validation splits of the Italian Common Voice 11.0 dataset. Evaluation likewise uses the first 10% of the test split.
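The Datasets library supports percentage slicing in its split strings, so the subsets described above could be selected roughly as follows. This is a sketch under assumptions: the Hub ID `mozilla-foundation/common_voice_11_0`, the language code `it`, and the choice to concatenate the train and validation slices into one training set are not stated in the text, and the helper names are hypothetical. The dataset may also require accepting its terms and authenticating with the Hub before it can be downloaded.

```python
def split_spec(split: str, percent: int = 10) -> str:
    """Build a slicing string such as 'train[:10%]'."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"{split}[:{percent}%]"


# First 10% of train plus first 10% of validation, joined with '+'.
TRAIN_SPLIT = f"{split_spec('train')}+{split_spec('validation')}"
# First 10% of test, used for evaluation.
TEST_SPLIT = split_spec("test")


def load_italian_common_voice():
    """Download both subsets (needs the `datasets` package and Hub access)."""
    from datasets import load_dataset  # imported lazily: heavy dependency

    name = "mozilla-foundation/common_voice_11_0"  # assumed Hub ID
    train = load_dataset(name, "it", split=TRAIN_SPLIT)
    test = load_dataset(name, "it", split=TEST_SPLIT)
    return train, test
```

Calling `load_italian_common_voice()` triggers the actual download; the split strings themselves can be inspected without any network access.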
Training Procedure
The fine-tuning process follows a specific set of hyperparameters chosen to optimize the model. Think of it as following a recipe to bake a perfect cake: each ingredient must be measured precisely for the best result!
Hyperparameters Overview
- Learning Rate: 1e-05
- Training Batch Size: 16
- Evaluation Batch Size: 8
- Random Seed: 42
- Optimizer: Adam with beta values of (0.9, 0.999) and epsilon 1e-08
- Learning Rate Scheduler Type: linear
- Warmup Steps for Scheduler: 500
- Total Training Steps: 4000
- Mixed Precision Training: Native AMP
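To make these settings concrete, the sketch below writes them as keyword arguments using the argument names of the Transformers `Seq2SeqTrainingArguments` class, and adds a small pure-Python function that mirrors how a linear schedule with warmup behaves: the learning rate rises linearly from 0 to 1e-05 over the first 500 steps, then falls linearly to 0 at step 4000. Mapping the batch sizes to the `per_device_*` arguments assumes single-device training, and `linear_schedule_lr` is a hypothetical helper for illustration, not the library's own scheduler.

```python
# Hyperparameters from the list above, keyed by the matching
# Seq2SeqTrainingArguments argument names (single-device assumption).
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "max_steps": 4000,
    "fp16": True,  # native AMP mixed precision
}


def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=4000):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 250, 500, 2250, 4000):
        print(step, linear_schedule_lr(step))
```

With Transformers installed, the dictionary can be passed on as `Seq2SeqTrainingArguments(output_dir="./whisper-small-it", **TRAINING_KWARGS)`, where the output directory name is only a placeholder.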
Training Results
Here is a snapshot of the training results, recorded every 1000 steps:
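| Training Loss | Epoch | Step | Validation Loss | WER (%) |
|---------------|-------|------|-----------------|---------|
| 0.2545        | 0.95  | 1000 | 0.3872          | 24.8891 |
| 0.129         | 1.91  | 2000 | 0.3682          | 22.1991 |
| 0.0534        | 2.86  | 3000 | 0.3771          | 22.4695 |
| 0.0302        | 3.82  | 4000 | 0.3940          | 22.1090 |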
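For readers unfamiliar with the WER column: word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of words in the reference. The sketch below is a minimal pure-Python version for illustration, expressed as a percentage to match the results above. It is not the evaluation code used to produce these results, and it applies no text normalization (casing, punctuation), which real evaluations often do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # hypothesis empty: i reference words were deleted
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing in hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


if __name__ == "__main__":
    # One of five reference words is missing from the transcript.
    print(word_error_rate("il gatto dorme sul divano", "il gatto dorme divano"))  # 20.0
```

A WER of about 22% therefore means that roughly one word in five would need to be corrected to match the reference transcript.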
Troubleshooting
As you embark on the journey of fine-tuning, you might encounter some bumps along the road. Here are some troubleshooting tips:
- If you notice high WER values, consider adjusting the learning rate or increasing the training steps.
- In case of unexpected errors, ensure that your environment meets the specified version requirements for the libraries.
- Running out of memory during training? Try reducing the batch size or using mixed precision training to lower memory use.
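As an illustration of the memory tip, the sketch below lowers the per-device batch size while keeping the effective batch size at 16 through gradient accumulation. Gradient accumulation is not part of the original recipe; it is an assumption added here as one common way to reduce the batch size without changing how many samples contribute to each optimizer step. The helper name is hypothetical, the keys follow the Transformers `TrainingArguments` argument names, and single-device training is assumed.

```python
def memory_friendly_settings(effective_batch_size=16, per_device_batch_size=8):
    """Trade a smaller per-step batch for more accumulation steps."""
    if per_device_batch_size <= 0:
        raise ValueError("per_device_batch_size must be positive")
    if effective_batch_size % per_device_batch_size != 0:
        raise ValueError("effective batch size must be a multiple of the per-device size")
    return {
        "per_device_train_batch_size": per_device_batch_size,
        # Gradients are summed over this many forward/backward passes
        # before each optimizer step (assumption: single device).
        "gradient_accumulation_steps": effective_batch_size // per_device_batch_size,
        "fp16": True,  # mixed precision, as in the original recipe
    }


if __name__ == "__main__":
    print(memory_friendly_settings())
```

Halving the per-device batch size again (for example to 4) doubles the accumulation steps, at the cost of slower training.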
Conclusion
Fine-tuning the Whisper Small model can significantly improve its performance on Italian speech. By choosing the hyperparameters carefully and training on a well-selected dataset, you’re well on your way to building a robust ASR system.

