How to Fine-Tune the Whisper Small Model for Italian Speech Recognition

Dec 9, 2022 | Educational

Welcome to the world of Automatic Speech Recognition (ASR)! Today we walk through fine-tuning OpenAI's Whisper Small model to transcribe Italian audio, using the fine-tuned version developed by Gianluca Ruberto as our reference. Fine-tuning adapts a pre-trained model to improve its accuracy on a specific dataset. Follow along as we break it down step by step!

What is Whisper Small?

The model discussed here is a fine-tuned version of openai/whisper-small that converts Italian speech into text. Trained on the Common Voice 11.0 dataset, it reaches a word error rate (WER) of approximately 22.1% on the evaluation set. Let's dive into how you can fine-tune this model yourself.

Getting Started

Prerequisites: Make sure the required libraries are installed, namely Transformers, PyTorch, Datasets, and Tokenizers. The versions used were:

Transformers 4.26.0.dev0
PyTorch 1.12.1+cu113
Datasets 2.7.1
Tokenizers 0.13.2
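
One way to confirm that an environment matches these versions is to compare them against what is installed. The sketch below uses only the standard library. The helper names and the PyPI package names (transformers, torch, datasets, tokenizers) are assumptions for illustration, and a development build such as 4.26.0.dev0 typically has to be installed from source rather than from PyPI.

```python
from importlib import metadata

# Versions listed in this article; the keys are the assumed PyPI package names.
REQUIRED = {
    "transformers": "4.26.0.dev0",
    "torch": "1.12.1+cu113",
    "datasets": "2.7.1",
    "tokenizers": "0.13.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{package}: article used {wanted}, found {found or 'not installed'}")
```

A mismatch is not necessarily a problem, since newer releases often work, but it is a useful first thing to check when an error appears.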

Training Data and Evaluation

Training uses the first 10% of the training and validation splits of the Italian Common Voice 11.0 dataset. Evaluation likewise uses the first 10% of the test split.
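
These 10% subsets can be expressed with the split-slicing syntax of the Datasets library. The following is a minimal sketch under assumptions: the Hub dataset ID mozilla-foundation/common_voice_11_0, the "it" configuration name, and the helper names are not stated in this article. The dataset is gated, so loading it requires accepting its terms on the Hugging Face Hub and being logged in.

```python
def first_percent(split_names, percent=10):
    """Build a split expression selecting the first `percent` of each split."""
    return "+".join(f"{name}[:{percent}%]" for name in split_names)


TRAIN_SPLIT = first_percent(["train", "validation"])  # training data
EVAL_SPLIT = first_percent(["test"])                  # evaluation data


def load_italian_subsets(dataset_id="mozilla-foundation/common_voice_11_0"):
    """Load the train and eval subsets (assumed dataset ID; needs Hub access)."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    # `use_auth_token` matches Datasets 2.7.1; newer releases use `token` instead.
    train = load_dataset(dataset_id, "it", split=TRAIN_SPLIT, use_auth_token=True)
    evaluation = load_dataset(dataset_id, "it", split=EVAL_SPLIT, use_auth_token=True)
    return train, evaluation
```

Building the split strings separately keeps the subset definition easy to inspect without downloading any audio.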

Training Procedure

The fine-tuning run uses a fixed set of hyperparameters. Think of it as following a recipe to bake a cake: each ingredient must be measured precisely for the best result!

Hyperparameters Overview

Learning Rate: 1e-05
Training Batch Size: 16
Evaluation Batch Size: 8
Random Seed: 42
Optimizer: Adam with beta values of (0.9, 0.999) and epsilon 1e-08
Learning Rate Scheduler Type: linear
Warmup Steps for Scheduler: 500
Total Training Steps: 4000
Mixed Precision Training: Native AMP
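
As a rough sketch, these values map onto keyword arguments of Seq2SeqTrainingArguments in Transformers. The article does not show the training script, so the choice of that class, the output_dir value, the helper names, and the single-device reading of the batch sizes are assumptions. The second function reproduces the shape of a linear schedule with warmup in plain Python so the learning rate at any step can be inspected.

```python
# Hyperparameters from the list above, keyed by Seq2SeqTrainingArguments names.
TRAINING_KWARGS = {
    "output_dir": "./whisper-small-it",  # hypothetical path
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,   # assumes a single device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "max_steps": 4000,
    "fp16": True,  # native AMP mixed precision; needs a CUDA GPU
}


def build_training_args(**overrides):
    """Create the arguments object (requires `transformers` and, for fp16, a GPU)."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})


def linear_lr(step, peak=1e-5, warmup=500, total=4000):
    """Learning rate at `step`: linear ramp to `peak`, then linear decay to 0."""
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total - step) / max(1, total - warmup))
```

With these settings the learning rate peaks at 1e-05 at step 500 and falls back to zero at step 4000.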

Training Results

Here is a snapshot of the training results, recorded every 1000 steps:

Training Loss    Epoch    Step    Validation Loss    WER (%)
0.2545           0.95     1000    0.3872             24.8891
0.129            1.91     2000    0.3682             22.1991
0.0534           2.86     3000    0.3771             22.4695
0.0302           3.82     4000    0.3940             22.1090
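
The WER column is the word error rate as a percentage: the number of word-level substitutions, deletions, and insertions divided by the number of reference words. The article does not say which tool produced these numbers, so the following is a standalone sketch of the metric rather than the evaluation code that was used. The function name and sample sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # ref word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("il gatto dorme sul divano", "il gatto dorme sul letto"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.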

Troubleshooting

As you embark on the journey of fine-tuning, you might encounter some bumps along the road. Here are some troubleshooting tips:

If you notice high WER values, consider adjusting the learning rate or increasing the training steps.
In case of unexpected errors, ensure that your environment meets the specified version requirements for the libraries.
Running out of memory during training? Try reducing the batch size or utilizing mixed precision training to optimize resource usage.
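
If the batch size is lowered to fit in memory, gradient accumulation is one common way to keep the effective batch size at 16. This technique is not part of the training setup described in this article, so treat the sketch below as an illustration. The helper name is hypothetical, and the returned keys are the per_device_train_batch_size, gradient_accumulation_steps, and fp16 training arguments in Transformers.

```python
def memory_friendly_batch(effective_batch_size=16, per_device_batch_size=4):
    """Return training-argument overrides that preserve the effective batch size."""
    if per_device_batch_size < 1:
        raise ValueError("per_device_batch_size must be at least 1")
    if effective_batch_size % per_device_batch_size != 0:
        raise ValueError("per-device size must divide the effective batch size")
    return {
        "per_device_train_batch_size": per_device_batch_size,
        # Gradients from this many small batches are accumulated per update.
        "gradient_accumulation_steps": effective_batch_size // per_device_batch_size,
        "fp16": True,  # mixed precision, as in the original setup
    }
```

For example, a per-device batch size of 4 with 4 accumulation steps processes 16 samples per optimizer update, at the cost of slower steps.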


Conclusion

Fine-tuning the Whisper Small model can significantly improve its performance on Italian. By tuning the hyperparameters and training on a well-chosen dataset, you are well on your way to building a robust ASR system.

