How to Fine-Tune Your Speech Recognition Model Using Whisper and Common Voice Dataset

Sep 2, 2023 | Educational

In the world of artificial intelligence, fine-tuning a model for automatic speech recognition (ASR) using a specific language can be a game-changer. In this article, we will guide you through the process of fine-tuning the Whisper models on the Cantonese dataset from Mozilla’s Common Voice. This method allows you to build a customized model suitable for your needs. Let’s dive in!

Getting Started

Before we begin, ensure you have the following prerequisites:

  • Access to a GPU (RTX 3090 recommended)
  • Python installed in your environment
  • Necessary libraries such as Hugging Face Transformers
  • Data from Mozilla’s Common Voice 11.0
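Before moving on, it helps to sanity-check that the libraries are importable. The package list below is an assumption based on the prerequisites above (Hugging Face Transformers plus its usual companions); adjust it to match your own setup.

```python
import importlib.util

# Packages this walkthrough relies on -- an assumed list; edit as needed.
REQUIRED = ["transformers", "datasets", "torch"]

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, install it before proceeding to the fine-tuning steps.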

Understanding the Models

We’ll explore different Whisper models, how to fine-tune them, and compare their performance. Think of this process like tuning a set of musical instruments; each model represents a different instrument, and with fine-tuning, we are adjusting them to harmonize perfectly.

Model Overview

Here is a quick comparison of the models we will analyze:

| Model Name                      | Parameters | Fine-tune Steps | Time Spent | Training Loss | Validation Loss | CER % |
|---------------------------------|------------|-----------------|------------|---------------|-----------------|-------|
| whisper-tiny-cantonese          | 39 M       | 3200            | 4h 34m     | 0.0485        | 0.771           | 11.10 |
| whisper-base-cantonese          | 74 M       | 7200            | 13h 32m    | 0.0186        | 0.477           | 7.66  |
| whisper-small-cantonese         | 244 M      | 3600            | 6h 38m     | 0.0266        | 0.137           | 6.16  |
| whisper-small-lora-cantonese    | 3.5 M      | 8000            | 21h 27m    | 0.0687        | 0.382           | 7.40  |
| whisper-large-v2-lora-cantonese | 15 M       | 10000           | 33h 40m    | 0.0046        | 0.277           | 3.77  |

Fine-Tuning Process

To efficiently fine-tune these models, you will need to follow these steps:

  1. Prepare your training data from the Common Voice corpus.
  2. Choose the model that best suits your needs based on the performance comparison.
  3. Set up your training environment, ensuring that the necessary libraries (like transformers) are installed.
  4. Adjust the hyperparameters such as learning rate, batch size, and fine-tune steps according to your dataset size.
  5. Run the training script and monitor the training and validation loss.
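As a concrete starting point, the whisper-small run from the table above can be captured in a configuration like this. Only `max_steps` comes from the table; the learning rate, batch size, and evaluation interval are illustrative assumptions to tune against your own dataset. Most keys mirror Hugging Face `Seq2SeqTrainingArguments` parameter names.

```python
# Fine-tuning configuration sketch for the whisper-small run.
# Values marked "assumed" are illustrative, not confirmed by the article.
training_config = {
    "model_name": "openai/whisper-small",
    "max_steps": 3600,                  # "Fine-tune Steps" from the table above
    "learning_rate": 1e-5,              # assumed; lower it if validation loss diverges
    "per_device_train_batch_size": 16,  # assumed; reduce if GPU memory runs out
    "gradient_accumulation_steps": 1,   # raise this when you shrink the batch size
    "eval_steps": 400,                  # assumed evaluation interval
    "metric": "cer",                    # Character Error Rate, as in the tables
}
```

Feed these values into your training script, and monitor both losses as described in step 5.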

Troubleshooting

During the fine-tuning process, you may face some challenges. Here are a few common issues and how to resolve them:

  • High Validation Loss: Lower your learning rate or increase the number of fine-tune steps.
  • Memory Errors: If you’re running out of GPU memory, reduce the per-device batch size (compensating with gradient accumulation), or switch to a smaller model or a LoRA variant.
  • Unexpected Output Quality: Double-check your dataset for any inconsistencies or errors that may affect model performance.
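For the memory-error case above, gradient accumulation lets you shrink the per-device batch while keeping the effective batch size, and therefore the optimization behavior, unchanged. A minimal sketch:

```python
def accumulation_steps(target_batch: int, per_device_batch: int) -> int:
    """Gradient-accumulation steps needed so that
    per_device_batch * steps == target_batch."""
    if target_batch % per_device_batch != 0:
        raise ValueError("target batch must be a multiple of the per-device batch")
    return target_batch // per_device_batch

# Keep an effective batch of 32 while fitting only 8 samples in GPU memory:
steps = accumulation_steps(target_batch=32, per_device_batch=8)
print(steps, "accumulation steps per optimizer update")
```

Gradients are then summed over that many small batches before each optimizer update, trading throughput for memory.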

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Performance Evaluation

The models are evaluated using the Character Error Rate (CER); the table below shows how each one performed without fine-tuning and with joint fine-tuning:

| Model Name                      | Original CER % | w/o Fine-tune CER % | Jointly Fine-tuned CER % |
|---------------------------------|----------------|---------------------|--------------------------|
| whisper-tiny-cantonese          | 124.03         | 66.85               | 35.87                    |
| whisper-base-cantonese          | 78.24          | 61.42               | 16.73                    |
| whisper-small-cantonese         | 52.83          | 31.23               | -                        |
| whisper-small-lora-cantonese    | 37.53          | 19.38               | 14.73                    |
| whisper-large-v2-lora-cantonese | 37.53          | 19.38               | 9.63                     |

As observed, fine-tuning significantly reduces the CER values for most models, showcasing improved accuracy. The whisper-large-v2-lora-cantonese model performs remarkably well with a final CER of 9.63%.
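For reference, CER is the character-level Levenshtein (edit) distance between the model output and the reference transcript, divided by the reference length. Libraries such as `evaluate` or `jiwer` are typically used in practice; this pure-Python sketch just makes the metric concrete.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance between the two strings,
    normalised by the reference length."""
    h_len = len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(h_len + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i] + [0] * h_len
        for j, hc in enumerate(hypothesis, start=1):
            cost = 0 if rc == hc else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[h_len] / len(reference)

# One substituted character out of three:
print(f"{cer('廣東話', '廣東合') * 100:.2f}% CER")
```

A CER above 100%, as in the tiny model's original score, simply means the output required more edits than the reference has characters.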

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Conclusion

Fine-tuning Whisper models on the Common Voice dataset can lead to impressive results in speech recognition for the Cantonese language. Following the guidelines outlined in this article should help you embark on your journey of enhancing ASR models. Good luck and happy coding!
