Implementing Automatic Speech Recognition with wav2vec2 on the Common Voice Dataset

Mar 28, 2022 | Educational

Automatic Speech Recognition (ASR) systems have revolutionized how we interact with technology. With models like wav2vec2, you can transcribe speech into text seamlessly. In this article, we will walk you through the process of implementing ASR using the wav2vec2-large-xls-r-300m model, specifically tailored for the Punjabi language (pa-IN) using the Mozilla Foundation’s Common Voice dataset.

Getting Started

Before diving into the implementation, let’s set the stage with a little analogy. Think of the wav2vec2 model as a skilled translator in a busy multilingual café. Just like our translator, this model listens carefully to the spoken words (audio signals) and translates them into written words (text) in the Punjabi language. Our setup will allow this ‘translator’ to work effectively by providing it with the necessary information and data.

Model Information

We will be using the following pre-trained model:

  • Name: wav2vec2-large-xls-r-300m-pa-IN-dx1
  • Dataset: Common Voice 8 (for pa-IN)
  • Test WER: 0.4873
  • Test CER: 0.1687
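
Both WER (word error rate) and CER (character error rate) are normalized edit distances: the minimum number of insertions, deletions, and substitutions needed to turn the model’s transcription into the reference, divided by the reference length (in words or characters respectively). Libraries such as jiwer compute these for you, but a minimal pure-Python sketch makes the definition concrete:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

So a test WER of 0.4873 means that, on average, roughly 49 out of every 100 reference words required an edit, while the much lower CER of 0.1687 shows that most errors are partial-word mistakes rather than entirely wrong words.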

Evaluation Commands

Now that we’re acquainted with our model, let’s talk about how to evaluate it:

python eval.py --model_id DrishtiSharma/wav2vec2-large-xls-r-300m-pa-IN-dx1 --dataset mozilla-foundation/common_voice_8_0 --config pa-IN --split test --log_outputs

Unfortunately, the Punjabi language isn’t available in the speech-recognition-community-v2 development data, so our focus will solely remain on the Common Voice dataset.
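
During evaluation, wav2vec2’s CTC head emits one token prediction per audio frame, and the decoder turns that frame sequence into text by collapsing consecutive repeats and dropping the CTC blank token. The toy sketch below illustrates greedy CTC decoding; the token IDs, blank ID, and vocabulary mapping are illustrative, not the model’s actual vocabulary:

```python
def ctc_greedy_decode(frame_ids, blank_id=0, id_to_char=None):
    """Greedy CTC decoding: collapse repeated frame predictions, drop blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    if id_to_char is not None:
        return "".join(id_to_char[i] for i in out)
    return out

# Frames [blank, 1, 1, blank, 1, 2, 2, blank] decode to the token sequence [1, 1, 2]:
# the repeated 1s separated by a blank survive as two distinct tokens.
```

In practice the `Wav2Vec2Processor` bundled with the model performs this decoding for you; the sketch only shows why repeated characters in a word need a blank between them in the frame sequence.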

Training Hyperparameters

For those who love the nitty-gritty, here are the training hyperparameters that were used:

  • Learning Rate: 0.0003
  • Train Batch Size: 16
  • Eval Batch Size: 8
  • Seed: 42
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • LR Scheduler Type: Linear
  • LR Scheduler Warmup Steps: 1200
  • Number of Epochs: 100
  • Mixed Precision Training: Native AMP
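
As a rough sketch, the hyperparameters above map onto Hugging Face `TrainingArguments` as follows; the output directory name is a placeholder, and Adam’s betas and epsilon are left at the library defaults, which already match the listed values:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="wav2vec2-large-xls-r-300m-pa-IN-dx1",
    learning_rate=3e-4,              # 0.0003
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=1200,
    num_train_epochs=100,
    fp16=True,                       # native AMP mixed-precision training
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer's optimizer default.
```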

Training Results

Here’s how the model fared during training:

Training Loss  Epoch  Step  Validation Loss  WER
3.4607         9.26   500   2.7746           1.0416
0.3442         18.52  1000  0.9114           0.5911
0.2213         27.78  1500  0.9687           0.5751
0.1242         37.04  2000  1.0204           0.5461
0.0998         46.3   2500  1.0250           0.5233
0.0727         55.56  3000  1.1072           0.5382
0.0605         64.81  3500  1.0588           0.5073
0.0458         74.07  4000  1.0818           0.5069
0.0338         83.33  4500  1.0948           0.5108
0.0223         92.59  5000  1.0986           0.4775

Troubleshooting

As you embark on your ASR journey, you might face some common hurdles. Here are a few troubleshooting tips:

  • Model Not Loading: Ensure that the model ID is correctly specified and that you have installed all necessary libraries.
  • Evaluation Errors: Double-check your dataset and paths, verifying that they correspond to the correct configurations.
  • Performance Issues: Experiment with different hyperparameters, such as learning rate or batch size, to find the optimal settings for your dataset.

For further insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
