How to Fine-Tune and Evaluate the Hindi XLSR Wav2Vec2 Model

In the ever-evolving landscape of Automatic Speech Recognition (ASR), fine-tuning robust models to different languages is vital. Today, we’ll explore how to fine-tune the XLSR Wav2Vec2 Large model for Hindi using the OpenSLR dataset and evaluate its performance using the Common Voice Hindi dataset.

Understanding the Model and Dataset

This project focuses on the Hindi language, leveraging the XLSR Wav2Vec2 architecture. The model comes pre-trained, but it must be adapted to the nuances of Hindi. In this setup it is:

  • fine-tuned on the OpenSLR Hindi dataset (10,000 training samples)
  • evaluated on the Common Voice Hindi dataset

After fine-tuning, our model achieved a Word Error Rate (WER) of 46.05%, meaning roughly 46 word errors (substitutions, deletions, or insertions) for every 100 reference words. This is a metric we aim to improve in future iterations.
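
WER counts word-level substitutions, deletions, and insertions relative to the number of words in the reference transcript. As a toy illustration (unrelated to this model's output), the wer metric from the datasets library can be used directly:

from datasets import load_metric

wer = load_metric("wer")
# 1 substitution + 2 deletions against a 6-word reference -> WER = 3/6 = 0.5
print(wer.compute(predictions=["the cat sat down"], references=["the cat sat on the mat"]))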

Analogy: Training a Language Model is Like Learning a New Language

Imagine you’re learning a new language. Initially, you might rely heavily on a textbook (our pre-trained model) that provides a solid foundation. However, to gain fluency, you need to practice speaking with native speakers (fine-tuning with a specific dataset). Over time, you start to sound more natural, just as our model becomes better at recognizing Hindi speech by being exposed to more Hindi audio samples in training.

Steps for Fine-Tuning the Model

Now that we’ve grasped the basics, here’s how you can fine-tune and evaluate the model:

1. Load Necessary Libraries

First, make sure torch, torchaudio, transformers, and datasets are installed (for example via pip), then import the required pieces:

import torch
import torchaudio
from datasets import load_dataset, load_metric  # load_metric provides the WER metric used later
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

2. Prepare Your Dataset

Load the Common Voice Hindi test split for evaluation (here only the first 2% of it, to keep the example quick):

test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")
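
As an optional sanity check, you can confirm the split loaded correctly and peek at a reference transcription (sentence is the transcript column in the Common Voice schema):

print(test_dataset)                   # shows the columns and number of rows
print(test_dataset[0]["sentence"])    # one reference transcription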

3. Preprocess the Data

Common Voice clips are recorded at 48 kHz, so the audio needs to be downsampled to the 16 kHz the model expects:

resampler = torchaudio.transforms.Resample(48_000, 16_000)

Define a function that loads each audio file and converts it into a 16 kHz speech array:

def speech_file_to_array_fn(batch):
    # load the clip, then resample it from 48 kHz to 16 kHz
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch
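
Then apply it across the test split so every example gains a speech array (a standard datasets map call):

test_dataset = test_dataset.map(speech_file_to_array_fn)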

4. Using the Model for Predictions

Now, load the fine-tuned model and processor and run predictions on a couple of samples:

model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")

# turn the first two speech arrays into padded model inputs
inputs = processor(test_dataset[:2]['speech'], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# greedy decoding: take the most likely token at every time step
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))

Troubleshooting Tips

If you run into issues while fine-tuning or evaluating:

  • Ensure that your audio input is sampled at 16 kHz, as required by the model (see the sketch after this list).
  • Check if all required libraries are properly installed in your environment.
  • If you encounter memory errors, consider using a smaller batch size or a less complex model.
  • Keep an eye on the WER; if it remains high, you may need to revisit the fine-tuning data or training parameters.
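
If you are unsure of a clip's native sampling rate, a small helper like the sketch below can check and resample it on the fly (an illustrative function, not part of the original pipeline, reusing the torchaudio import from step 1):

def load_as_16k(path):
    # torchaudio.load returns the waveform and the file's native sampling rate
    speech, sr = torchaudio.load(path)
    if sr != 16_000:
        # resample to the 16 kHz the model expects
        speech = torchaudio.transforms.Resample(sr, 16_000)(speech)
    return speech.squeeze().numpy()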

Evaluating the Model Performance

To evaluate the model, map an inference function over the test split and compute the Word Error Rate with the wer metric.
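
A minimal sketch of such an evaluate function, assuming the processor, model, and preprocessed test_dataset from the previous steps (inference on CPU; move the model and tensors to GPU if one is available):

def evaluate(batch):
    # turn a batch of speech arrays into padded model inputs
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    # greedy decoding of each sample in the batch
    batch["pred_strings"] = processor.batch_decode(torch.argmax(logits, dim=-1))
    return batch

With evaluate defined, the WER over the split can be computed as follows: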

# run evaluate over the split in batches of 8, then score predictions against references
wer = load_metric("wer")
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))

Conclusion

Fine-tuning a model like Wav2Vec2 for the Hindi language serves as a practical entry point into the realm of ASR. As with any learning process, it is essential to continuously refine your approach based on performance metrics.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
