How to Use Fine-tuned Hindi XLSR Wav2Vec2 for Automatic Speech Recognition

Welcome to a user-friendly guide on leveraging the Fine-tuned Hindi XLSR Wav2Vec2 Large model for automatic speech recognition (ASR). This model, trained using the OpenSLR Hindi dataset, is an exciting tool that can help you transcribe spoken Hindi into text seamlessly.

Understanding the Model Setup

Before diving into the practicalities of using this model, let’s unravel its setup with an analogy. Imagine you are trying to teach a child to recognize spoken words. To do so, you would need to provide them with a variety of sounds (like a rich library of audio books). You would also ensure they hear these sounds at the right volume and clarity.

Similarly, our model has been fine-tuned on a dataset that reflects real-world Hindi speech, allowing it to ‘understand’ spoken Hindi effectively. In addition, the training audio was resampled (upsampled) to the 16 kHz rate the model expects, so it learned from consistent, well-conditioned input.

Getting Started with the Model

To utilize the model, you need to follow these simple steps:

  1. Install Required Libraries: Make sure you have PyTorch, Torchaudio, Hugging Face Transformers, and Datasets. You can install them using:

    pip install torch torchaudio transformers datasets

  2. Load the Model: Use the following Python code to load the model, its processor, and a small test set:

    import torch
    import torchaudio
    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Load a small slice of the Hindi Common Voice test split
    test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")
    processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
    model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
    # Common Voice audio is 48 kHz; the model expects 16 kHz input
    resampler = torchaudio.transforms.Resample(48_000, 16_000)

  3. Preprocess the Dataset: Transform the audio files into a format the model can use:

    def speech_file_to_array_fn(batch):
        # Read each audio file and resample it to 16 kHz
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch

    test_dataset = test_dataset.map(speech_file_to_array_fn)

  4. Make Predictions: Now run the model on the first two samples (a sketch for transcribing your own recording follows this list):

    inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    # Greedy CTC decoding: take the most likely token at each timestep
    predicted_ids = torch.argmax(logits, dim=-1)
    print("Prediction:", processor.batch_decode(predicted_ids))
    print("Reference:", test_dataset[:2]["sentence"])

Evaluating the Model

To evaluate the model’s performance, you can compute the Word Error Rate (WER) by following these steps:

  1. Load the Evaluation Metric:

    from datasets import load_metric

    wer = load_metric("wer")

  2. Define and Run an Evaluation Function:

    # Move the model to the GPU so weights and inputs are on the same device
    model.to("cuda")

    def evaluate(batch):
        inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
        # Greedy decoding, as in the prediction step above
        pred_ids = torch.argmax(logits, dim=-1)
        batch["pred_strings"] = processor.batch_decode(pred_ids)
        return batch

    result = test_dataset.map(evaluate, batched=True, batch_size=8)
    print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Troubleshooting

If you encounter any issues during the setup or execution, consider the following troubleshooting tips:

  • Audio Quality: Ensure that your input audio is sampled at 16 kHz. If you run into errors, check the file format and the resampling step (a quick check is sketched after this list).
  • Model Not Loading: Make sure the model identifier is spelled correctly and that you have a stable internet connection to download the weights.
  • CUDA Errors: If you are using a CUDA-enabled GPU, ensure that you have the correct drivers installed and that PyTorch is configured for GPU use.
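For the first point, a short script like the following can confirm the sampling rate before you hand audio to the model; my_clip.wav is a hypothetical path, and the resample is applied only when needed:

    import torchaudio

    speech, sampling_rate = torchaudio.load("my_clip.wav")  # hypothetical path
    print(f"Native sampling rate: {sampling_rate} Hz")
    if sampling_rate != 16_000:
        # Resample so the model always receives 16 kHz input
        speech = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech)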

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

The Fine-tuned Hindi XLSR Wav2Vec2 model is a powerful speech recognition tool for applications such as transcription, voice interfaces, and accessibility. By following this guide, you are well on your way to incorporating ASR into your projects effectively.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
