Welcome to a user-friendly guide to the Fine-tuned Hindi XLSR Wav2Vec2 Large model for automatic speech recognition (ASR). This model was trained on the OpenSLR Hindi dataset and transcribes spoken Hindi into text.
Understanding the Model Setup
Before diving into the practicalities of using this model, let’s unravel its setup with an analogy. Imagine you are trying to teach a child to recognize spoken words. To do so, you would need to provide them with a variety of sounds (like a rich library of audio books). You would also ensure they hear these sounds at the right volume and clarity.
Similarly, our model has been fine-tuned with a dataset that represents real-world Hindi speech, allowing it to ‘understand’ spoken Hindi effectively. Additionally, the audio samples used for training were upsampled so that they match the 16 kHz sampling rate the model expects.
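To make the idea of upsampling concrete, here is a toy sketch in plain Python that raises the sampling rate of a short signal by linear interpolation. It is an illustration only, not the procedure used to prepare the training data: the 8 kHz source rate and the helper name `upsample_linear` are assumptions made for this example, and a real pipeline would normally use a proper resampler such as `torchaudio.transforms.Resample`.

```python
def upsample_linear(samples, src_rate, dst_rate):
    """Resample a list of samples by linear interpolation (toy illustration)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Position of output sample i, measured in source-sample units
        pos = i * src_rate / dst_rate
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# Hypothetical 8 kHz snippet (assumed rate), upsampled to the 16 kHz the model expects
low_rate = [0.0, 1.0, 0.0, -1.0]
high_rate = upsample_linear(low_rate, 8_000, 16_000)
print(len(low_rate), "samples ->", len(high_rate), "samples")
```

Doubling the rate doubles the number of samples while keeping the duration of the clip the same; it does not add detail that was missing from the original recording.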
Getting Started with the Model
To utilize the model, you need to follow these simple steps:
- Install Required Libraries: Make sure you have PyTorch, Torchaudio, Hugging Face Transformers, and Hugging Face Datasets. You can install them using:

    pip install torch torchaudio transformers datasets

- Load the Model: Use the following Python code to load the model and necessary components:

    import torch
    import torchaudio
    from datasets import load_dataset
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # A small slice of the Common Voice Hindi test split, used as sample input
    test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

    processor = Wav2Vec2Processor.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")
    model = Wav2Vec2ForCTC.from_pretrained("shiwangi27/wave2vec2-large-xlsr-hindi")

- Preprocess the Dataset: Transform the audio files into a format the model can use:

    # The resampler assumes 48 kHz input and converts it to the 16 kHz the model expects
    resampler = torchaudio.transforms.Resample(48_000, 16_000)

    def speech_file_to_array_fn(batch):
        # Read the audio file and store it as a resampled 1-D NumPy array
        speech_array, sampling_rate = torchaudio.load(batch["path"])
        batch["speech"] = resampler(speech_array).squeeze().numpy()
        return batch

    test_dataset = test_dataset.map(speech_file_to_array_fn)

- Make Predictions: Now let’s run the model to predict!

    inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

    # Pick the most likely token for each frame, then decode the ids to text
    predicted_ids = torch.argmax(logits, dim=-1)

    print("Prediction:", processor.batch_decode(predicted_ids))
    print("Reference:", test_dataset[:2]["sentence"])
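The last two steps of the prediction code (taking the argmax over the logits, then calling `batch_decode`) amount to greedy CTC decoding. The sketch below shows the core idea in plain Python under simplifying assumptions: the tiny vocabulary, the blank id of 0, and the helper name `greedy_ctc_decode` are all hypothetical, and the real processor additionally handles details such as word delimiters and special tokens.

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks (greedy CTC decoding)."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when the id changes and is not the blank symbol
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens)


# Hypothetical toy vocabulary; the real processor ships its own
vocab = {0: "<pad>", 1: "न", 2: "म", 3: "स", 4: "्", 5: "त", 6: "े"}

# Per-frame argmax ids for one clip, standing in for torch.argmax(logits, dim=-1)
frames = [1, 1, 0, 2, 0, 3, 3, 4, 5, 5, 0, 6]
print(greedy_ctc_decode(frames, vocab))
```

The blank symbol is what lets the model output the same character twice in a row: two identical ids separated by a blank are kept as two characters, while identical ids on adjacent frames are merged into one.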
Evaluating the Model
To evaluate the model’s performance, you can compute the Word Error Rate (WER) by following these steps:
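- Load the Evaluation Metric: The WER metric depends on the jiwer package, which you can install with pip install jiwer.

    # load_metric is deprecated in newer releases of datasets in favor of the separate evaluate package
    from datasets import load_metric

    wer = load_metric("wer")

- Define and Run an Evaluation Function: The function below sends its inputs to the GPU, so the model must be moved to "cuda" first.

    model.to("cuda")

    def evaluate(batch):
        inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

        with torch.no_grad():
            logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

        # Greedy decoding: most likely token per frame, then ids to text
        pred_ids = torch.argmax(logits, dim=-1)
        batch["pred_strings"] = processor.batch_decode(pred_ids)
        return batch

    result = test_dataset.map(evaluate, batched=True, batch_size=8)

    # Report WER as a percentage with two decimal places
    print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))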
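For intuition about what the WER number means, here is a minimal plain-Python sketch of the calculation for a single sentence pair: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference. The function name `word_error_rate` and the example sentences are made up for illustration, and this is not the implementation used by the metric loaded in this guide.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one wrong word out of five reference words
reference = "मैं घर जा रहा हूँ"
hypothesis = "मैं घर जा रही हूँ"
print("WER: {:.2f}".format(100 * word_error_rate(reference, hypothesis)))
```

A lower WER is better: 0 means every reference word was transcribed correctly, and values above 100 are possible when the hypothesis contains many extra words.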
Troubleshooting
If you encounter any issues during the setup or execution, consider the following troubleshooting tips:
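- Audio Sampling Rate: Ensure that the audio you pass to the model is sampled at 16 kHz. If you run into errors, check the file format and the resampling step; the resampler in this guide assumes 48 kHz input.
- Model Not Loading: Make sure that the model identifier (shiwangi27/wave2vec2-large-xlsr-hindi) is spelled correctly and that you have a stable internet connection to download the model.
- CUDA Errors: The evaluation code moves tensors to "cuda", so it requires a CUDA-enabled GPU with the correct drivers installed and a PyTorch build configured for GPU use. On a CPU-only machine, replace "cuda" with "cpu".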
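Two of the checks above can be automated. The following is a minimal sketch, not part of the original model instructions: `wav_sample_rate` and `pick_device` are hypothetical helper names, the sampling-rate check uses the standard-library `wave` module and therefore only works for uncompressed WAV files, and the device check falls back to the CPU when PyTorch is missing or reports no usable GPU.

```python
import wave


def wav_sample_rate(path):
    """Return the sampling rate in Hz of an uncompressed WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to run on a GPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Selected device:", pick_device())
```

With these in place, you could compare `wav_sample_rate(path)` against 16000 before skipping or applying resampling, and use `device = pick_device()` with `model.to(device)` and `inputs.input_values.to(device)` instead of the hard-coded "cuda" string.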
Conclusion
The Fine-tuned Hindi XLSR Wav2Vec2 model is a powerful speech recognition tool that can help with various applications. By following this guide, you are well on your way to incorporating ASR into your projects effectively.