How to Fine-tune Wav2Vec2 Large XLSR-53 for Hindi and Marathi Speech Recognition


If you’re looking to harness the power of Automatic Speech Recognition (ASR) for Hindi and Marathi, this guide walks you through fine-tuning and using the Wav2Vec2 XLSR-53 model with OpenSLR datasets, specifically SLR64. Let’s dive into the setup, installation, and usage.

Step 1: Setup and Installation

To start your journey, you need to install the required libraries. Begin by executing the following command:

bash
pip install git+https://github.com/huggingface/transformers.git datasets librosa torch==1.7.0 torchaudio==0.7.0 jiwer
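
To confirm the installation succeeded, you can print the installed versions (a quick sanity check, not part of the original steps):

python
import torch
import torchaudio
import transformers
import librosa

# Verify the pinned versions resolved correctly
print(torch.__version__)        # expected: 1.7.0
print(torchaudio.__version__)   # expected: 0.7.0
print(transformers.__version__) # a development build installed from git
print(librosa.__version__)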

Step 2: Download Evaluation Datasets

Next, we need to download the evaluation datasets for Hindi and Marathi. The archives are password-protected, so the password is passed to unzip with the -P flag. Run the commands below:

bash
wget https://www.openslr.org/resources/103/Marathi_test.zip -P data/marathi
unzip -P "K3[2?do9" data/marathi/Marathi_test.zip -d data/marathi
tar -xzf data/marathi/Marathi_test.tar.gz -C data/marathi

wget https://www.openslr.org/resources/103/Hindi_test.zip -P data/hindi
unzip -P "w9I23B*" data/hindi/Hindi_test.zip -d data/hindi
tar -xzf data/hindi/Hindi_test.tar.gz -C data/hindi

wget -O test.csv https://filebin.net/snrz6bt13usv8w2e/test_large.csv?t=ps3n99ho
# If the download does not work, paste this link in your browser: https://filebin.net/snrz6bt13usv8w2e/test_large.csv
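
Once the CSV is in place, you can inspect it to confirm the download worked (an optional check; we print the column names rather than assume a particular schema):

python
import pandas as pd

# Peek at the evaluation manifest
df = pd.read_csv("test.csv")
print(df.columns.tolist())
print(df.head())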

Step 3: Usage of the Model

Now that we have everything set up, let’s go over how to utilize the model for speech recognition.

First, let’s import the necessary libraries:

python
# re and numpy are used by the preprocessing function below
import re
import numpy as np
import torch
import torchaudio
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

Next, load the WER metric, the processor, and the model (the .to('cuda') call assumes a CUDA-capable GPU is available):

python
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("tanmaylaud/wav2vec2-large-xlsr-hindi-marathi")
model = Wav2Vec2ForCTC.from_pretrained("tanmaylaud/wav2vec2-large-xlsr-hindi-marathi").to('cuda')
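
If you are running on a machine without a GPU, a common pattern is to select the device dynamically (a minimal sketch, not part of the original snippet):

python
# Fall back to CPU when CUDA is unavailable
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Wav2Vec2ForCTC.from_pretrained("tanmaylaud/wav2vec2-large-xlsr-hindi-marathi").to(device)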

Understanding the Preprocessing Phase

Think of the preprocessing phase as preparing all your ingredients before cooking a recipe. You want everything to be in order before you actually start the cooking process (i.e., running your model).

In our case, we need to read each audio file into an array and convert it to a consistent format (mono, 16 kHz). The function below also strips punctuation from the transcript; the chars_to_ignore_regex pattern shown is an example definition, so adjust it to your dataset:

python
# Example punctuation-stripping pattern; adjust to your dataset's transcripts
chars_to_ignore_regex = '[,?.!;:"-]'

def speech_file_to_array_fn(batch):
    # Remove ignored characters from the transcript
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"])
    # Load the audio and keep the first channel as a NumPy array
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = speech_array[0].numpy()
    batch["target_text"] = batch["sentence"]
    # Resample to the 16 kHz rate the model expects
    batch["speech"] = librosa.resample(np.asarray(batch["speech"]), orig_sr=sampling_rate, target_sr=16_000)
    batch["sampling_rate"] = 16_000
    return batch
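
To sanity-check the function on a single row before mapping it over the whole dataset, you can feed it a dictionary with "path" and "sentence" keys (the file path here is hypothetical):

python
# Hypothetical example row; replace the path with one of your own files
example = {"path": "data/hindi/audio/sample.wav", "sentence": "नमस्ते"}
example = speech_file_to_array_fn(example)
print(example["sampling_rate"], example["speech"].shape)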

Step 4: Making Predictions

After preprocessing, the next step is to make predictions with the model. First load the test.csv manifest from Step 2 into a dataset, then map the preprocessing function over it:

python
# Load the evaluation manifest downloaded in Step 2
test_data = load_dataset("csv", data_files={"test": "test.csv"})["test"]
test_data = test_data.map(speech_file_to_array_fn)
inputs = processor(test_data["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    # Keep the inputs on the same device as the model
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_data["target_text"][:2])

Troubleshooting Tips

If you encounter any issues during this process, consider these troubleshooting ideas:

  • Ensure all dependencies are properly installed. Missing libraries can lead to errors.
  • Check your dataset paths and ensure files are accessible and correctly named.
  • Make sure your audio input is sampled at 16 kHz, as the model requires it; a quick check is sketched after this list.
  • For import or syntax errors, double-check that every library is installed and imported exactly as shown above.
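
As a quick check on the 16 kHz requirement, you can load a file and print its sample rate (the path is hypothetical; the preprocessing function above handles the actual resampling):

python
# Hypothetical path; replace with one of your own files
waveform, sr = torchaudio.load("data/hindi/audio/sample.wav")
print(sr)  # if this is not 16000, speech_file_to_array_fn will resample it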

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following these steps, you can successfully fine-tune and use the Wav2Vec2 model for Hindi and Marathi speech recognition. This technology opens up new possibilities for creating interactive applications that require seamless understanding of spoken language.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
