How to Fine-Tune the Wav2Vec2 Model for Speech Recognition in Odia

March 26, 2021

The Wav2Vec2 model has significantly advanced automatic speech recognition (ASR), particularly for low-resource languages like Odia. In this article, we will guide you through using a Wav2Vec2-Large-XLSR-53 model fine-tuned for the Odia language: setting it up, transcribing audio, and evaluating the results. Let’s turn your voice into data!

Getting Started

To begin, make sure that your speech input is sampled at 16 kHz, the rate the model expects. The fine-tuning process uses the OpenSLR dataset, while the code in this article tests the model on the Odia ("or") subset of Common Voice. Follow the steps below for setup and usage.
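As a quick way to confirm the 16 kHz requirement before running anything heavier, the following minimal sketch reads the sampling rate from a WAV file header using only the standard library. The helper names and `TARGET_RATE` are illustrative and not part of the original setup, and the sketch assumes uncompressed WAV input; for other formats such as MP3, the rate returned by `torchaudio.load` serves the same purpose.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

Calling `needs_resampling` on a hypothetical `clip.wav` recorded at 48 kHz would return `True`, signalling that the clip has to be resampled before it is passed to the processor.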

Setup

Ensure you have Python installed.
Install the required libraries: torch, torchaudio, datasets, transformers.
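Use the fine-tuned Odia model from the Hugging Face Model Hub, theainerd/wav2vec2-large-xlsr-53-odia (based on Wav2Vec2-Large-XLSR-53). It is downloaded automatically the first time from_pretrained is called.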
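Before running the examples, it can help to verify that the libraries listed above are importable. The sketch below is one possible check using only the standard library; the function name `missing_packages` is hypothetical and introduced here for illustration.

```python
import importlib.util

# Libraries listed in the setup steps above
REQUIRED = ["torch", "torchaudio", "datasets", "transformers"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

This only checks that each package can be located, not that the installed versions work together.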

Usage: Running the Fine-Tuned Model

Now, let’s dive into the code. Think of fine-tuning as nurturing a plant: you start with a healthy seed (the pre-trained model) and with the right amount of sunlight (data) and water (training), you can grow a robust tree (an efficient speech recognition model). The code below does not repeat that training; it loads the already fine-tuned Odia model and uses it to transcribe two clips from the Common Voice test split.

python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset
test_dataset = load_dataset("common_voice", "or", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")

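# Resample audio to the 16 kHz the model expects
# (this assumes the source clips are sampled at 48 kHz)
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: load each audio file and store it as a 16 kHz NumPy array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch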

test_dataset = test_dataset.map(speech_file_to_array_fn)

inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
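The `torch.argmax` call followed by `batch_decode` amounts to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary, the frame IDs, and the choice of ID 0 as the blank are assumptions made for illustration; the real tokenizer has its own vocabulary and additional handling such as word delimiters.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove blank tokens."""
    # Merge runs of identical consecutive IDs into a single ID
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Drop blanks; a blank between two identical IDs keeps both characters
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical vocabulary and per-frame argmax output
vocab = {0: "<pad>", 1: "a", 2: "b", 3: "c"}
frames = [1, 1, 0, 1, 2, 2, 0, 0, 3]
print(greedy_ctc_decode(frames, vocab))  # prints "aabc"
```

Note how the blank between the two runs of `1` is what allows the doubled "a" to survive the collapsing step.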

Evaluation of the Model

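To measure how well the model performs on the full Common Voice Odia test split, you can use the following code. As written, it moves the model to "cuda", so it requires a CUDA-capable GPU.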

python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and evaluation metric
test_dataset = load_dataset("common_voice", "or", split="test")
wer = load_metric("wer")

# Model and processor initialization
processor = Wav2Vec2Processor.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model.to("cuda")

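# Resample audio to the 16 kHz the model expects
# (this assumes the source clips are sampled at 48 kHz)
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: load each audio file and store it as a 16 kHz NumPy array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on a batch and store the greedy-decoded transcriptions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)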
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Sample Output

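The evaluation prints the Word Error Rate (WER), which in this instance is reported as 68.75%. WER is the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of words in the reference, so lower is better. A score this high means the model still makes errors on a large share of words: it shows promise, but there is substantial room for improvement.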
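To make the metric concrete, here is a minimal sketch of WER for a single reference and prediction pair, computed as a word-level edit distance. It is an illustration written for this article, not the implementation behind the `wer` metric loaded above, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print("WER: {:.2f}".format(100 * wer))  # prints "WER: 33.33"
```

Because insertions count as errors too, WER can exceed 100% when the prediction contains many extra words.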

Troubleshooting Your Setup

If you encounter any issues during setup or execution, consider the following troubleshooting tips:

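Ensure all library versions are compatible with one another. In particular, load_metric has been deprecated in newer releases of datasets in favor of the separate evaluate library.
Check your audio file paths and make sure they point to the correct location.
Review your sampling rate: the model expects 16 kHz input, so audio recorded at any other rate must be resampled first.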
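For the file path tip, a small check like the following can list any audio paths that do not exist before the preprocessing step tries to load them. This is a sketch using only the standard library; `missing_audio_files` is a hypothetical helper name, and the paths in the usage note are placeholders.

```python
from pathlib import Path


def missing_audio_files(paths):
    """Return the paths that do not point to an existing regular file."""
    return [str(path) for path in paths if not Path(path).is_file()]
```

For example, `missing_audio_files(["clips/a.mp3", "clips/b.mp3"])` returns whichever of those placeholder paths cannot be found, and an empty list if both exist.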
