The Wav2Vec2 model has reshaped automatic speech recognition (ASR), particularly for low-resource languages like Odia. This article walks through using a Wav2Vec2-Large-XLSR-53 model fine-tuned for Odia: setting it up, transcribing audio, and measuring its accuracy. Let’s turn your voice into data!
Getting Started
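To begin, make sure your speech input is sampled at 16 kHz, the rate the model expects. The fine-tuning process uses the OpenSLR dataset, which is crucial for achieving good results, while the code examples in this article test the model on the Odia ("or") configuration of Common Voice. Follow the steps below for setup and usage.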
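A quick way to confirm the 16 kHz requirement before running anything is to read the sample rate from the file header. Here is a minimal sketch for WAV files using Python's standard wave module. The helper names are our own for illustration, and other formats such as MP3 would need an audio library instead.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate the model expects, per the text above


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

For example, needs_resampling("clip.wav") returns True for a 48 kHz recording, where clip.wav is a placeholder path.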
Setup
- Ensure you have Python installed.
- Install the required libraries: torch, torchaudio, datasets, transformers.
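- Download the fine-tuned Odia model from the Hugging Face Model Hub: theainerd/wav2vec2-large-xlsr-53-odia, which is built on Wav2Vec2-Large-XLSR-53. The from_pretrained calls in the code download it automatically on first use.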
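To confirm the setup, you can print the installed version of each required library. This is a small sketch using importlib.metadata from the standard library. The helper name installed_versions is our own and not part of any of these packages.

```python
from importlib import metadata

REQUIRED_PACKAGES = ["torch", "torchaudio", "datasets", "transformers"]


def installed_versions(package_names):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can be added with pip before moving on.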
Usage: Running the Fine-Tuned Model
Now, let’s dive into the code. Think of fine-tuning as nurturing a plant: you start with a healthy seed (the pre-trained model) and with the right amount of sunlight (data) and water (training), you can grow a robust tree (an efficient speech recognition model). The checkpoint used here has already been through that process, so the code does not train anything. It loads the fine-tuned model and transcribes a few test clips.
python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice (2%) of the Odia Common Voice test split
test_dataset = load_dataset("common_voice", "or", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")

# Resample from 48 kHz (the assumed source rate) to the 16 kHz the model expects
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: read each audio file and store it as a resampled 1-D array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Convert the first two clips into padded model inputs
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token for each frame and decode it to text
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
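The last two steps of the code above, argmax followed by batch_decode, amount to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, and drop the blank token. The sketch below shows that idea in plain Python. The toy vocabulary, the blank index, and the function name are assumptions for illustration, and the real processor also handles word delimiters and special tokens.

```python
BLANK_ID = 0  # assumed index of the CTC blank token in this toy example


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=BLANK_ID):
    """Collapse repeated frame predictions and remove blanks."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the token changes and is not the blank
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_char[token_id])
        previous = token_id
    return "".join(decoded)


toy_vocab = {1: "a", 2: "b", 3: "c"}   # hypothetical vocabulary
frames = [0, 1, 1, 0, 1, 2, 2, 0, 3]   # per-frame argmax results
print(greedy_ctc_decode(frames, toy_vocab))  # prints "aabc"
```

Note how the blank between the two runs of "a" keeps them as separate characters, while the repeated frames inside each run are merged.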
Evaluation of the Model
To measure how well the fine-tuned model performs, you can evaluate it on the full Common Voice test split using the following code:
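python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and evaluation metric
# Note: load_metric is deprecated in recent datasets releases; the evaluate library's evaluate.load("wer") replaces it
test_dataset = load_dataset("common_voice", "or", split="test")
wer = load_metric("wer")

# Model and processor initialization
processor = Wav2Vec2Processor.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/wav2vec2-large-xlsr-53-odia")
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model.to(device)

# Preprocessing: read each audio file and resample it from 48 kHz (assumed source rate) to 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and store the decoded predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))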
Sample Output
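The evaluation prints the Word Error Rate (WER), which in this instance is reported as 68.75%. WER counts the word-level substitutions, deletions, and insertions needed to turn a prediction into its reference, divided by the number of words in the reference, so lower is better. At 68.75%, the model shows promise but still leaves substantial room for improvement.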
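To make the metric concrete, here is a minimal sketch of WER for a single sentence pair, computed as a word-level edit distance. The function name and the example sentences are our own, and library implementations may aggregate errors across a whole test set rather than per sentence.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.2%}")  # prints "WER: 33.33%"
```

By the same arithmetic, a WER of 68.75% means the number of word errors is a little over two thirds of the number of reference words.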
Troubleshooting Your Setup
If you encounter any issues during setup or execution, consider the following troubleshooting tips:
- Ensure all library versions are compatible with one another.
- Check your audio file paths and make sure they point to the correct location.
- Review your sampling rate; it must be set to 16 kHz.