How to Use the Fine-tuned XLSR-53 Large Model for Speech Recognition in Persian

Dec 15, 2022 | Educational

Welcome to the world of automatic speech recognition! In this guide, we’ll explore how to utilize the fine-tuned XLSR-53 model specifically for recognizing Persian speech. Don’t worry if you’re new to programming or AI; we’ll walk you through every step in a user-friendly manner!

Understanding the Model

XLSR-53 is Facebook AI's cross-lingual Wav2Vec2 model, pretrained on speech from 53 languages; the version we use here was fine-tuned for Persian by Jonatas Grosman on the Common Voice dataset. Think of this model as a well-trained transcriber that listens to Persian speech and writes it down accurately, except it doesn't get tired and happily works through large batches of audio.

Requirements

  • Python 3 installed on your machine.
  • The HuggingSound library (install it with pip install huggingsound).
  • Audio files in .mp3 or .wav format, sampled at 16 kHz (the loading code below resamples them for you).

How to Use the Model

To get started, you can either use the HuggingSound library or write your own inference script. Let’s see how these methods work:

1. Using HuggingSound Library

Here’s how you can easily transcribe audio files using the library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)

2. Writing Your Own Inference Script

If you crave more control, you can write a custom script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fa"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
SAMPLES = 5

# "common_voice" is the legacy loading script; on newer datasets versions you may need "mozilla-foundation/common_voice_6_1"
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

In this code, think of each piece as part of a recipe: the audio files are the ingredients, the model is the mixing bowl, and the processing functions are the chef who makes sure everything comes together.
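One step worth demystifying is processor.batch_decode. For CTC models like this one, greedy decoding takes the argmax token per audio frame, collapses consecutive repeats, and drops the special blank token. Here is a simplified sketch (not the actual transformers implementation; the tiny vocabulary and blank id are made up for illustration):

```python
def greedy_ctc_decode(token_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks (simplified CTC decoding)."""
    decoded = []
    prev_id = None
    for t in token_ids:
        if t != prev_id and t != blank_id:
            decoded.append(id_to_char[t])
        prev_id = t
    return "".join(decoded)

# Toy vocabulary: id 0 is the CTC blank
vocab = {1: "س", 2: "ل", 3: "ا", 4: "م"}
# Per-frame argmax ids for a short utterance
frame_ids = [1, 1, 0, 2, 3, 3, 0, 4, 4]
print(greedy_ctc_decode(frame_ids, vocab))  # سلام ("salaam", hello)
```

The blank token is what lets CTC represent genuinely doubled letters: a blank between two identical ids keeps them from being collapsed into one.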

Evaluating the Model

Once you’ve transcribed the audio, you may want to assess the accuracy of the model’s predictions. To do this, compare your predictions to the original sentences using metrics like Word Error Rate (WER) and Character Error Rate (CER). You can achieve this with another script:

import torch
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

lang_id = "fa"
model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
device = "cuda"

test_dataset = load_dataset("common_voice", lang_id, split="test")
# Custom metric scripts from https://github.com/jonatasgrosman/wav2vec2-sprint
# (needed for the chunk_size argument used below)
wer = load_metric("wer.py")
cer = load_metric("cer.py")

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).to(device)

# The test split only stores file paths, so load and resample each clip to 16 kHz first
def speech_file_to_array_fn(batch):
    batch["speech"], _ = librosa.load(batch["path"], sr=16_000)
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100:.2f}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100:.2f}")

This evaluation step is like grading an exam; it helps you see how well the model performed by comparing its answers to the correct ones.
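To build intuition for the numbers the script prints, WER is the word-level edit distance (substitutions, deletions, insertions) between prediction and reference, divided by the number of reference words. A minimal, dependency-free version (the metric scripts used above are more robust and chunked):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words instead of characters
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sit"))  # 0.333... (1 substitution / 3 words)
```

CER is the same calculation run over characters instead of words, which is why it is usually lower than WER for the same transcript.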

Troubleshooting Tips

If you run into any issues while using the model, consider the following troubleshooting tips:

  • Make sure your audio is sampled at 16 kHz; the model was trained on 16 kHz input, so resample anything else first (librosa.load with sr=16_000 handles this), or transcription quality will degrade badly.
  • Check if the library versions are up to date; sometimes compatibility issues can cause errors.
  • If you receive errors related to memory, reduce the batch size in the evaluation script.
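The 16 kHz requirement in the first tip is worth unpacking: resampling means re-estimating the waveform at a new rate. Here is a toy numpy-only linear-interpolation resampler to illustrate the idea (in real code, simply pass sr=16_000 to librosa.load as the scripts above do; librosa applies a proper band-limited resampler):

```python
import numpy as np

def resample_to_16k(y, orig_sr, target_sr=16_000):
    """Toy linear-interpolation resampler, for illustration only."""
    n_out = int(round(len(y) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(y), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, y)

# One second of a 440 Hz tone standing in for a real recording at 44.1 kHz
orig_sr = 44_100
y = np.sin(2 * np.pi * 440 * np.arange(orig_sr) / orig_sr)
y_16k = resample_to_16k(y, orig_sr)
print(len(y), "->", len(y_16k))  # 44100 -> 16000
```

Feeding 44.1 kHz audio to a model that expects 16 kHz effectively plays the sound at the wrong speed from the model's point of view, which is why the output becomes gibberish rather than just slightly worse.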

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With this guide, you’re equipped to dive into the exciting world of speech recognition in Persian using the XLSR-53 model! Remember that practice makes perfect, and don’t hesitate to tweak the code to fit your needs. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
