Welcome to the world of automatic speech recognition! In this guide, we’ll explore how to use the fine-tuned XLSR-53 model to recognize Persian speech. Don’t worry if you’re new to programming or AI; we’ll walk you through every step in a friendly, practical way!
Understanding the Model
The XLSR-53 model, fine-tuned by Jonatas Grosman, builds on Wav2Vec2, a speech-representation technology developed by Facebook AI, and was trained on Persian data from the Common Voice dataset. Think of this model as a well-trained transcriber who listens to Persian speech and writes it down accurately, and who never gets tired. Like any model, though, it works best on clear audio at the sample rate it expects.
Requirements
- Python installed on your machine.
- The HuggingSound library.
- Audio files in .mp3 or .wav format, sampled at 16 kHz (see the resampling sketch below if your files use a different rate).
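If your recordings are not already at 16 kHz, you can resample them first. Here is a minimal sketch using librosa and soundfile; the file names are placeholders, not files referenced elsewhere in this guide:

import librosa
import soundfile as sf

# librosa resamples to the target rate while loading
audio, sr = librosa.load("input.mp3", sr=16_000)

# Write the result out as a 16 kHz WAV file ready for the model
sf.write("output_16k.wav", audio, samplerate=16_000)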
How to Use the Model
To get started, you can either use the HuggingSound library or write your own inference script. Let’s see how these methods work:
1. Using HuggingSound Library
First, install the library (pip install huggingsound). Then transcribing audio files takes just a few lines:
from huggingsound import SpeechRecognitionModel

# Load the fine-tuned Persian model from the Hugging Face Hub
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-persian")

# List the audio files you want to transcribe
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
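Each item that transcribe returns is a dictionary; in current HuggingSound releases it carries the text under a transcription key (worth verifying against the version you have installed). A quick way to inspect the output:

for path, result in zip(audio_paths, transcriptions):
    print(path, "->", result["transcription"])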
2. Writing Your Own Inference Script
If you crave more control, you can write a custom script:
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fa"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
SAMPLES = 5

# Grab a handful of Persian samples from the Common Voice test split
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Read each audio file and resample it to the 16 kHz the model expects
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run the model without tracking gradients, then keep the most likely token at each frame
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
In this code, think of each piece as an ingredient in a recipe: the audio files are the main components, the model is the mixing bowl, and the processing functions are the chef who makes sure everything is combined perfectly!
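If you want to transcribe your own recording rather than Common Voice samples, the same processor and model objects can be reused. Here is a minimal sketch, where my_recording.wav is a placeholder path:

# Load and resample a local file, then decode it exactly as above
speech, _ = librosa.load("my_recording.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])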
Evaluating the Model
Once you’ve transcribed the audio, you may want to assess the accuracy of the model’s predictions. To do this, compare the predictions to the reference sentences using metrics like Word Error Rate (WER, the fraction of words the model gets wrong) and Character Error Rate (CER, the same idea at the character level). You can do this with another script:
import torch
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

lang_id = "fa"
model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
device = "cuda" if torch.cuda.is_available() else "cpu"

test_dataset = load_dataset("common_voice", lang_id, split="test")

# wer.py and cer.py are custom metric scripts from the model author's
# evaluation setup; the stock "wer"/"cer" metrics do not accept chunk_size
wer = load_metric("wer.py")
cer = load_metric("cer.py")

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).to(device)

# Load and resample each file so batch["speech"] exists before evaluate() runs
def speech_file_to_array_fn(batch):
    batch["speech"], _ = librosa.load(batch["path"], sr=16_000)
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
This evaluation step is like grading an exam; it helps you see how well the model performed by comparing its answers to the correct ones.
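To make WER concrete: it counts the word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. Here is a tiny sketch using the third-party jiwer package, which is not part of the scripts above:

from jiwer import wer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

# One deleted word out of six reference words, so WER is about 0.167
print(wer(reference, hypothesis))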
Troubleshooting Tips
If you run into any issues while using the model, consider the following troubleshooting tips:
- Ensure your audio is sampled at 16 kHz (the scripts above resample with librosa, but other pipelines may not); a mismatched sample rate can produce garbled transcriptions.
- Check that your library versions are compatible; mismatches between transformers, datasets, and huggingsound can cause errors.
- If you receive errors related to memory, reduce the batch size in the evaluation script, as shown below.
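Lowering the batch size in the evaluation script above is a one-line change; smaller batches trade speed for a lower peak memory footprint:

result = test_dataset.map(evaluate, batched=True, batch_size=2)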
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With this guide, you’re equipped to dive into the exciting world of speech recognition in Persian using the XLSR-53 model! Remember that practice makes perfect, and don’t hesitate to tweak the code to fit your needs. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.