How to Use XLSR Wav2Vec2 for Automatic Speech Recognition in Spanish

Apr 23, 2022 | Educational

If you’re looking to harness the power of AI for Automatic Speech Recognition (ASR) in Spanish, you’ve come to the right place! This guide will walk you through the steps to use the XLSR Wav2Vec2 model effectively.

Getting Started with XLSR Wav2Vec2

The XLSR Wav2Vec2 model fine-tuned for Spanish is a robust tool for transcribing audio. It combines multilingual wav2vec 2.0 (XLSR) pretraining with fine-tuning on the Spanish portion of the Common Voice dataset, providing impressive accuracy. Here’s how you can set it up and use it:

Step 1: Install Required Libraries

Ensure you have the necessary libraries installed. You can use pip to install them:

pip install asrecognition torch librosa datasets transformers

Step 2: Load the Model

Use the following code snippet to load the XLSR Wav2Vec2 model:

from asrecognition import ASREngine
asr = ASREngine('es', model_path='jonatasgrosman/wav2vec2-large-xlsr-53-spanish')

In this analogy, think of the ASR model as a very skilled translator who is fluent in Spanish and specializes in audio transcriptions. By loading the model, you’re essentially summoning this translator to help you decode your audio files.

Step 3: Transcribe Audio Files

With the model loaded, you can now transcribe audio files. Prepare your audio files in .mp3 or .wav formats. Use the following code:

audio_paths = ['path/to/your_file.mp3', 'path/to/another_file.wav']
transcriptions = asr.transcribe(audio_paths)
print(transcriptions)

This part of the process is like playing the recording for our translator. The translator listens and then provides you with the written text from the audio.

Step 4: Writing Your Inference Script

If you prefer to write your own inference script, here’s how:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = 'es'
MODEL_ID = 'jonatasgrosman/wav2vec2-large-xlsr-53-spanish'
SAMPLES = 10

# Load the dataset
test_dataset = load_dataset('common_voice', LANG_ID, split=f'test[:{SAMPLES}]')
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch['path'], sr=16_000)
    batch['speech'] = speech_array
    batch['sentence'] = batch['sentence'].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print(f"Reference: {test_dataset[i]['sentence']}")
    print(f"Prediction: {predicted_sentence}")

This script processes the audio files and uses the translator to convert them into text, much like how you’d take notes while listening to an interview.

Evaluating the Model

Once you have your transcriptions, you can evaluate the model’s performance using:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-spanish --dataset mozilla-foundation/common_voice_6_0 --config es --split test
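If you just want a quick sanity check without the eval script, word error rate (WER) can be computed directly from a reference/prediction pair. Below is a minimal, dependency-free sketch; for serious evaluation, established libraries such as jiwer offer more robust implementations:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer('HOLA QUE TAL', 'HOLA KE TAL'))  # one substitution over three words
```

A WER of 0.0 means a perfect transcription; lower is better.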

Troubleshooting Ideas

If you encounter any issues, consider the following troubleshooting tips:

  • Make sure your audio is sampled at 16 kHz, which the model expects; loading with librosa.load(path, sr=16_000), as in the script above, resamples automatically.
  • Ensure all libraries are up to date. Use the command: pip install --upgrade asrecognition torch librosa datasets transformers
  • If there’s still an issue, double-check the paths to your audio files. Ensure that they exist and are accessible.
  • For further insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following these instructions, you can effectively leverage the XLSR Wav2Vec2 model for transcribing Spanish audio. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox