In this guide, we’ll walk you through evaluating an Automatic Speech Recognition (ASR) model on the Spanish Common Voice dataset, using a Wav2Vec2 model from Hugging Face.
Set Up the Environment
Before diving into the code, ensure you have the necessary libraries installed:
- torchaudio
- datasets
- transformers
- torch
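If any of these are missing, they can usually be installed with pip (exact versions depend on your environment; this is just one reasonable invocation):
pip install torch torchaudio transformers datasets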
The Concept Explained: An Analogy
Think of evaluating an ASR model as training a parrot to speak. First, the parrot listens to various phrases (your audio dataset), then attempts to repeat them (your model’s predictions). Finally, you compare what the parrot said to the original phrase to assess how well it learned (the evaluation). Each step matters, just as it does in our ASR pipeline.
Step-by-Step Guide
1. Import Necessary Libraries
First, we will import all the required libraries for our ASR model:
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import re
import sys
2. Load the Model and Dataset
Next, we will load our model and dataset:
model_name = "facebook/wav2vec2-large-xlsr-53-spanish"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'  # punctuation to strip from reference transcriptions
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)
# data_dir points to a locally downloaded copy of Common Voice 6.1
ds = load_dataset("common_voice", "es", split="test", data_dir="./cv-corpus-6.1-2020-12-11")
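Before preprocessing, a quick sanity check can confirm the split loaded correctly. This is just an illustrative sketch; it uses the same path and sentence columns the mapping functions below rely on:
# Sanity check: how many clips did we load, and what does one look like?
print(len(ds))            # number of examples in the test split
print(ds[0]["path"])      # path to the first audio file
print(ds[0]["sentence"])  # its reference transcription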
3. Preprocess the Audio Data
Common Voice clips are sampled at 48 kHz, while the model expects 16 kHz input, so we resample the audio and normalize the reference transcriptions:
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def map_to_array(batch):
    # Load the clip and resample it from 48 kHz to the 16 kHz the model expects
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    # Strip punctuation and lowercase the reference so it matches the model's vocabulary
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)
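To confirm the mapping behaved as intended, you can inspect one processed example (a quick sketch; after ds.map, the speech column is stored as a plain Python list of floats):
# Verify the preprocessing results on the first example
print(ds[0]["sampling_rate"])  # expected: 16000
print(ds[0]["sentence"])       # lowercased, punctuation stripped
print(len(ds[0]["speech"]))    # number of 16 kHz samples in the clip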
4. Generate Predictions
Now, it’s time to generate predictions from the model:
def map_to_pred(batch):
    # Convert a batch of waveforms into padded model inputs
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy CTC decoding: pick the most likely token at each timestep
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))
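Before computing the aggregate score, it helps to eyeball a few prediction/target pairs; here is a minimal sketch:
# Spot-check a few predictions against their references
for pred, target in zip(result["predicted"][:3], result["target"][:3]):
    print("PRED:  ", pred)
    print("TARGET:", target)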
5. Evaluate the Model
Finally, we will evaluate the model using Word Error Rate (WER):
wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))
Results
After running the evaluation, you should see a WER of approximately 17.6%. WER counts word-level substitutions, insertions, and deletions against the reference transcriptions, so this corresponds to roughly 17.6 errors per 100 reference words.
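To build intuition for the metric, here is a toy example with made-up strings (illustrative only, unrelated to the run above); it reuses the wer metric loaded earlier:
# Toy WER: one substituted word out of five reference words -> 0.2
toy_refs = ["hola como estas", "buenos dias"]
toy_preds = ["hola como esta", "buenos dias"]
print(wer.compute(predictions=toy_preds, references=toy_refs))  # 0.2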
Troubleshooting
If you encounter issues while running the code, consider the following troubleshooting tips:
- Ensure that all required libraries are correctly installed.
- Verify that your CUDA device is working if you are using GPU acceleration (see the quick check after this list).
- Check that the dataset path specified in your code is correct.
- If you get errors related to audio loading, confirm that your audio files exist and are not corrupted.
- If load_metric raises an import or deprecation error on a recent version of datasets, the metrics have moved to the separate evaluate library (evaluate.load("wer")).
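For the GPU bullet above, a minimal check (assuming only that torch is installed):
# Confirm CUDA is visible before moving the model to the GPU
import torch
print(torch.cuda.is_available())  # True if a usable CUDA device is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU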
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In this blog, we have walked through the process of evaluating an automatic speech recognition model using the Common Voice ES dataset. This is a foundational step in working with ASR technologies and will help you build more advanced speech recognition systems.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.