In this guide, we’ll walk you through evaluating an Automatic Speech Recognition (ASR) model on the Spanish Common Voice dataset, using a Wav2Vec2 model from Hugging Face.
Set Up the Environment
Before diving into the code, ensure you have the necessary libraries installed:
- torchaudio
- datasets
- transformers
- torch
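If any of these are missing, they can usually be installed with pip (exact versions depend on your environment; this is just one reasonable invocation):
pip install torch torchaudio transformers datasets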
The Concept Explained: An Analogy
Think of evaluating an ASR model as training a parrot to speak. First, the parrot listens to various phrases (your audio dataset), then attempts to repeat them (your model’s predictions). Finally, you compare what the parrot said to the original phrase to assess how well it learned (the evaluation). Each step matters, just as it does in our ASR pipeline.
Step-by-Step Guide
1. Import Necessary Libraries
First, we will import all the required libraries for our ASR model:
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import re
import sys
2. Load the Model and Dataset
Next, we will load our model and dataset:
model_name = "facebook/wav2vec2-large-xlsr-53-spanish"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'  # punctuation to strip from reference transcriptions
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)
# data_dir points to a locally downloaded copy of Common Voice 6.1
ds = load_dataset("common_voice", "es", split="test", data_dir="./cv-corpus-6.1-2020-12-11")
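Before preprocessing, a quick sanity check can confirm the split loaded correctly. This is just an illustrative sketch; it uses the same path and sentence columns the mapping functions below rely on:
# Sanity check: how many clips did we load, and what does one look like?
print(len(ds))            # number of examples in the test split
print(ds[0]["path"])      # path to the first audio file
print(ds[0]["sentence"])  # its reference transcription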
3. Preprocess the Audio Data
Common Voice clips are sampled at 48 kHz, while the model expects 16 kHz input, so we resample the audio and normalize the reference transcriptions:
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def map_to_array(batch):
    # Load the clip and resample it from 48 kHz to the 16 kHz the model expects
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    # Strip punctuation and lowercase the reference so it matches the model's vocabulary
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)
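To confirm the mapping behaved as intended, you can inspect one processed example (a quick sketch; after ds.map, the speech column is stored as a plain Python list of floats):
# Verify the preprocessing results on the first example
print(ds[0]["sampling_rate"])  # expected: 16000
print(ds[0]["sentence"])       # lowercased, punctuation stripped
print(len(ds[0]["speech"]))    # number of 16 kHz samples in the clip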
4. Generate Predictions
Now, it’s time to generate predictions from the model:
def map_to_pred(batch):
    # Convert a batch of waveforms into padded model inputs
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy CTC decoding: pick the most likely token at each timestep
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))
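Before computing the aggregate score, it helps to eyeball a few prediction/target pairs; here is a minimal sketch:
# Spot-check a few predictions against their references
for pred, target in zip(result["predicted"][:3], result["target"][:3]):
    print("PRED:  ", pred)
    print("TARGET:", target)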
5. Evaluate the Model
Finally, we will evaluate the model using Word Error Rate (WER):
wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))
Results
After running the evaluation, you should see a WER of approximately 17.6%. WER counts word-level substitutions, insertions, and deletions against the reference transcriptions, so this corresponds to roughly 17.6 errors per 100 reference words.
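To build intuition for the metric, here is a toy example with made-up strings (illustrative only, unrelated to the run above); it reuses the wer metric loaded earlier:
# Toy WER: one substituted word out of five reference words -> 0.2
toy_refs = ["hola como estas", "buenos dias"]
toy_preds = ["hola como esta", "buenos dias"]
print(wer.compute(predictions=toy_preds, references=toy_refs))  # 0.2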
Troubleshooting
If you encounter issues while running the code, consider the following troubleshooting tips:
- Ensure that all required libraries are correctly installed.
- Verify that your CUDA device is working if you are using GPU acceleration (see the quick check after this list).
- Check that the dataset path specified in your code is correct.
- If you get errors related to audio loading, confirm that your audio files exist and are not corrupted.
- If load_metric raises an import or deprecation error on a recent version of datasets, the metrics have moved to the separate evaluate library (evaluate.load("wer")).
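For the GPU bullet above, a minimal check (assuming only that torch is installed):
# Confirm CUDA is visible before moving the model to the GPU
import torch
print(torch.cuda.is_available())  # True if a usable CUDA device is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU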
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In this blog, we have walked through the process of evaluating an automatic speech recognition model using the Common Voice ES dataset. This is a foundational step in working with ASR technologies and will help you build more advanced speech recognition systems.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.