How to Evaluate Automatic Speech Recognition Models Using Common Voice ID

In this article, we’ll explore how to evaluate automatic speech recognition (ASR) models on Common Voice ID, the Indonesian subset of Mozilla’s Common Voice dataset. We’ll go step by step, keeping things practical and easy to follow.

Getting Started

To begin our evaluation, we need to install the required libraries and set up the environment. Make sure you have PyTorch, Torchaudio, Transformers, and the Hugging Face datasets library installed in your Python environment.
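
A typical setup looks like the following (package names only; pin versions to match your environment as needed):

pip install torch torchaudio transformers datasets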

Step 1: Importing Libraries

We start by importing the necessary libraries. These will let us load the dataset and the model, and handle the audio processing.

import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import re
import sys

Step 2: Load Our Pre-trained Model

Instead of building a model from scratch, we’ll use a pre-trained model, specifically Wav2Vec2. This is like starting from a ready-made puzzle instead of trying to create one yourself. In our case, the model name is “munggok/xlsr_indonesia”, and we load both the model and its processor.

model_name = "munggok/xlsr_indonesia"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)

Step 3: Loading the Dataset

Now, let’s load the data that will be used for evaluation: the test split of Common Voice, specifically its Indonesian (id) configuration.

ds = load_dataset("common_voice", "id", split="test", data_dir=".cv-corpus-6.1-2020-12-11")
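
Before any processing, it can help to glance at what was loaded; the columns we rely on below are "path" (the audio file) and "sentence" (the reference transcript):

print(ds)
print(ds[0]["sentence"])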

Step 4: Resampling the Audio

Just as we tune our instruments before playing music, the speech data needs to be resampled so the model receives audio at the rate it expects. Common Voice clips are recorded at 48 kHz, while Wav2Vec2 expects 16 kHz input, so we resample from 48 kHz to 16 kHz.

resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

Step 5: Processing the Data

We need to prepare the data so the model can make sense of it. This involves loading and resampling each audio file and normalizing the reference sentences (lower-casing them and stripping punctuation), much like setting up ingredients before cooking.

# Punctuation to strip from the reference sentences before scoring (adjust the set to your needs)
chars_to_ignore_regex = r'[,?.!\-;:"“%‘”�]'

def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower().replace("’", "")
    return batch

ds = ds.map(map_to_array)
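
A quick look at one processed example confirms the new fields produced by the mapping above:

sample = ds[0]
print(sample["sampling_rate"])  # 16000 after resampling
print(sample["sentence"])       # cleaned, lower-cased transcript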

Step 6: Making Predictions

With our processed data in place, it’s time to let the model transcribe the speech, similar to letting a trained chef cook from a familiar recipe. We run the audio through the model in batches without gradient tracking, keep the most likely token at each time step (greedy decoding), and let the processor turn those tokens back into text.

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))
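
Before computing the aggregate score, it can be reassuring to spot-check a few predictions against their references:

for pred, ref in zip(result["predicted"][:3], result["target"][:3]):
    print(f"PRED: {pred}")
    print(f"REF : {ref}")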

Step 7: Evaluating the Results

Finally, we evaluate the model’s performance using word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn each prediction into its reference, divided by the total number of reference words. Lower is better.

wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))
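
As a toy illustration of how the metric behaves (the sentences below are made up for the example), dropping one word from a four-word reference gives a WER of 1/4 = 0.25:

toy_refs = ["saya suka makan nasi"]
toy_preds = ["saya suka makan"]
print(wer.compute(predictions=toy_preds, references=toy_refs))  # 0.25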

Result

After running the above code, you will get a WER score that reflects the accuracy of your model. The metric returns a fraction, which is usually quoted as a percentage; in our example the result was 25.7%, meaning roughly one word in four differs from the reference.

Troubleshooting

  • If you encounter an error indicating that the dataset cannot be found, ensure that the path to .cv-corpus-6.1-2020-12-11 is correct.
  • In case you experience issues with CUDA, check whether your hardware and PyTorch build support GPU processing (see the snippet after this list) or switch to “cpu”.
  • To resolve import errors, confirm that all required packages (PyTorch, Transformers, Torchaudio) have been installed correctly.
  • Ensure that your Python version is compatible with the above libraries.
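
For the CUDA point above, a quick way to check whether PyTorch can see a usable GPU is:

import torch
print(torch.cuda.is_available())   # True if a usable GPU is detected
print(torch.cuda.device_count())   # number of visible GPUs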

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following these steps, you have successfully evaluated an automatic speech recognition model using the Common Voice ID dataset. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
