In this article, we’ll walk through evaluating an automatic speech recognition (ASR) model on the Indonesian ("id") subset of the Common Voice dataset, one step at a time.
Getting Started
To begin, install the required libraries and set up your environment. Make sure PyTorch, Transformers, Datasets, and Torchaudio are installed in your Python environment; the WER metric used in Step 7 also depends on the jiwer package.
Step 1: Importing Libraries
We start by importing the necessary libraries. They let us load the dataset and the model and process the audio.
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import re
import sys
Step 2: Load Our Pre-trained Model
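Instead of building a model from scratch, we’ll use a pre-trained model, specifically Wav2Vec2. This is like finding a ready-made puzzle instead of trying to create one from scratch. In our case, the model name is “munggok/xlsr_indonesia”. We also load the matching processor, which prepares the audio features and decodes the model output back to text.
model_name = "munggok/xlsr_indonesia"
device = "cuda"  # or "cpu" if CUDA is not available
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)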
Step 3: Loading the Dataset
Now, let’s load the dataset used for evaluation: the test split of the Indonesian ("id") configuration of Common Voice, read from a local copy of the corpus.
ds = load_dataset("common_voice", "id", split="test", data_dir=".cv-corpus-6.1-2020-12-11")
Step 4: Resampling the Audio
Just like how we need to tune our instruments before playing music, the audio must match the sampling rate the model expects. Common Voice clips are stored at 48 kHz, while Wav2Vec2 models expect 16 kHz input, so we resample from 48 kHz to 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)
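To build intuition for what resampling changes, the sketch below computes how many samples a clip holds before and after going from 48 kHz to 16 kHz. The helper name resampled_length is hypothetical, and the ceiling rounding is an assumption; a real resampler's output length may differ by a sample.

```python
import math


def resampled_length(num_samples: int, orig_freq: int = 48_000, new_freq: int = 16_000) -> int:
    """Approximate sample count after resampling; the duration in seconds is unchanged."""
    return math.ceil(num_samples * new_freq / orig_freq)


# A 5-second clip at 48 kHz holds 240,000 samples.
clip_samples = 5 * 48_000
print(resampled_length(clip_samples))  # 80000
```

The clip keeps its duration; only the number of samples per second drops by a factor of three.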
Step 5: Processing the Data
We need to prepare our data for the model to make sense of it. This involves loading audio files and cleaning the sentences, much like setting up ingredients before cooking.
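# Punctuation to strip from transcripts (assumed set; match it to the model's training preprocessing)
chars_to_ignore_regex = r'[,?.!\-;:"]'

def map_to_array(batch):
    # Load the clip and resample it to 16 kHz
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    # Strip punctuation, lowercase, and drop curly apostrophes
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower().replace("’", "")
    return batch

ds = ds.map(map_to_array)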
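The text-cleaning part of map_to_array can be tried in isolation on a single string. The sketch below is standalone; the constant CHARS_TO_IGNORE and the helper normalize_sentence are hypothetical names, and the punctuation set is an assumption rather than the exact set used to train the model.

```python
import re

# Assumed punctuation set for illustration; adjust to your model's preprocessing.
CHARS_TO_IGNORE = r'[,?.!\-;:"]'


def normalize_sentence(sentence: str) -> str:
    """Strip punctuation, lowercase, and drop curly apostrophes."""
    cleaned = re.sub(CHARS_TO_IGNORE, "", sentence)
    return cleaned.lower().replace("’", "")


print(normalize_sentence("Halo, apa kabar?"))  # halo apa kabar
```

Applying the same normalization to the references that the model saw during training matters, because WER counts any mismatch in punctuation or casing as a word error.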
Step 6: Making Predictions
With our processed data in place, it’s time to let our model make predictions on the speech. This is similar to allowing a trained chef to cook based on specified recipes.
def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))
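Taking the argmax per frame and then decoding is greedy CTC decoding: repeated ids are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The following is a minimal pure-Python sketch of that idea. The vocabulary, token ids, and the function name greedy_ctc_decode are made up for illustration; a real processor loads its own vocabulary and handles more cases.

```python
from itertools import groupby

# Toy vocabulary for illustration only; a real processor loads its own.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "a", 4: "l", 5: "o"}
BLANK_ID = 0          # CTC blank (the pad token in this toy setup)
WORD_DELIMITER = "|"  # marks word boundaries


def greedy_ctc_decode(frame_ids):
    """Collapse repeated ids, drop blanks, and map the remaining ids to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for one clip (hypothetical values).
frames = [2, 2, 0, 3, 3, 4, 4, 0, 5, 1, 1, 3, 0]
print(greedy_ctc_decode(frames))  # halo a
```

A blank between two identical ids keeps both letters, which is how CTC can still emit doubled characters.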
Step 7: Evaluating the Results
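Finally, we evaluate the model’s performance using word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcripts. It tells us how accurately the model transcribes audio to text; lower is better.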
wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))
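To see what the metric measures, here is a small self-contained sketch of WER based on word-level edit distance. It is an illustration, not the library's implementation; the function names and example sentences are hypothetical, and it assumes whitespace tokenization and at least one reference word.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words (corpus level)."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return errors / total_words


refs = ["saya suka makan nasi", "selamat pagi"]
hyps = ["saya suka nasi", "selamat pagi semua"]
print(word_error_rate(refs, hyps))  # 2 errors over 6 reference words
```

The first pair has one deleted word and the second has one inserted word, so the score is 2/6.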
Result
After running the above code, the script prints the WER as a fraction (multiply by 100 to get a percentage). WER counts errors, so a lower value means a better model. In our example, the result was 25.7%.
Troubleshooting
- If you encounter an error indicating that the dataset cannot be found, ensure that the path to .cv-corpus-6.1-2020-12-11 is correct.
- In case you experience issues with CUDA, check if your hardware supports GPU processing or switch to “cpu”.
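- To resolve import errors, confirm that all required packages (PyTorch, Transformers, Datasets, Torchaudio) have been installed correctly. Note that load_metric was deprecated and later removed from the Datasets library in favor of the separate evaluate package, so this code needs an older Datasets release.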
- Ensure that your Python version is compatible with the above libraries.
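For the CUDA issue above, one option is to choose the device at runtime instead of hard-coding it. The sketch below is a defensive variant; pick_device is a hypothetical helper, and falling back to "cpu" when PyTorch cannot be imported is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing: report CPU so the install problem surfaces later.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting device string can be used wherever the tutorial sets device by hand.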
Conclusion
By following these steps, you have successfully evaluated an automatic speech recognition model using the Common Voice ID dataset.