How to Evaluate Automatic Speech Recognition Models on LibriSpeech

Welcome to the world of Automatic Speech Recognition (ASR), where machines are learning to understand human speech! This blog post will guide you through the process of evaluating a speech recognition model using the LibriSpeech dataset. Let’s get started!

Understanding the Components

Before diving into the code, let’s break down the components we will be using:

  • LibriSpeech Dataset: A widely used dataset in the ASR domain containing thousands of hours of speech data.
  • Speech2Text: a speech-to-text transformer model from the Transformers library (loaded via Speech2TextForConditionalGeneration) that converts audio into text.
  • JIWER Library: A tool used to calculate Word Error Rate (WER), which measures the performance of an ASR system (see the small example after this list).
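
WER counts the word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. Here is a minimal, self-contained sketch of jiwer in action (the sentences are made up purely for illustration):

from jiwer import wer

reference = "the cat sat on the mat"    # ground-truth transcript
hypothesis = "the cat sat on mat"       # model output with one word dropped

# One deletion out of six reference words -> WER of about 0.167
print(wer(reference, hypothesis))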

The Evaluation Script

Here’s a breakdown of our evaluation script through a fun analogy. Imagine we are baking a cake (our evaluation process), and we have several ingredients (code lines) that come together to create the final product (results of the evaluation).

Let’s look at our ingredients:


from datasets import load_dataset
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
import soundfile as sf
from jiwer import wer

# Load the LibriSpeech test set; change "clean" to "other" for the noisier test set
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

# Load the model and its processor; do_upper_case=True matches LibriSpeech's
# uppercase reference transcripts. (The officially maintained equivalent of this
# checkpoint is facebook/s2t-large-librispeech-asr.)
model = Speech2TextForConditionalGeneration.from_pretrained("valhalla/s2t_librispeech_large").to("cuda")
processor = Speech2TextProcessor.from_pretrained("valhalla/s2t_librispeech_large", do_upper_case=True)

def map_to_array(batch):
    # Read each audio file into a NumPy array (LibriSpeech audio is 16 kHz)
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    # Convert raw audio into log-mel input features, padded to a common length
    features = processor(batch["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")
    # Generate token IDs, then decode them back into text
    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["speech"])
print("WER:", wer(result["text"], result["transcription"]))

Code Explanation

Now, let’s explain the code analogy step-by-step:

  • Loading the Ingredients: The first few lines load the dataset, the model, and its processor. Think of this as gathering our flour, sugar, and eggs.
  • Preparing the Ingredients: The map_to_array function reads each audio file into an array, just like mixing our ingredients before baking.
  • Baking the Cake: The map_to_pred function turns the audio into features and generates transcriptions. It’s similar to placing our cake in the oven and waiting for it to rise.
  • Tasting the Cake (Evaluation): Finally, we calculate the WER, which tells us how close our “cake” came to the expected one. A quick way to eyeball individual predictions is shown after this list.
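
Before trusting the aggregate number, it often helps to see where the model goes wrong. This is a minimal sketch, assuming the result dataset from the script above is already in memory:

# Compare the first three reference transcripts with the model's output
for ref, hyp in zip(result["text"][:3], result["transcription"][:3]):
    print("REF:", ref)
    print("HYP:", hyp)
    print()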

Running the Evaluation

Simply copy the provided code into your Python environment and make sure the necessary libraries are installed. The script evaluates the model against the clean LibriSpeech test set. Since a full pass covers a few thousand utterances, you may want to try a small slice first, as sketched below.
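
Here is one way to run such a smoke test: the datasets library lets you pick a handful of examples with Dataset.select, and the rest of the pipeline stays unchanged. The snippet assumes librispeech_eval and map_to_pred from the script above:

# Smoke test: evaluate on the first 16 utterances only
small_eval = librispeech_eval.select(range(16))
small_result = small_eval.map(map_to_pred, batched=True, batch_size=8, remove_columns=["speech"])
print("WER (16 utterances):", wer(small_result["text"], small_result["transcription"]))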

Troubleshooting

If you encounter any issues while running the script, here are some troubleshooting ideas:

  • If you run into dependency errors, make sure all required libraries (torch, transformers, datasets, soundfile, and jiwer) are installed. You can do this using pip: pip install torch transformers datasets soundfile jiwer.
  • Ensure that your device supports CUDA if you want to leverage a GPU for faster processing. If you don’t have a GPU, you can run on CPU by changing .to("cuda") to .to("cpu"), or pick the device automatically as sketched after this list.
  • Verify that the audio file paths in the dataset are correct; mismatches will cause errors when soundfile tries to read the audio files.
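
If you'd rather not edit the device strings by hand, a common pattern is to detect CUDA once and reuse the result; this is a minimal sketch, reusing the model class from the script above:

import torch

# Fall back to CPU automatically when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Speech2TextForConditionalGeneration.from_pretrained("valhalla/s2t_librispeech_large").to(device)
# ...then replace every .to("cuda") in the script with .to(device)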

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Evaluating an Automatic Speech Recognition model using the LibriSpeech dataset can seem daunting, but by following this step-by-step guide, you can achieve insightful results while enjoying the process. Remember, practice makes perfect!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
