Automatic speech recognition (ASR) is a rapidly growing area of artificial intelligence that enables machines to convert spoken language into written text. One powerful model for ASR is Facebook's Wav2Vec2; the facebook/wav2vec2-base-960h checkpoint has been pretrained and fine-tuned on 960 hours of LibriSpeech audio. This blog will guide you through using the model to transcribe audio files and evaluate its performance.
Setting Up Wav2Vec2
Before diving into the transcribing process, ensure you have the necessary libraries installed. You will need the transformers and datasets libraries from Hugging Face, along with torch for running inference and jiwer for computing the Word Error Rate.
pip install transformers datasets torch jiwer
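To confirm the installation worked, you can check that all four libraries import cleanly:

python -c "import transformers, datasets, torch, jiwer; print('all imports OK')"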
Transcribing Audio Files
Think of Wav2Vec2 as a skilled interpreter fluent in both speech and writing: it listens to the audio, picks up on its nuances, and converts it into the transcript you are waiting for. Here's how to use it:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# Load the pretrained model and its processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load a small example dataset (already sampled at 16 kHz)
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Convert the raw waveform into model inputs
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# Run inference and pick the most likely token at each time step
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the predicted token IDs back into text
transcription = processor.batch_decode(predicted_ids)
print(transcription)
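If you want to transcribe your own recording instead of the demo dataset, you can load it with librosa, which resamples to the required 16 kHz on the fly. This is a minimal sketch reusing the processor and model from above; the file name my_audio.wav is a placeholder for your own file:

import librosa

# librosa resamples the file to 16 kHz while loading (my_audio.wav is hypothetical)
speech, rate = librosa.load("my_audio.wav", sr=16_000)

input_values = processor(speech, sampling_rate=16_000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1)))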
Evaluating the Model’s Performance
After transcribing, it's essential to evaluate the quality of your results. The Word Error Rate (WER) is the standard metric for this purpose: it counts the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the model's transcription into the reference text, divided by the number of words (N) in the reference, i.e. WER = (S + D + I) / N. A low WER indicates a high-quality transcription.
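As a quick sanity check, here is how jiwer computes WER on a toy pair of sentences (the example strings are made up for illustration):

from jiwer import wer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"

# One substitution ("a" for "the") out of six reference words gives a WER of about 0.167
print(wer(reference, hypothesis))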
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# Load the LibriSpeech "clean" test set (swap "clean" for "other" to evaluate the harder split)
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    audio = batch["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single transcription for this example
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

# Transcribe every example in the test set, one at a time
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
Understanding the Results
Running the evaluation script on the "clean" configuration, and again after swapping in the "other" configuration, gives you one WER figure for each test set. For facebook/wav2vec2-base-960h, the reported results are:
- Clean WER: 3.4%
- Other WER: 8.6%
A lower WER signifies a more accurate transcription, providing confidence in the model’s performance.
Troubleshooting Tips
If you encounter any issues while using Wav2Vec2, here are some troubleshooting ideas:
- Ensure that the audio input is sampled at 16 kHz, since the model was trained on 16 kHz speech; a resampling sketch follows this list.
- Check your environment setup and confirm that all required libraries are properly installed.
- If using a GPU, verify that your CUDA environment is correctly configured.
- In case of unexpected results or errors, refer to the official documentation or community forums for solutions.
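If your audio is stored at a different sampling rate, the datasets library can resample it on the fly when decoding. A minimal sketch, assuming a dataset with an "audio" column (shown here with the dummy set used earlier):

from datasets import load_dataset, Audio

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Re-cast the audio column so every example is decoded at 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
print(ds[0]["audio"]["sampling_rate"])  # 16000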
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
