How to Use Wav2Vec2 Base 960h for Automatic Speech Recognition

Nov 6, 2021 | Educational

Welcome to our guide on implementing the Wav2Vec2 Base 960h model for automatic speech recognition (ASR). This model, released by Facebook AI and fine-tuned on 960 hours of LibriSpeech audio, allows machines to transcribe spoken English with remarkable accuracy. Let’s embark on this journey together!

Getting Started

To begin using the Wav2Vec2 model, you will need to follow a series of steps to set up the environment and convert the necessary pre-trained files.

Installation and Setup

Here’s how you can set everything up:

pip install transformers[sentencepiece]
pip install -U fairseq
git clone https://github.com/huggingface/transformers.git
cp transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py .
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt -O ./wav2vec_small_960h.pt
mkdir dict
wget -P dict https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt
mkdir outputs
python convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path ./outputs --checkpoint_path ./wav2vec_small_960h.pt --dict_path ./dict

Just like baking a cake, where different ingredients come together to create a delicious treat, the installation process is all about gathering the right libraries and files to ensure your application can function properly. Each command plays a crucial role, from installing necessary packages to downloading pretrained models.

Using the Model

After the environment is ready, we can start transcribing audio files. Here’s a step-by-step implementation:

from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# Load the model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# Load dummy dataset and read sound files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# Tokenize and retrieve logits (no gradients needed for inference)
input_values = tokenizer(ds["speech"][:2], return_tensors="pt", padding="longest").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)
print(transcription)
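One practical note: wav2vec2-base-960h was pretrained on 16 kHz audio, so recordings at other sample rates should be resampled before being fed to the tokenizer. Here is a minimal sketch using plain NumPy linear interpolation; the `resample_linear` helper is our own illustration, and in practice librosa or torchaudio resamplers are better choices for audio quality:

```python
import numpy as np

def resample_linear(speech, orig_sr, target_sr=16000):
    """Naive linear-interpolation resampler (illustrative sketch only;
    prefer librosa.resample or torchaudio.transforms.Resample in practice)."""
    if orig_sr == target_sr:
        return speech
    duration = len(speech) / orig_sr
    n_target = int(round(duration * target_sr))
    # Time stamps of the original and target sample grids
    old_t = np.linspace(0.0, duration, num=len(speech), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(new_t, old_t, speech)
```

For example, a one-second clip recorded at 44.1 kHz comes back as 16,000 samples, ready for the tokenizer.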

Think of the model as a librarian sorting through piles of books (or audio files). The librarian takes in information (the audio), categorizes it (tokenizes it), and then retrieves the right content (the transcription) that you’re looking for.
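Under the hood, the argmax-and-decode step is greedy CTC decoding: the model emits one token per audio frame, then consecutive repeats are collapsed and blank tokens dropped. A toy sketch of that collapse rule (the token ids and blank id here are illustrative, not the model’s actual vocabulary):

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse consecutive repeats, then drop blanks -
    the core of greedy CTC decoding."""
    out = []
    prev = None
    for i in ids:
        # Keep a token only when it differs from the previous frame
        # and is not the CTC blank symbol
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out
```

For instance, the frame sequence `[1, 1, 0, 1, 2, 2, 0, 0, 3]` collapses to `[1, 1, 2, 3]` — repeated frames merge, but a blank between two identical tokens keeps them distinct.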

Evaluation

To evaluate the performance of the model on the LibriSpeech dataset, you can use the following code:

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import soundfile as sf
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

librispeech_eval = librispeech_eval.map(map_to_array)

def map_to_pred(batch):
    input_values = tokenizer(batch["speech"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])
print("WER:", wer(result["text"], result["transcription"]))

In this evaluation setup, you are essentially checking the effectiveness of your librarian by comparing the retrieved content against what was originally intended. This is akin to ensuring that the books returned by the librarian are indeed the ones you requested.
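The word error rate that jiwer reports is a word-level Levenshtein distance: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A self-contained sketch of the same computation (our own helper, not jiwer’s API):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with word-level edit distance (the same idea jiwer implements)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

So "the cat sat" versus "the bat sat" gives one substitution out of three words, a WER of about 0.33. Note that WER is case-sensitive, which is why reference and hypothesis should share the same casing.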

Troubleshooting

If you encounter any issues during installation or runtime, consider the following:

  • Ensure all dependencies are correctly installed. You might need to use pip install --upgrade [package_name] for updates.
  • Verify that the file paths to the pre-trained models are accurate and accessible.
  • If you run into memory issues, use smaller batch sizes during evaluation, or transcribe long recordings in chunks.
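On the memory point: long recordings can be transcribed chunk by chunk so that only a small window of audio is in GPU memory at once. A rough sketch — `transcribe_fn` stands in for the tokenizer-plus-model call from the sections above, and the 30-second chunk size is illustrative; hard chunk boundaries can split words, so overlapping windows give better results:

```python
def transcribe_long(speech, transcribe_fn, sr=16000, chunk_s=30):
    """Split a long waveform into fixed-size chunks, transcribe each with
    transcribe_fn (a stand-in for the model call), and join the pieces.
    Keeps peak memory low at the cost of possible errors at chunk edges."""
    step = chunk_s * sr
    parts = [transcribe_fn(speech[i:i + step])
             for i in range(0, len(speech), step)]
    return " ".join(p for p in parts if p)
```

Each chunk is decoded independently, so this trades a little accuracy at the seams for a bounded memory footprint.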

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following this guide, you will be well on your way to implementing Wav2Vec2 for automatic speech recognition tasks. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
