How to Use a Fine-tuned Wav2Vec2 Model for Speech Recognition in English

Dec 26, 2022 | Educational


Speech recognition lets machines turn spoken language into text. In this blog post, we walk through how to use a fine-tuned facebook/wav2vec2-base model to transcribe English audio files with the Hugging Face `transformers` library.

Overview of the Model

We’ve fine-tuned the facebook/wav2vec2-base model on English speech using a dataset known as zodata, which comprises 307,912 transcribed voice samples. Out of these, 6,158 samples were used for training and 6,036 for testing, yielding a Word Error Rate (WER) of 0.340 on the test audio. WER is an error measure, not an accuracy, so lower is better. Make sure your audio is sampled at 16 kHz, the rate the model expects.

Setting Up Your Environment

Before running the transcription, install the required libraries: `transformers`, `datasets`, and `torch`, plus `jiwer`, which the evaluation code imports. You can do so using pip:

pip install transformers datasets torch jiwer
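
To confirm the installation worked, a quick check along these lines can report which packages are present. This is a minimal sketch using only the standard library; the `installed_versions` helper is a name made up for this example, and `jiwer` is listed because the evaluation code imports it.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as published on PyPI; jiwer is only needed for evaluation.
REQUIRED = ["transformers", "datasets", "torch", "jiwer"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing can be installed with pip before moving on.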

Using the Model to Transcribe Audio Files

The following steps guide you through using our fine-tuned model as a standalone acoustic model:

Import Required Libraries
Load the Pretrained Model and Processor
Load the Dataset
Tokenize Input Values
Get Transcription

Below is a code snippet illustrating these steps:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# Load the fine-tuned model and its processor
processor = Wav2Vec2Processor.from_pretrained("souzanzomodel")
model = Wav2Vec2ForCTC.from_pretrained("souzanzomodel")

# Load a small dummy dataset; each example carries a decoded audio array
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]

# Convert the raw waveform into model inputs (batch size 1)
input_values = processor(
    audio["array"],
    sampling_rate=audio["sampling_rate"],
    return_tensors="pt",
    padding="longest",
).input_values

# Retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Take the argmax over the vocabulary and decode the ids to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
```
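
The argmax step yields one token id per audio frame, and `batch_decode` turns those ids into text using CTC rules: merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary and ids are invented for this example and are not the model's real vocabulary; it assumes the convention of Wav2Vec2 tokenizers, where the padding token typically serves as the CTC blank and `|` marks word boundaries.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only; the real model ships its own.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "T", 5: "E", 6: "R"}

def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    # 1. Merge runs of identical ids (one symbol usually spans several frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the blank token, which separates genuinely repeated letters.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # 3. Turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()

# Frames: H H <pad> I I | T T <pad> <pad> H E R E
frames = [2, 2, 0, 3, 3, 1, 4, 4, 0, 0, 2, 5, 6, 5]
print(ctc_greedy_decode(frames))  # HI THERE
```

The blank matters for doubled letters: the frames `[5, 5, 0, 5]` decode to `EE`, whereas without the blank in between they would collapse to a single `E`.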

Evaluating the Model

You can evaluate the performance of our model by comparing its predictions against the ground truth data. Here’s how:

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
# Note: this snippet requires a CUDA-capable GPU
model = Wav2Vec2ForCTC.from_pretrained("souzanzomodel").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("souzanzomodel")

def map_to_pred(example):
    # Called once per example, so example["audio"] is a single dict
    audio = example["audio"]
    input_values = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single transcription string
    example["transcription"] = processor.batch_decode(predicted_ids)[0]
    return example

# Map one example at a time and drop the audio column from the result
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer(list(result["text"]), list(result["transcription"])))
```
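
To make the reported number less of a black box, here is a minimal sketch of how a word error rate can be computed for a single sentence pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The `word_error_rate` function is written for this post as an illustration and is not jiwer's implementation.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.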

Understanding the Code with an Analogy

Imagine you are teaching a child to recognize words by ear. You start by playing recordings of different people talking (the audio samples) and encourage the child to repeat what they hear (the training phase). Once they have practiced enough, you take them into a real-world setting to see how they cope with unfamiliar voices (the test set). The child plays the role of the model, and the notebook in which they write down what they hear is the transcription. The WER (Word Error Rate) acts as the scorecard: it counts how many words in the notebook differ from what was actually said, so a lower score is better. And just as a child learns best from clear recordings, the model works best with audio sampled at 16 kHz.

Troubleshooting

If you encounter any issues, here are some troubleshooting tips:

Ensure that your audio files are correctly formatted and sampled at 16kHz.
Check that all required packages are installed and updated to the latest versions.
If the model fails to load, verify that the model name is correctly specified and accessible.
For model-related issues, consider consulting the official documentation of Hugging Face or participating in community forums.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
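
For the first tip, the sampling rate of a WAV file can be read from its header before you feed it to the model. The sketch below uses Python's built-in `wave` module, which only handles uncompressed PCM WAV files; the helper names are invented for this example. The resampling itself needs an audio library and is not shown here.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the rate the model expects

def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

For example, `needs_resampling("my_recording.wav")` (a placeholder file name) returns `True` for a 44.1 kHz recording, which should then be resampled to 16 kHz before transcription.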

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
