How to Fine-tune Wav2Vec2 for Automatic Speech Recognition in Lithuanian

Nov 8, 2021 | Educational

Welcome to our step-by-step guide on fine-tuning the Wav2Vec2 model for Automatic Speech Recognition (ASR), tailored to the Lithuanian language. In this article, we walk you through the entire process while keeping the technical side of speech recognition approachable.

Understanding Wav2Vec2

The Wav2Vec2 model, developed by Facebook, is a state-of-the-art tool for interpreting speech. It captures features from audio signals, transforming them into textual representations. Imagine Wav2Vec2 as an expert translator who listens to a foreign language and provides a written translation while learning from their experiences.

Prerequisites

  • Basic understanding of Python programming.
  • Familiarity with machine learning concepts.
  • Access to a GPU for faster processing (optional but recommended).

Installation Guide

To begin your journey, you’ll need to install the necessary packages. Open your command line and enter the following commands:

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
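Before moving on, it helps to confirm that everything installed cleanly. A quick sanity check (using only the standard library, so it runs even if some installs failed) is:

```python
import importlib.util

def missing_packages(names):
    """Return the packages from `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# The packages installed by the commands above
required = ["datasets", "transformers", "torchaudio", "librosa", "jiwer"]
print("Missing:", missing_packages(required))
```

If the list is non-empty, re-run the corresponding pip command before continuing.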

Setting Up the Normalizer

The normalizer script pre-processes the reference transcriptions (lower-casing, stripping punctuation, and so on) so they can be compared fairly with the model’s output. Use the following command to download it:

!wget -O normalizer.py https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-lithuanian/raw/main/normalizer.py
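The exact rules live in the downloaded normalizer.py, but to give an idea of what such a script typically does, here is a minimal, hypothetical sketch (not the file’s actual contents) that lower-cases text, strips punctuation, and collapses whitespace while keeping Lithuanian diacritics:

```python
import re

def normalize_text(text: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace.
    Python 3's \\w is Unicode-aware, so Lithuanian letters
    (a, c, e, š, u, ž, ...) survive the punctuation filter."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation, keep letters/digits
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

print(normalize_text("Labas, rytas!"))  # labas rytas
```

Both predictions and references should pass through the same normalization before scoring, otherwise punctuation differences inflate the error rate.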

Making Predictions

Next, we will write a Python script to load our model and make predictions on audio samples. This is how you can proceed:

import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset
import numpy as np
import re
import string

def speech_file_to_array_fn(batch):
    # Load the audio file and resample it to the 16 kHz rate the model expects
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch['speech'] = speech_array
    return batch

def predict(batch):
    features = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        # Greedy decoding: pick the most likely token at each time step
        pred_ids = torch.argmax(logits, dim=-1)
    batch['predicted'] = processor.batch_decode(pred_ids)[0]
    return batch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained('m3hrdadfi/wav2vec2-large-xlsr-lithuanian')
model = Wav2Vec2ForCTC.from_pretrained('m3hrdadfi/wav2vec2-large-xlsr-lithuanian').to(device)

dataset = load_dataset('common_voice', 'lt', split='test[:1%]')
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)

for ref, pred in zip(result['sentence'], result['predicted']):
    print(f"Reference: {ref}")
    print(f"Predicted: {pred}")
    print("---")

Understanding the Code with an Analogy

Think of your script as a sophisticated chef preparing a new dish. Here’s how the components function together:

  • Input Ingredients (audio files): The chef gathers raw audio samples.
  • Recipe Instructions (functions): The chef follows specific steps (functions) to prepare the dish (predict results).
  • Cooking Process (model training): As the chef combines ingredients (data processing and model prediction), the final dish (output text) takes shape, ready to be served to the guests (users).

Evaluating the Model’s Performance

To assess how well your model performs, you might use metrics such as Word Error Rate (WER). Here’s the code to conduct an evaluation:

from datasets import load_metric

wer = load_metric("wer")
wer_score = wer.compute(predictions=result['predicted'], references=result['sentence'])
print(f"Test WER: {wer_score:.2f}")

Troubleshooting Tips

If you encounter any issues while running the scripts, try these troubleshooting steps:

  • Ensure that your audio files are properly formatted (16 kHz, mono).
  • Check if all required packages are installed correctly.
  • If you face memory errors, consider reducing the size of the dataset or using a smaller model.
  • For specific errors, search for solutions on platforms like Stack Overflow.
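For the first tip, you can inspect a file’s sample rate and channel count with Python’s standard library before reaching for heavier tools. This sketch assumes your inputs are uncompressed WAV files (MP3 or OGG inputs would need torchaudio or librosa instead):

```python
import wave

def check_wav(path: str, expected_rate: int = 16000):
    """Return (sample_rate, channels, ok) where ok means 16 kHz mono."""
    with wave.open(path, "rb") as f:
        rate, channels = f.getframerate(), f.getnchannels()
    return rate, channels, rate == expected_rate and channels == 1
```

If `ok` comes back False, resample the file (for example with the librosa.resample call shown earlier) rather than feeding it to the model as-is.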

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

In this blog, you learned how to fine-tune the Wav2Vec2 model for Lithuanian speech recognition. With your newly acquired skills, you can explore a wide range of applications, from transcribing calls to enhancing language learning tools.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
