How to Fine-Tune Wav2Vec2 for Automatic Speech Recognition in Breton

Jul 9, 2021 | Educational


In this post, we'll walk you through working with a Wav2Vec2 model fine-tuned for automatic speech recognition (ASR) in the Breton language: loading the fine-tuned checkpoint, transcribing clips from Common Voice, and measuring the result with the word error rate. Grab your laptop, and let's dive in!

Understanding the Wav2Vec2 Model

Imagine Wav2Vec2 as a highly trained interpreter at a busy airport who can follow many accents and dialects and write down what each speaker says. The model turns raw audio into text. The checkpoint used in this post, mrm8488/wav2vec2-large-xlsr-53-breton, is, as its name indicates, built on XLSR-53, a Wav2Vec2 model pretrained on speech from 53 languages, and it has been fine-tuned specifically for the Breton language.

Requirements

Python 3.6+
PyTorch
Torchaudio
Transformers library
Datasets library
jiwer (used by the WER metric)
The Common Voice dataset for Breton

Getting Started

Before we start coding, make sure to install all necessary libraries. You can do this via pip (jiwer is included because the WER metric used in the evaluation step depends on it):

pip install torch torchaudio transformers datasets jiwer
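
To confirm that the install worked, a small check like the following can help. It is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata), and the package list, including jiwer, is an assumption about what the rest of this post needs.

```python
from importlib import metadata

# Distribution names as published on PyPI. jiwer is listed on the assumption
# that the WER metric used later depends on it.
PACKAGES = ["torch", "torchaudio", "transformers", "datasets", "jiwer"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be added with pip before moving on.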

Fine-Tuning the Model

Now, let's load the fine-tuned model and preprocess our dataset. The key steps are loading a slice of the Common Voice test split, resampling each clip from 48 kHz to the 16 kHz the model expects, and passing the audio through the processor to obtain padded model inputs. You can think of this as getting your interpreter ready with the right tools before the first customer arrives.
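
To make the resampling step concrete: going from 48 kHz to 16 kHz means keeping one output sample for every three input samples. The sketch below does this by averaging each block of three samples. It is a deliberately crude illustration in plain Python, not what torchaudio does internally (its Resample transform applies a proper sinc-based low-pass filter), and the helper name downsample_by_averaging is made up for this example.

```python
def downsample_by_averaging(samples, factor):
    """Reduce the sampling rate by an integer factor.

    Each block of `factor` consecutive samples is replaced by its mean.
    A trailing partial block is dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


source_rate, target_rate = 48_000, 16_000
factor = source_rate // target_rate  # 3 input samples per output sample

one_second_of_silence = [0.0] * source_rate
resampled = downsample_by_averaging(one_second_of_silence, factor)
print(len(resampled))  # one second of audio is now 16000 samples long
```

In the real pipeline, leave this job to torchaudio; the point here is only that one second of audio shrinks from 48,000 values to 16,000.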

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load test dataset
test_dataset = load_dataset("common_voice", "br", split="test[:2%]")
# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
model = Wav2Vec2ForCTC.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
# Resample to 16 kHz (the source clips are assumed to be sampled at 48 kHz)
resampler = torchaudio.transforms.Resample(48000, 16000)

# Function to convert speech file to array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
# Process the input
inputs = processor(test_dataset["speech"][:2], sampling_rate=16000, return_tensors="pt", padding=True)

# Model Prediction
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
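
The last two steps of the code above, torch.argmax followed by processor.batch_decode, amount to greedy CTC decoding: pick the most likely token for every audio frame, collapse consecutive repeats, drop the blank token, and turn the word delimiter into a space. The following plain-Python sketch shows that logic on a toy example. The vocabulary and frame ids are invented for illustration; the "<pad>" blank and "|" word delimiter follow the defaults of the Wav2Vec2 CTC tokenizer, but the real processor handles more cases than this.

```python
def ctc_greedy_decode(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    tokens = []
    previous = None
    for token_id in ids:
        if token_id != previous:  # keep only the first frame of each run
            tokens.append(id_to_token[token_id])
        previous = token_id
    text = "".join(t for t in tokens if t != blank)
    # The delimiter marks word boundaries; split() also tidies repeated spaces
    return " ".join(text.replace(word_delimiter, " ").split())


# Hypothetical vocabulary, only for this example
vocab = {0: "<pad>", 1: "|", 2: "d", 3: "e", 4: "m", 5: "a", 6: "t"}

# One id per audio frame, as produced by argmax over the logits
frame_ids = [2, 2, 0, 3, 3, 4, 0, 5, 5, 6]
print(ctc_greedy_decode(frame_ids, vocab))  # demat
```

The blank token is what lets the model emit a genuine double letter: the ids [2, 2] decode to a single "d", while [2, 0, 2] decodes to "dd".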

Evaluating the Model

Once you have your predictions, it's time to evaluate the model. To measure success, we use the Word Error Rate (WER): the number of word substitutions, deletions, and insertions needed to turn the predictions into the reference transcripts, divided by the number of words in the references. Lower is better.
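
To see what the metric measures, here is a minimal plain-Python sketch of WER based on a word-level edit distance. It is an illustration under simple assumptions (whitespace tokenization, errors pooled over all sentences) and not the implementation used by the metric loaded in the evaluation code; the function names and toy sentences are made up for this example.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word is missing
                current[j - 1] + 1,      # insertion: extra predicted word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words (needs a non-empty reference)."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total


# Toy strings: one substituted word out of three reference words
print(word_error_rate(["demat d'an holl"], ["demat an holl"]))
```

Because insertions count as errors too, WER can exceed 100% when the prediction contains many extra words.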

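# Load dataset and metric
import re
from datasets import load_dataset, load_metric

# load_metric ships with older releases of datasets and needs the jiwer package
wer = load_metric("wer")
test_dataset = load_dataset("common_voice", "br", split="test")

# Load the processor and model, then move the model to the GPU
processor = Wav2Vec2Processor.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
model = Wav2Vec2ForCTC.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
model.to("cuda")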

# Clean up input
# The hyphen is escaped so that "!-;" is not read as a character range
chars_to_ignore_regex = r'[,?.!\-;:“%‘”]'
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER:", "{:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
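
The clean-up step in the evaluation code deserves a closer look, because a small detail in the regular expression changes what gets removed. Inside a character class, a hyphen between two characters defines a range, so "!-;" matches everything from "!" to ";" in code-point order, which includes digits and the ASCII apostrophe. The sketch below assumes the intent is to strip only the listed punctuation marks; the normalize helper and the sample strings are made up for this example.

```python
import re

# Punctuation to strip. The hyphen is escaped so it is matched literally.
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:“%‘”]')


def normalize(sentence):
    """Remove the listed punctuation and lowercase the sentence."""
    return CHARS_TO_IGNORE.sub("", sentence).lower()


print(normalize("C'hoant, 3 gwech!"))            # apostrophe and digit are kept
print(re.sub(r'[!-;]', "", "C'hoant, 3 gwech!"))  # range form also removes ' and 3
```

Since the references are normalized before WER is computed, removing more characters than intended can shift the reported score, so it is worth checking the pattern against a few real sentences.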

Test Result

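The computed WER for our model stands at 46.49%, meaning the number of word-level errors amounts to close to half the number of words in the reference transcripts. That is a decent start, though there's plenty of room for improvement!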

Troubleshooting

If you encounter issues during installation or while running the model, here are some common troubleshooting steps:

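Ensure that all dependencies are installed and up to date. The WER metric also needs the jiwer package.
Check the sampling rate of your audio. The model expects 16 kHz input, and the code above assumes 48 kHz source clips when it resamples, so audio recorded at another rate needs a resampler built for that rate.
Verify that a GPU is available. The evaluation code moves the model and inputs to "cuda" and will fail on a machine without a CUDA device; switch to "cpu" there and expect slower processing.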
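
For the GPU point in the list above, one option is to choose the device at runtime instead of hard-coding it. This is a minimal sketch; the pick_device helper is a name introduced here, and it falls back to "cpu" whenever PyTorch cannot see a CUDA device.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; report "cpu" so the caller gets a clear value
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Running on:", device)
```

In the evaluation code, model.to(device) and inputs.input_values.to(device) would then take the place of the hard-coded "cuda" strings.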


