Fine-Tuning Wav2Vec2 for Ukrainian Speech Recognition: A Comprehensive Guide

In an increasingly digital world, the ability to automatically understand and transcribe spoken language is becoming essential. Today, we will dive into how to fine-tune and use the Wav2Vec2 model for Ukrainian speech recognition with the Common Voice dataset. Whether you are a seasoned developer or a novice, this guide walks you through the process step by step.

Getting Started

Before we jump in, make sure you have the required libraries installed in your Python environment (a quick import check follows the list):

  • torch
  • torchaudio
  • datasets
  • transformers
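
If any of these are missing, they can usually be installed with pip (for example: pip install torch torchaudio datasets transformers). As a quick sanity check that everything imports, here is a minimal sketch:

python
# Verify that the four required libraries are importable and print their versions
import torch
import torchaudio
import datasets
import transformers

print(torch.__version__, torchaudio.__version__, datasets.__version__, transformers.__version__)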

Understanding the Code: An Analogy

Think of fine-tuning the Wav2Vec2 model as training a new chef (the AI model) to cook a special dish (Ukrainian speech recognition) from specific ingredients (the Common Voice dataset). Here is how it works, with a condensed code outline right after the list:

  • Gather Ingredients: We first gather our ingredients by loading the Common Voice dataset suitable for the Ukrainian language.
  • Prepare the Kitchen: Next, we set up our kitchen by defining the tools (models and processors) we will need.
  • Mixing the Recipe: We then process our audio samples, preparing them for cooking.
  • The Cooking Process: Finally, we feed these ingredients into our chef (the model) to produce a delicious final dish (the transcribed text).
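
Mapping the analogy onto the code we will use, the outline looks roughly like this; it is only a sketch of the four steps, and the full, runnable version appears in the next section:

python
# 1. Gather ingredients: load the Ukrainian split of the Common Voice dataset
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

dataset = load_dataset('common_voice', 'uk', split='test[:2%]')

# 2. Prepare the kitchen: load the processor (feature extractor + tokenizer) and the model
processor = Wav2Vec2Processor.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model = Wav2Vec2ForCTC.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')

# 3. Mixing the recipe: resample each clip to 16 kHz and turn it into model inputs
# 4. The cooking process: run the model and decode the CTC logits into text
#    (both steps are shown in full below)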

Using the Fine-Tuned Model

Once everything is set up, you can run inference with the fine-tuned model directly, without needing an additional language model:

python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset('common_voice', 'uk', split='test[:2%]')
processor = Wav2Vec2Processor.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model = Wav2Vec2ForCTC.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
# Common Voice clips are 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    print("Prediction:", processor.batch_decode(predicted_ids))
    print("Reference:", test_dataset['sentence'][:2])

Evaluating Your Model

To gauge how well your model performs, you can evaluate it using the following steps:

python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset('common_voice', 'uk', split='test')
wer = load_metric('wer')

processor = Wav2Vec2Processor.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model = Wav2Vec2ForCTC.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model.to('cuda')
# Punctuation stripped from the reference sentences before scoring
chars_to_ignore_regex = '[,?.!-;:“%‘”]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
def speech_file_to_array_fn(batch):
    batch['sentence'] = re.sub(chars_to_ignore_regex, '', batch['sentence']).lower()
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to('cuda'), attention_mask=inputs.attention_mask.to('cuda')).logits
        pred_ids = torch.argmax(logits, dim=-1)
        batch['pred_strings'] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER:", "{:2f}".format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))

Test Results

The model's quality is reported as a Word Error Rate (WER): the percentage of words substituted, deleted, or inserted relative to the reference transcription (lower is better). On the Common Voice Ukrainian test split, the evaluation script above reports:

Test Result: 41.82 %
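
To get a feel for what that number means, here is a toy illustration of the same wer metric used in the evaluation script; the Ukrainian sentences are invented purely for the example:

python
from datasets import load_metric

wer = load_metric('wer')

# One substituted word out of four reference words gives a WER of 25 %
predictions = ['привіт як твої справи']
references = ['привіт як ваші справи']
print("{:.2f}%".format(100 * wer.compute(predictions=predictions, references=references)))  # 25.00%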

Troubleshooting Tips

If you encounter issues while setting up your environment or running the code, consider the following troubleshooting ideas:

  • Ensure that your audio input is sampled at 16 kHz, as the model requires it (see the check after this list).
  • Verify that all necessary libraries are correctly installed and updated to the latest versions.
  • Check the paths to your audio files to ensure they are correct.
  • If errors persist, restarting your Python environment might resolve temporary conflicts.
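
As a minimal sketch of the first tip, assuming audio.wav stands in for one of your own recordings, you can check and, if needed, fix the sampling rate like this:

python
import torchaudio

# 'audio.wav' is a placeholder path for one of your own recordings
waveform, sampling_rate = torchaudio.load('audio.wav')
print("Sampling rate:", sampling_rate)

# Resample to the 16 kHz the model expects if the clip uses a different rate
if sampling_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sampling_rate, 16_000)(waveform)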

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
