How to Fine-Tune Wav2Vec2 for Greek Speech Recognition

Jul 7, 2021 | Educational

This guide walks through working with a Wav2Vec2 model fine-tuned for Greek speech recognition using the Common Voice and CSS10 datasets: loading the checkpoint, transcribing audio clips, and measuring accuracy on the test split. Each step is explained so that both newcomers and experienced developers can follow along.

Setting Up Your Environment

Before diving into the code, ensure you have a proper environment set up with the necessary libraries:

torch: For manipulating tensors
torchaudio: To work with audio files
transformers: For using pre-trained models
datasets: To load the Common Voice dataset
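
A quick way to confirm that the four packages are importable before running anything heavier is sketched below. It uses only the standard library; the helper name `find_missing` is an illustrative choice and not part of any of these libraries.

```python
from importlib.util import find_spec

# Import names of the libraries listed above.
REQUIRED = ["torch", "torchaudio", "transformers", "datasets"]


def find_missing(packages):
    """Return the packages that cannot be found on the current Python path."""
    return [name for name in packages if find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

This only checks that each package can be located; it does not verify that the installed versions are compatible with one another.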

Understanding the Code

The code is quite cohesive, resembling the workings of a skilled chef preparing a delightful Greek dish. Let’s break it down:

python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset('common_voice', 'el', split='test')
processor = Wav2Vec2Processor.from_pretrained('PereLluis13/wav2vec2-large-xlsr-53-greek')
model = Wav2Vec2ForCTC.from_pretrained('PereLluis13/wav2vec2-large-xlsr-53-greek')

# Resampling: this script assumes the clips are stored at 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing function
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)

print('Prediction:', processor.batch_decode(predicted_ids))
print('Reference:', test_dataset['sentence'][:2])

In this script, we load necessary libraries and datasets just like a chef gathering the best ingredients for a special recipe. We preprocess audio files as arrays—like chopping vegetables to make flavors mix better. Finally, we feed the model the prepared inputs and get predictions, akin to tasting a dish to check if the seasoning is just right.
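
The last step, `torch.argmax` followed by `processor.batch_decode`, is a greedy CTC decode: pick the most likely token for every audio frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary and the name `ctc_greedy_decode` are hypothetical, and it assumes token id 0 is the blank (in Wav2Vec2 tokenizers the padding token typically plays that role); the real processor also handles word delimiters and special tokens.

```python
def ctc_greedy_decode(ids, vocab, blank_id=0):
    """Collapse consecutive repeated ids, skip blanks, and join the characters."""
    chars = []
    previous = None
    for token_id in ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; the real one ships with the processor.
toy_vocab = {0: "<pad>", 1: "α", 2: "λ", 3: "ά"}

# One predicted id per audio frame. The blank (0) between the two runs of 2
# is what lets the doubled letter survive the collapse step.
frame_ids = [3, 3, 2, 0, 2, 2, 1]

print(ctc_greedy_decode(frame_ids, toy_vocab))  # prints "άλλα"
```

Without the blank between them, the two runs of id 2 would merge into a single letter, which is why CTC models need a blank token at all.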

Evaluating the Model

To assess the model on the Greek Common Voice test split, run the evaluation script below:

python
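import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

wer = load_metric('wer')

test_dataset = load_dataset('common_voice', 'el', split='test')
processor = Wav2Vec2Processor.from_pretrained('PereLluis13/wav2vec2-large-xlsr-53-greek')
model = Wav2Vec2ForCTC.from_pretrained('PereLluis13/wav2vec2-large-xlsr-53-greek')
model.to('cuda')

# Punctuation to strip from the references; the hyphen is escaped so it is not read as a range
chars_to_ignore_regex = r'[.,?!\-;:“%‘”]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Clean each reference sentence and load its audio as a 16 kHz array
def speech_file_to_array_fn(batch):
    # Lowercasing assumes the model's vocabulary is lowercase
    batch['sentence'] = re.sub(chars_to_ignore_regex, '', batch['sentence']).lower()
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)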

def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to('cuda'), attention_mask=inputs.attention_mask.to('cuda')).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch['pred_strings'] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print('WER: {:.2f}'.format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))

The evaluation script measures how closely the model's transcriptions match the reference text. It reports the Word Error Rate (WER): the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of words in the reference. Lower is better, and a WER of 0% means every transcription matched its reference exactly.
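
To make the metric concrete, here is a minimal sketch of WER for a single sentence pair, computed as a word-level edit distance. It is an illustration only and is not the implementation behind the `wer` metric used in the script; the function name `word_error_rate` and the sample sentences are made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "καλημέρα σας κύριε"
hypothesis = "καλημέρα κύριε"  # one word was dropped
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")  # WER: 33.33%
```

Because insertions count as errors, WER can exceed 100% when the prediction contains many extra words.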

Troubleshooting Common Issues

If you encounter any issues while implementing the above scripts, consider the following troubleshooting ideas:

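Ensure the audio is at 16 kHz when it reaches the model. The scripts above assume 48 kHz source files and resample them with a fixed ratio; if your files use a different rate, the resampled audio will be wrong and recognition quality will drop.
Check that your installed versions of transformers, torchaudio, and datasets are compatible with the code. An outdated library can cause errors, and so can a newer one: recent releases of datasets have moved load_metric to the separate evaluate library, so an import error there usually points to a version mismatch.
Verify that the dataset loaded correctly, for example by confirming the language code is 'el' and the split is 'test'. Referencing the wrong dataset or split may yield unexpected results.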
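
One way to guard against the sampling-rate problem is to choose the resampler from the rate that the audio loader actually reports instead of hard-coding 48 kHz. The sketch below shows that pattern with a small cache. The names `make_resampler_lookup` and `fake_builder` are hypothetical, and the stand-in builder exists only so the example runs without torchaudio; in a real script you would pass `torchaudio.transforms.Resample`, which takes the original and target frequencies.

```python
def make_resampler_lookup(build_resampler, target_rate=16_000):
    """Return a function that maps a source rate to a cached resampler."""
    cache = {}

    def get_resampler(source_rate):
        if source_rate == target_rate:
            return None  # audio is already at the target rate
        if source_rate not in cache:
            cache[source_rate] = build_resampler(source_rate, target_rate)
        return cache[source_rate]

    return get_resampler


# Stand-in builder (an assumption for this sketch) so the code runs anywhere.
def fake_builder(source_rate, target_rate):
    return f"Resample({source_rate} -> {target_rate})"


get_resampler = make_resampler_lookup(fake_builder)
print(get_resampler(48_000))  # Resample(48000 -> 16000)
print(get_resampler(16_000))  # None
```

Inside a preprocessing function, you would call `get_resampler(sampling_rate)` with the rate returned by `torchaudio.load` and skip resampling when the result is `None`.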

Summary

In this article, we’ve covered the steps necessary to fine-tune the Wav2Vec2 model for Greek speech recognition. From preparing your environment and understanding the code, to evaluating model performance, we hope you feel empowered to take on this project.
