Using Wav2Vec2 for Automatic Speech Recognition in Finnish

Mar 30, 2021 | Educational

This article guides you through using the Wav2Vec2 Large XLSR-53 model for automatic speech recognition (ASR) in Finnish. Fine-tuned on datasets such as Common Voice and CSS10 Finnish, this transformer model can produce good transcriptions on its own, without a separate language model.

Understanding the Model Setup

Imagine a friend who has listened to speech in dozens of languages and built up a good ear for speech sounds in general. That is roughly what the pretrained Wav2Vec2 XLSR-53 model is: it learns speech representations from unlabeled audio in 53 languages. Fine-tuning then teaches it to map Finnish audio to Finnish text, much as a child becomes proficient in a language through repetition and exposure to its words and sounds.

How to Use the Model

Let’s walk through the steps of using the model on its own, without an external language model for decoding.

1. Install Required Libraries

First, ensure that you have the necessary Python packages:

pip install torch torchaudio transformers datasets jiwer

2. Load the Model and Dataset

You will need to load the model and the dataset. Here’s how you can do it:


import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset('common_voice', 'fi', split='test[:2%]')
processor = Wav2Vec2Processor.from_pretrained('vasilis/wav2vec2-large-xlsr-53-finnish')
model = Wav2Vec2ForCTC.from_pretrained('vasilis/wav2vec2-large-xlsr-53-finnish')

3. Preprocess the Audio Files

Next, preprocess your audio files. Just as a chef prepares the ingredients before cooking, you need to bring the audio into the format the model expects, which is a waveform sampled at 16 kHz:


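# Common Voice clips are recorded at 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    # Load the clip from disk and store the resampled waveform as a NumPy array
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)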
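To make the resampling step less of a black box, here is a minimal pure-Python sketch of what changing the sampling rate means. It uses plain linear interpolation, which is a crude stand-in chosen for readability and not what torchaudio does internally; the function name `resample_linear` is hypothetical. Without a low-pass filter this approach can introduce aliasing, so treat it as an illustration only.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Crude resampler: linearly interpolate between neighbouring samples.

    Illustration only. It applies no low-pass filter, so downsampling
    real audio this way can cause aliasing.
    """
    if not samples:
        return []
    n_out = int(len(samples) * target_rate / orig_rate)
    step = orig_rate / target_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of silence at 48 kHz becomes one second at 16 kHz
one_second_48k = [0.0] * 48_000
print(len(resample_linear(one_second_48k, 48_000, 16_000)))  # 16000
```

Going from 48 kHz to 16 kHz keeps one output sample for every three input samples, which is why the clip length in samples shrinks by a factor of three while its duration stays the same.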

4. Make Predictions

With everything prepared, it’s time to make predictions:


inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset['sentence'][:2])
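The argmax followed by `batch_decode` amounts to greedy CTC decoding: pick the most likely token for every audio frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea with a toy vocabulary. The function name, the vocabulary, and the choice of ID 0 as the blank are assumptions made for illustration; the real mapping ships with the processor.

```python
from itertools import groupby


def ctc_greedy_decode(token_ids, id_to_char, blank_id=0):
    """Collapse repeated IDs, then drop the blank token."""
    collapsed = [token_id for token_id, _ in groupby(token_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; ID 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "m", 2: "o", 3: "i", 4: "a"}

# One predicted ID per audio frame
frame_ids = [1, 1, 0, 2, 2, 2, 0, 3, 0, 0]
print(ctc_greedy_decode(frame_ids, vocab))  # moi

# A blank between two identical IDs keeps a genuine double letter
print(ctc_greedy_decode([1, 4, 4, 0, 4], vocab))  # maa
```

The blank token matters for Finnish in particular, where double letters are common: without a blank between them, two identical neighbouring predictions would be merged into one character.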

Evaluating the Model’s Performance

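To measure how well the model recognizes Finnish, compute the word error rate (WER) and the character error rate (CER) on the test set. The CER reuses the WER metric by separating the characters of each sentence with spaces, so that every character counts as a word:


from datasets import load_metric

def evaluate(batch):  # transcribe one batch; map() stores the result as a new 'pred_strings' column
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    return {'pred_strings': processor.batch_decode(torch.argmax(logits, dim=-1))}

wer = load_metric('wer')
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER:", 100 * wer.compute(predictions=result['pred_strings'], references=result['sentence']))
print("CER:", 100 * wer.compute(predictions=[" ".join(list(entry)) for entry in result['pred_strings']],
                  references=[" ".join(list(entry)) for entry in result['sentence']]))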
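If you want to see what these two numbers measure, the following is a minimal sketch of an error rate built on edit distance: the number of substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the reference length. The helper names `edit_distance` and `error_rate` are hypothetical, and this is not the implementation behind the `wer` metric. The example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, predictions, unit="word"):
    """Total edit distance divided by total reference length, over all pairs."""
    def split(text):
        # For characters, spaces are dropped so only letters are compared
        return text.split() if unit == "word" else list(text.replace(" ", ""))

    errors = total = 0
    for ref, hyp in zip(references, predictions):
        ref_units, hyp_units = split(ref), split(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return errors / total


refs = ["hyvää huomenta kaikille"]
preds = ["hyvää huomenta kaikile"]  # one missing letter
print("WER:", 100 * error_rate(refs, preds, unit="word"))
print("CER:", 100 * error_rate(refs, preds, unit="char"))
```

A single missing letter counts as one wrong word out of three for the WER, but only one wrong character out of twenty-one for the CER, which is why the CER is usually the lower of the two figures.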

Troubleshooting

Occasionally, you may encounter issues. Below are some common troubleshooting steps:

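If transcriptions are poor, check the audio first: the model expects input sampled at 16 kHz, so resample recordings made at any other rate. For noisy recordings, consider using a better microphone or recording in a quieter environment.
If you adapt the code to a different checkpoint or dataset, make sure the model ID and file paths match your setup.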
Check that all necessary libraries are installed and up to date.
If you encounter CUDA errors, make sure that your GPU is compatible and that the appropriate drivers are installed.
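For your own recordings, a quick way to rule out a sampling-rate mismatch is to read the rate from the file header before running the model. The sketch below uses Python's standard `wave` module, so it only works for uncompressed WAV files; the helper names and the demo file are hypothetical.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=16_000):
    """True if the file's sampling rate differs from what the model expects."""
    return wav_sample_rate(path) != expected_rate


# Demo: write a short silent 48 kHz mono clip, then check it
demo_path = os.path.join(tempfile.mkdtemp(), "demo_48k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)              # 16-bit samples
    wav_file.setframerate(48_000)
    wav_file.writeframes(b"\x00\x00" * 480)

print(wav_sample_rate(demo_path))   # 48000
print(needs_resampling(demo_path))  # True
```

If the check returns True, resample the recording to 16 kHz before passing it to the processor.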


Conclusion

In this tutorial, we explored how to set up and use the Wav2Vec2 Large XLSR-53 model for automatic speech recognition in Finnish. By understanding the underlying principles of ASR and experimenting with various datasets, you can harness the potential of this model for your language projects.
