How to Use Wav2Vec2-Large-XLSR-53 for Finnish Speech Recognition

Apr 2, 2021 | Educational


In this article, we will explore how to use a Wav2Vec2-Large-XLSR-53 checkpoint fine-tuned for Finnish automatic speech recognition (ASR). With a few lines of Python you can load the model, feed it 16 kHz audio, and get Finnish text back.

Prerequisites

Python installed on your machine.
The necessary libraries: PyTorch, torchaudio, transformers, and datasets.
Access to the Finnish datasets: Common Voice and CSS10 Finnish (the examples below only load the Common Voice test split).

Setting Up the Environment

To start using the model, install the required libraries with pip:

python -m pip install torch torchaudio transformers datasets
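
As a quick sanity check after installation, the sketch below reports which of the four packages cannot be found. It uses only the standard library; the helper name missing_packages is made up for this illustration and is not part of any of the libraries above.

```python
import importlib.util

# Packages installed by the pip command in this section.
REQUIRED = ["torch", "torchaudio", "transformers", "datasets"]

def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If anything is reported missing, rerun the pip command with the same Python interpreter you use to run the examples.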

Loading the Model

Now that your environment is set up, let’s load the model and the dataset.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Finnish test split of Common Voice.
test_dataset = load_dataset('common_voice', 'fi', split='test')
# The processor prepares audio and decodes predictions; the model produces the logits.
processor = Wav2Vec2Processor.from_pretrained('vasilis/wav2vec2-large-xlsr-53-finnish')
model = Wav2Vec2ForCTC.from_pretrained('vasilis/wav2vec2-large-xlsr-53-finnish')

Understanding the Code: An Analogy

Imagine you are a chef preparing a dish. The ingredients you select are crucial to the flavor of the final meal. In the code above:

load_dataset represents your selection of ingredients, where you choose a specific dataset containing Finnish audio.
Wav2Vec2Processor is akin to your cooking utensils: it turns raw audio into model inputs and later turns predicted token IDs back into text.
Wav2Vec2ForCTC is like the cooking method itself: it transforms the prepared audio into per-frame character predictions, just as cooking techniques transform raw ingredients into a finished dish.

Processing Audio Input

Before feeding audio to the model, ensure it is sampled at 16 kHz. Here’s how to load each clip and resample it:

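import torchaudio

# Common Voice clips are typically distributed at 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def speech_file_to_array_fn(batch):
    # Load the clip, resample it, and store it as a 1-D NumPy array.
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)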

Making Predictions

Once your data is preprocessed, it’s time to make predictions!

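import torch

# Convert the first two clips to padded model inputs.
inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Pick the most likely token per frame, then decode the IDs into text.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))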
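
To see what happens between the per-frame argmax and the final text, here is a minimal pure-Python sketch of greedy CTC decoding: collapse consecutive repeats, then drop the blank token. It illustrates the idea only and is not the library's implementation; the toy vocabulary, the blank ID of 0, and the function name ctc_greedy_decode are all assumptions made for this example.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    # Step 1: merge runs of identical IDs (one character spans many frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only separates characters.
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# Hypothetical toy vocabulary; a real tokenizer has one entry per character.
vocab = {0: "<blank>", 1: "t", 2: "u", 3: "l", 4: "i"}

# Per-frame argmax output for the Finnish word "tuuli" (wind).
# The blank between the two runs of "u" keeps the double letter intact.
frames = [1, 1, 2, 2, 0, 2, 3, 3, 4]

decoded = ctc_greedy_decode(frames, vocab)
print(decoded)  # tuuli
```

The blank matters a lot for Finnish, where double letters are common: without it, the two runs of "u" would merge into a single character.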

Evaluating the Model

Once the model produces transcriptions, you need to measure how good they are. The snippet below computes the Word Error Rate (WER) on the Common Voice test split; the Character Error Rate (CER) is defined the same way at the character level.

import torch
from datasets import load_metric

# load_metric is the API this article was written against; newer releases of
# datasets moved metrics to the separate evaluate package.
wer = load_metric('wer')

# The inputs below are sent to the GPU, so the model must be moved there too.
model.to('cuda')

def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to('cuda'), attention_mask=inputs.attention_mask.to('cuda')).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch['pred_strings'] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))
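
To make the two metrics concrete, here is a minimal pure-Python sketch of WER and CER as edit distance divided by reference length, summed over the corpus. It is an illustration and not the code behind the wer metric loaded above. The normalize helper, which lowercases and strips punctuation, is an assumption: whether and how you normalize references before scoring depends on how the model was trained.

```python
import re

def normalize(text):
    """Lowercase, strip common punctuation, and collapse whitespace (an assumed convention)."""
    text = re.sub(r'[,.?!;:"-]', "", text.lower())
    return " ".join(text.split())

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def error_rate(references, predictions, unit="word"):
    """Total edit distance divided by total reference length, over all pairs."""
    errors = total = 0
    for ref, hyp in zip(references, predictions):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    if total == 0:
        raise ValueError("references contain no units to score")
    return errors / total

# Made-up example: the prediction drops one "l" from "kaikille".
references = [normalize("Hyvää huomenta, kaikille!")]
predictions = ["hyvää huomenta kaikile"]

wer_score = error_rate(references, predictions, unit="word")
cer_score = error_rate(references, predictions, unit="char")
print("WER: {:.2f}".format(100 * wer_score))  # WER: 33.33
print("CER: {:.2f}".format(100 * cer_score))  # CER: 4.35
```

One wrong letter costs a full word in WER but only one character in CER. Finnish words are often long, so it helps to look at both numbers.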

Troubleshooting

If you encounter any issues, here are some troubleshooting ideas:

Ensure that you have the correct version of Python installed.
Check if all required libraries are properly installed and updated.
Make sure the audio passed to the model is sampled at 16 kHz; resample it first if it is not.
If you run out of memory or evaluation is slow, consider reducing the batch size (batch_size in the map call).
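
For the sampling-rate check, a small standard-library sketch like the one below can read the rate from a WAV header before any model code runs. It writes a short test tone to a temporary file purely so the example runs without external data. The helper names are made up for this illustration, and the wave module only reads uncompressed PCM WAV files, so it will not work directly on MP3 clips.

```python
import math
import os
import struct
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects

def wav_sample_rate(path):
    """Read the sampling rate from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(rate, target=TARGET_RATE):
    """True when the audio must be resampled before it reaches the model."""
    return rate != target

def write_test_tone(path, sample_rate, seconds=0.1, freq=440.0):
    """Write a short mono 16-bit sine tone, used here only as demo input."""
    n_samples = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(n_samples)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)

with tempfile.TemporaryDirectory() as tmp_dir:
    tone_path = os.path.join(tmp_dir, "tone.wav")
    write_test_tone(tone_path, sample_rate=48_000)
    rate = wav_sample_rate(tone_path)

print("Sample rate:", rate, "| needs resampling:", needs_resampling(rate))
```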


By following these steps, you should now be able to load the Wav2Vec2-Large-XLSR-53 model for Finnish, preprocess audio to 16 kHz, transcribe it, and measure the results with WER.
