How to Fine-Tune Wav2Vec2 for Automatic Speech Recognition in Swedish

Apr 13, 2021 | Educational


In this guide, we will work with Wav2Vec2-Large-XLSR-53-Swedish, a Wav2Vec2 model fine-tuned for Swedish on the Common Voice dataset. We will load test data, preprocess the audio, run the model for automatic speech recognition (ASR), evaluate the results, and look briefly at how training data is filtered for fine-tuning. Let’s go step by step.

Prerequisites

Before getting started, ensure you have the following:

Python installed (preferably Python 3.6 or newer).
The torch and torchaudio libraries.
The datasets library.
The transformers library.
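
Before running the snippets below, it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration and is not part of any of the libraries above.

```python
import importlib.util

# Packages the rest of this guide imports
REQUIRED = ["torch", "torchaudio", "datasets", "transformers"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

Any name printed as missing needs to be installed before continuing.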

Loading the Dataset

First, we will load the Common Voice dataset:

from datasets import load_dataset
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

This snippet loads the first 2% of the Swedish (sv-SE) test split of Common Voice, which is enough for a quick test.

Preprocessing Audio Input

Next, we will ensure that our audio data is correctly formatted and preprocessed:

import torchaudio

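def speech_file_to_array_fn(batch):
    # Load the clip; torchaudio returns the waveform and its native sampling rate
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    # Resample from the file's own rate (48 kHz for Common Voice) to 16 kHz
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch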

test_dataset = test_dataset.map(speech_file_to_array_fn)

This function loads each audio file, resamples it to 16 kHz (the rate the model expects), and stores the resulting array under the "speech" key for the next step.
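
To get a feel for what resampling does to the data, the arithmetic can be sketched as follows. This is only an illustration of the sample-count ratio, not how torchaudio performs the resampling itself; the helper resampled_length is a hypothetical name, and the exact output length of a real resampler may differ by a sample depending on rounding.

```python
def resampled_length(num_samples: int, orig_rate: int, new_rate: int) -> int:
    """Approximate sample count after resampling, rounded up (integer arithmetic)."""
    return -(-num_samples * new_rate // orig_rate)

# A 3-second clip recorded at 48 kHz
three_seconds_at_48k = 3 * 48_000
print(resampled_length(three_seconds_at_48k, 48_000, 16_000))  # 48000
```

Going from 48 kHz to 16 kHz keeps the duration the same but leaves one third as many samples.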

Making Predictions

Now it’s time to make predictions using the fine-tuned model:

import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-swedish")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-swedish")

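# Preprocess the first two clips into padded model inputs
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Pick the most likely token at each time step, then decode the ids to text
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))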

In this part, we take the first two preprocessed clips, run them through the model, pick the most likely token at each time step, and decode those token ids into text before printing the result.
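
The decoding step relies on greedy CTC decoding: repeated ids are merged and the blank token is dropped. The sketch below illustrates that idea in plain Python. The vocabulary, the frame ids, and the function name ctc_greedy_collapse are invented for this example; the real processor uses the model's own vocabulary and handles more details, such as word delimiters.

```python
from itertools import groupby

def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive duplicate ids, then drop the blank id."""
    collapsed = [key for key, _ in groupby(ids)]
    return [i for i in collapsed if i != blank_id]

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "h", 2: "e", 3: "j"}

# One argmax id per audio frame, as torch.argmax(logits, dim=-1) would give
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3, 0]

tokens = ctc_greedy_collapse(frame_ids)
print("".join(vocab[i] for i in tokens))  # hej
```

Note that a blank between two identical ids keeps both of them, which is how doubled letters survive the merge.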

Evaluation Metrics

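To evaluate how well our model is performing, we compute the WER (Word Error Rate) on the test results. The CER (Character Error Rate) follows the same idea at the character level. Note that newer releases of the datasets library have moved metrics to the separate evaluate library, where the equivalent call is evaluate.load("wer").

from datasets import load_metric

# The "wer" metric depends on the jiwer package
wer_metric = load_metric("wer")

# Assumes 'result' is the evaluated test set, with model outputs in "pred_strings"
# and reference transcripts in "sentence"
wer = wer_metric.compute(predictions=result["pred_strings"], references=result["sentence"])
# compute() returns a fraction, so scale it to a percentage
print(f"WER: {100 * wer:.2f} %")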

Lower error rates mean better performance. The reported test WER for this model is about 14.70%.
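
To make the two error rates concrete, here is a small, self-contained sketch that computes them for a single sentence pair using the edit distance. It is an illustration, not the implementation behind the metric library, and the function names and example sentences are made up for this purpose.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

reference = "jag tycker om kaffe"
hypothesis = "jag tycker om kafe"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f} %")  # WER: 25.00 %
print(f"CER: {100 * char_error_rate(reference, hypothesis):.2f} %")  # CER: 5.26 %
```

One misspelled word out of four gives a WER of 25%, while the CER is much lower because only one of the 19 reference characters is wrong.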

Training the Model

To enhance the model accuracy, we fine-tune it using the training dataset:

# Assumes 'dataset' is a pandas DataFrame with a "transcript" column
# Keep transcripts with more than 5 and fewer than 20 words
mask = [(5 < len(x.split()) < 20) for x in dataset["transcript"].tolist()]
# Fine-tune your model

This mask keeps only samples whose transcript has more than 5 and fewer than 20 words, which removes very short and very long utterances from the training data. Training on the filtered data then adapts the model to Swedish speech.
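
The effect of the word-count mask can be seen on a handful of sentences. The sketch below uses a plain Python list instead of a DataFrame, and the example transcripts are invented for illustration.

```python
# Hypothetical transcripts of different lengths
transcripts = [
    "hej då",                                          # 2 words: too short
    "jag skulle vilja boka ett bord för två personer",  # 9 words: kept
    " ".join(["ord"] * 25),                            # 25 words: too long
    "det är en fin dag",                               # 5 words: excluded, bound is strict
]

# Same condition as above: more than 5 and fewer than 20 words
mask = [(5 < len(t.split()) < 20) for t in transcripts]
kept = [t for t, keep in zip(transcripts, mask) if keep]

print(mask)
print(kept)
```

Because both comparisons are strict, a transcript with exactly 5 or exactly 20 words is filtered out.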

Troubleshooting

While implementing this process, you may run into a few problems. Here are some tips:

If you run into errors concerning model loading, double-check the model ID and ensure it exists on the Hugging Face platform.
If audio preprocessing issues arise, ensure you're pointing to valid audio file paths in your dataset.
For compatibility issues with sample rates, ensure your audio input is consistently 16kHz.
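
A quick way to check a file's sample rate is sketched below with Python's standard wave module. This only works for WAV files and is an illustration under that assumption; for other formats such as MP3, the sampling rate returned by torchaudio.load can be checked instead. The helper name wav_sample_rate is made up for this example, which writes a short silent clip so that it can run on its own.

```python
import os
import tempfile
import wave

def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.wav")
    # Write 10 ms of 16-bit mono silence at 48 kHz as a stand-in for a real clip
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(48_000)
        wav_file.writeframes(b"\x00\x00" * 480)
    rate = wav_sample_rate(path)

print(rate, "Hz:", "ok" if rate == 16_000 else "needs resampling to 16 kHz")
```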


Conclusion

Fine-tuning the Wav2Vec2 model for the Swedish ASR task can significantly enhance speech recognition capability, making it a potent tool in speech technology. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
