How to Fine-tune wav2vec2-large-xlsr-53-German for Speech Recognition

Jul 6, 2021 | Educational

Speech recognition is becoming integral to user experience across many domains. This practical guide shows how to work with the fine-tuned wav2vec2-large-xlsr-53-German model for automatic speech recognition (ASR): loading the German Common Voice data, preprocessing the audio, running predictions, and measuring the Word Error Rate.

Prerequisites

Python installed on your machine
PyTorch and Torchaudio libraries
Transformers and Datasets libraries from Hugging Face
The jiwer package, which the WER metric in Step 5 relies on
Access to the Common Voice dataset, specifically the German set
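
Before going further, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the original recipe: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list assumes the module names imported later in this guide plus `jiwer`.

```python
import importlib.util

# Top-level module names used later in this guide, plus `jiwer`, which is
# assumed here to be needed by the WER metric.
REQUIRED_MODULES = ["torch", "torchaudio", "datasets", "transformers", "jiwer"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Anything reported as missing can be installed with your usual package manager before you continue.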

Step 1: Setting Up Your Environment

Before diving into the code, ensure that your environment is set up correctly. Import the necessary libraries as illustrated below:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

Step 2: Loading the Dataset

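We will load the German subset of the Common Voice dataset. The split expression 'test[:2%]' selects only the first 2 % of the test split, which keeps the example quick to run: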

test_dataset = load_dataset('common_voice', 'de', split='test[:2%]')

Step 3: Preprocessing the Data

Preprocessing ensures that the audio is in the format the model expects. Think of it like preparing ingredients before you start cooking: a necessary step for the final dish to turn out well. Here the clips are resampled from 48 kHz to the 16 kHz rate that the processor is given in the later steps:

# Resample from 48 kHz to 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    # Load the clip from disk, resample it, and store it as a 1-D NumPy array
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
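
A quick way to sanity-check the resampling step is to compare clip lengths before and after. The sketch below is an illustration only: `expected_num_samples` and `duration_seconds` are hypothetical helpers, and the real resampler may differ from this estimate by a sample or so.

```python
import math


def expected_num_samples(num_samples, orig_rate=48_000, target_rate=16_000):
    """Approximate clip length after resampling from orig_rate to target_rate."""
    return math.ceil(num_samples * target_rate / orig_rate)


def duration_seconds(num_samples, rate=16_000):
    """Duration of a clip with num_samples samples at the given rate."""
    return num_samples / rate


# A 5-second clip at 48 kHz has 240,000 samples.
n_48k = 5 * 48_000
n_16k = expected_num_samples(n_48k)
print(n_16k)                    # 80000
print(duration_seconds(n_16k))  # 5.0
```

If the length of `batch["speech"]` is far from this estimate, the source file was probably not recorded at 48 kHz and the hard-coded resampler is using the wrong input rate.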

Step 4: Making Predictions

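With the data prepared, we can load the processor and model and transcribe the first two clips:

# Placeholder: set this to the path or Hub ID of the fine-tuned
# wav2vec2-large-xlsr-53-German checkpoint you want to use
MODEL_ID = "path-or-hub-id-of-your-checkpoint"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)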
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset['sentence'][:2])
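
To make the argmax-then-decode step less opaque, here is a simplified sketch of greedy CTC decoding in plain Python. It is an illustration under stated assumptions, not the processor's actual implementation. The toy vocabulary is hypothetical, and the sketch assumes the padding token doubles as the CTC blank and `|` marks word boundaries.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "a", 4: "l", 5: "o"}
BLANK_ID = 0  # assumed: the padding token acts as the CTC blank


def greedy_ctc_decode(ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join characters."""
    collapsed = [token for token, _ in groupby(ids)]
    chars = [vocab[token] for token in collapsed if token != blank_id]
    return "".join(chars).replace("|", " ").strip()


# One predicted ID per audio frame, as produced by an argmax over the logits.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 4, 4, 5, 0, 1, 1, 2, 3]
print(greedy_ctc_decode(frame_ids))  # hallo ha
```

The blank between the two runs of `4` is what lets the double "l" in "hallo" survive the collapsing step.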

Step 5: Evaluating the Model

Finally, we’ll evaluate the model to assess its performance on the test dataset:

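wer = load_metric('wer')

# Use the GPU when one is available; the model and its inputs must be on the same device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits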
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result

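The expected Word Error Rate (WER) is 25.284593 %, or roughly 25.3 %. Because Step 2 loads only the first 2 % of the test split, the figure you measure on that subset may differ. WER counts the word-level substitutions, deletions, and insertions needed to turn the predictions into the references, divided by the number of reference words, so lower is better.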
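
To show what the metric measures, here is a minimal from-scratch sketch of a corpus-level WER in plain Python. It is meant for intuition only: `word_error_rate` is a hypothetical helper, it splits on whitespace without any text normalisation, and it is not a replacement for the metric loaded in Step 5.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Word-level Levenshtein distance, computed one row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


refs = ["das ist ein test", "guten morgen"]
preds = ["das ist test", "guten morgen"]
print("WER: {:.2f}".format(100 * word_error_rate(refs, preds)))  # WER: 16.67
```

In this toy example one word out of six reference words is missing from the predictions, which gives a WER of about 16.67 %.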

Troubleshooting

If you experience any issues loading the dataset, ensure you have the correct path and permissions.
For errors related to audio input, confirm that the audio passed to the processor has been resampled to 16 kHz.
If the model does not seem to be learning well during fine-tuning, consider adjusting the batch size or the number of epochs.

Conclusion


By following these steps, you will have a solid foundation for running and evaluating the wav2vec2-large-xlsr-53-German model. Happy coding!
