How to Fine-Tune the Wav2Vec2-Large-XLSR-53 for Euskera Speech Recognition

Jul 10, 2021 | Educational

Many developers and researchers are venturing into automatic speech recognition (ASR), particularly for less-resourced languages. One of the most promising models for this task is Wav2Vec2-Large-XLSR-53, fine-tuned for Euskera (Basque) on the Common Voice dataset. In this article, we will walk step by step through using this model, with some troubleshooting insights along the way.

Getting Started: Prerequisites

  • Ensure you have Python installed on your machine.
  • Install the necessary libraries: PyTorch, Torchaudio, Datasets, and Transformers (for example, pip install torch torchaudio datasets transformers).
  • Have audio input files sampled at 16 kHz; the scripts below resample Common Voice's 48 kHz audio for you.
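Before running anything, you can sanity-check that the required libraries are importable. The helper below is a minimal sketch using only the standard library; the function name is our own, not part of any of these packages:

```python
import importlib.util

def missing_packages(pkgs):
    """Return the subset of pkgs that cannot be found by the import system."""
    return [p for p in pkgs if importlib.util.find_spec(p) is None]

required = ["torch", "torchaudio", "datasets", "transformers"]
print("Missing:", missing_packages(required) or "none")
```

If anything is listed as missing, install it before moving on.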

Using the Model

Let’s dive into the heart of our task—using the Wav2Vec2 model directly for speech recognition. Here’s our instructional approach:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice (2%) of the Euskera test split of Common Voice
test_dataset = load_dataset("common_voice", "eu", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-euskera")
model = Wav2Vec2ForCTC.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-euskera")

# Common Voice audio is 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets: read each file and resample it
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```

In this code, we first import all the necessary libraries, load the test dataset, and preprocess our audio files to ensure they are compatible with the model. Now, let’s use an analogy to encapsulate how this works:

Imagine you are an interpreter at a foreign film festival, where audiences are eager to understand films in languages they do not speak. Before you can interpret, the dialogue must reach you in a form you understand. Likewise, the raw audio must first be resampled and encoded into a format the model understands; the model then "translates" it into readable text.
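To make the decoding step concrete: torch.argmax picks the most likely token at every audio frame, and processor.batch_decode then collapses consecutive repeats and drops the CTC blank token. Here is a minimal pure-Python sketch of that greedy CTC decoding; the blank id of 0 and the toy vocabulary are illustrative assumptions, not the model's actual vocabulary:

```python
def ctc_greedy_decode(ids, id2char, blank_id=0):
    """Collapse consecutive repeats, drop blanks, and map ids to characters."""
    chars = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            chars.append(id2char[i])
        prev = i
    return "".join(chars)

# Toy vocabulary: 0 is the blank, the rest map to letters
vocab = {1: "k", 2: "a", 3: "i"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3], vocab))  # "kai"
```

Real CTC decoding can also use a language model and beam search, but this greedy collapse is what the argmax pipeline above performs.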

Evaluating Model Performance

To evaluate the model's performance on the full Euskera test split, we run the following:

```python
import re
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eu", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-euskera")
model = Wav2Vec2ForCTC.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-euskera")
model.to("cuda")

# Punctuation to strip from reference sentences before scoring
chars_to_ignore_regex = '[,?.!-;:“%‘”]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: normalize the transcript and resample the audio
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluating the model
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
        pred_ids = torch.argmax(logits, dim=-1)
        batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

This evaluates the model and computes the Word Error Rate (WER). Think of WER as the grading system for the translator. The lower the percentage, the more accurate the translation!
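Concretely, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The sketch below is a simplified stand-in for load_metric("wer"), not its exact implementation, to show what the grade means:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

# One substituted word out of three reference words ≈ 0.33
print(word_error_rate("kaixo mundu ederra", "kaixo mundua ederra"))
```

A WER of 0.33 would be reported as 33.33 by the evaluation script above.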

Troubleshooting Tips

If you encounter issues while implementing this model, consider the following troubleshooting tips:

  • Ensure that your audio files are sampled at 16 kHz. If not, you may receive errors or inaccurate predictions.
  • Check for missing libraries or incorrect imports. Every detail counts!
  • Review your input data format to ensure it matches what the model expects.
  • Make sure the model and input tensors are on the same device (GPU or CPU); the evaluation script assumes a CUDA GPU is available.
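For the first tip, you can verify a WAV file's sampling rate with nothing but the standard library before handing it to the model. The file name here is a hypothetical placeholder, and note that the wave module only reads PCM WAV files (Common Voice ships MP3s, which torchaudio.load handles):

```python
import wave

def sampling_rate(path):
    """Return the frame rate of a PCM WAV file."""
    with wave.open(path, "rb") as f:
        return f.getframerate()

# If sampling_rate("clip.wav") != 16_000, resample first, e.g. with
# torchaudio.transforms.Resample(orig_sr, 16_000) as in the scripts above.
```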

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Congratulations! You have successfully tackled the intricacies of using Wav2Vec2-Large-XLSR for Euskera speech recognition. This model’s capabilities can be significantly enhanced with quality audio data and ongoing evaluation. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
