How to Fine-Tune XLSR Wav2Vec2 for Breton Speech Recognition

Jul 7, 2021 | Educational

Welcome to a tour of speech recognition with the XLSR Wav2Vec2 model tailor-made for the Breton language. You no longer need to navigate the seas of complex code and technical jargon alone; this guide is here to serve as your compass.

Understanding the Model

The XLSR Wav2Vec2 model, like a master chef in a kitchen, combines various ingredients (data) into a finished dish (speech recognition). In our case, the ingredients are voice recordings from the Breton Common Voice dataset, on which the model was fine-tuned so that it can transcribe spoken Breton.

Using the Model

Let’s whip up some code to put this fine-tuned model to use! Below is an easy-to-follow recipe for using the model directly:

import re  # needed by the preprocessing function below

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load test data
test_dataset = load_dataset("common_voice", "br", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-breton")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-breton")

chars_to_ignore_regex = '[\\,\,\?\.\!\;\:\"\“\%\”\�\(\)\/\«\»\½\…]'

# Preprocessing function
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    batch["sentence"] = batch["sentence"].replace("ʼ", "'").replace("’", "'").replace('‘', "'")
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

# Process the test dataset
test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Model prediction
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
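The text clean-up inside the preprocessing function can be tried on its own, without downloading any audio. The sketch below is a minimal illustration: the name `normalize_sentence` and the sample sentence are assumptions made for this example, and the pattern is a raw-string rewrite intended to match the same characters as `chars_to_ignore_regex`.

```python
import re

# Raw-string version of the ignore list; \ufffd is the Unicode replacement
# character. Assumed to be equivalent to the tutorial's chars_to_ignore_regex.
CHARS_TO_IGNORE = re.compile(r'[,?.!;:"“%”\ufffd()/«»½…]')


def normalize_sentence(sentence: str) -> str:
    """Strip ignored punctuation, lowercase, append a space, unify apostrophes."""
    cleaned = CHARS_TO_IGNORE.sub("", sentence).lower() + " "
    return cleaned.replace("ʼ", "'").replace("’", "'").replace("‘", "'")


# Illustrative sentence, not taken from the dataset.
print(repr(normalize_sentence("Demat, penaos ’mañ kont?")))
```

Normalizing the reference text this way keeps punctuation and apostrophe variants from being counted as recognition errors later on.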

Decoding Predictions

Running the code above produces predictions for the first two samples, alongside their reference transcriptions:

Prediction: ["ne' ler ket don a-benn us netra pa vez zer nic'hed evel-si", 'an eil hag egile']
Reference: ['"n\'haller ket dont a-benn eus netra pa vezer nec\'het evel-se." ', 'an eil hag egile. '] 
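The `torch.argmax` call followed by `processor.batch_decode` amounts to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, then drop the blank token. The sketch below shows that idea on a toy example; the vocabulary, the frame IDs, and the blank ID of 0 are all hypothetical, and the real mapping ships with the processor.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blank tokens."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; ID 0 plays the role of the CTC blank.
toy_vocab = {0: "", 1: "a", 2: "n", 3: " ", 4: "e", 5: "i", 6: "l"}

# One predicted ID per audio frame, as torch.argmax would return.
frames = [1, 1, 0, 2, 2, 3, 3, 0, 4, 5, 5, 0, 6, 6]
print(greedy_ctc_decode(frames, toy_vocab))
```

The blank token is what allows a genuinely doubled letter to survive: two identical IDs separated by a blank decode to two characters, while two adjacent identical IDs decode to one.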

Model Evaluation

Just like taste-testing a new recipe, you’ll want to evaluate the model’s performance. Here’s how you can assess its accuracy with a standard evaluation metric:

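import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the complete test set
test_dataset = load_dataset("common_voice", "br", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-breton")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-breton")

# Move the model to the GPU (this script assumes a CUDA device is available)
model.to("cuda")

chars_to_ignore_regex = '[\\,\,\?\.\!\;\:\"\“\%\”\�\(\)\/\«\»\½\…]'

# Preprocessing function, identical to the one in the usage example
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    batch["sentence"] = batch["sentence"].replace("ʼ", "'").replace("’", "'").replace('‘', "'")
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch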

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Interpreting the Results

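The evaluation reports a Word Error Rate (WER) of 41.71%. WER is the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of words in the reference, so lower is better. Just like in cooking, this percentage helps you gauge how well the dish was prepared and how much room for improvement remains for your next attempt.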
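To make the metric concrete, here is a minimal sketch of WER for a single sentence pair, computed as a word-level edit distance. The function name `word_error_rate` is an assumption for this example, and it is not the implementation behind `load_metric("wer")`, which scores the whole test set at once.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One missing word out of four reference words.
print(word_error_rate("an eil hag egile", "an eil egile"))
```

Because insertions count as errors, WER can exceed 100% when the prediction contains many more words than the reference.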

Troubleshooting

If you encounter bumps along the way, here are a few troubleshooting tips:

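  • Ensure your audio input is sampled at 16 kHz; the preprocessing function resamples each clip to this rate.
  • Make sure you have the necessary libraries installed: torch, torchaudio, transformers, and datasets, plus jiwer, which the WER metric depends on.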
  • Check your dataset for any discrepancies or issues.
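A quick way to check the second tip is to ask Python which packages it can find before running the scripts. The sketch below is one possible check, not part of the original recipe; the helper name `missing_packages` is hypothetical, and the list assumes the code in this guide, which also imports `datasets` and relies on `jiwer` for the WER metric.

```python
from importlib.util import find_spec

# Packages imported by the scripts in this guide (jiwer backs the WER metric).
REQUIRED = ["torch", "torchaudio", "transformers", "datasets", "jiwer"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with pip before retrying.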

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

The journey to harnessing the power of automatic speech recognition can feel daunting, but with the right resources and guidance, you can successfully put the fine-tuned XLSR Wav2Vec2 model for Breton to work and measure how well it performs. Every error is merely a stepping stone toward perfecting your craft!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
