How to Fine-Tune the Wav2Vec2-Large-XLSR-Indonesian Model

The Wav2Vec2-Large-XLSR-Indonesian model is a powerful tool for automatic speech recognition (ASR) in Indonesian. In this guide, we'll walk through how to use the model effectively, covering inference, evaluation, and training.

Understanding the Model

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53, adapted specifically to the Indonesian Common Voice dataset. For correct results, your speech input must be sampled at 16kHz, the rate the underlying model was trained on.
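If you are ever unsure which rate a checkpoint expects, the processor records it. Here is a minimal sanity check, assuming only that the transformers package is installed:

from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")
# The feature extractor stores the sampling rate the model was trained with.
print(processor.feature_extractor.sampling_rate)  # 16000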

Usage Instructions

You can use the model directly without the need for an additional language model. Here’s a step-by-step breakdown of how to implement this.

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice (2%) of the Indonesian Common Voice test split for a quick check.
test_dataset = load_dataset("common_voice", "id", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")

# Preprocessing: load each audio file and resample it to the 16kHz rate the model expects.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without gradient tracking, then take the most likely token at each timestep.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])

Analogy for Model Operation

Think of the Wav2Vec2-Large-XLSR-Indonesian model as a well-trained chef in a bustling restaurant. The chef's training (the model's learned weights) tells them how to turn raw ingredients (speech input) into finished dishes (text output). Before cooking, the chef makes sure the ingredients are properly prepared, just as the model requires audio sampled at 16kHz; that preparation mirrors the audio preprocessing step. Finally, when the chef serves a dish (the prediction), you can compare it against the description in the recipe (the reference sentence).

Evaluating the Model

To assess the model's accuracy on the full Indonesian test set, measured as word error rate (WER), follow these steps:

import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "id", split="test")
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")
model.to("cuda")
# Punctuation and special characters to strip from references before scoring.
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\'\”\�]'

# Normalize the reference text and resample the audio to 16kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Transcribe a batch of utterances on the GPU.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Training the Model

Training this model uses the Common Voice train and validation splits, supplemented with synthetic voice datasets for better performance. For those looking to dive deeper, the training script is linked from the model card on Hugging Face. A rough sketch of what the fine-tuning setup looks like follows.
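The sketch below shows the general shape of XLSR fine-tuning with the Hugging Face Trainer. The hyperparameters and output path are illustrative assumptions, not the exact configuration used for this checkpoint, and speech_file_to_array_fn is the preprocessing function from the evaluation section.

import torch
from dataclasses import dataclass
from typing import Dict, List, Union
from datasets import load_dataset
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-large-xlsr-indonesian")

# Turn resampled speech and normalized sentences into model inputs and label ids.
def prepare(batch):
    batch["input_values"] = processor(batch["speech"], sampling_rate=16_000).input_values[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

train_dataset = load_dataset("common_voice", "id", split="train+validation")
train_dataset = train_dataset.map(speech_file_to_array_fn).map(prepare)

# Pad audio and labels separately; padded label positions become -100 so CTC ignores them.
@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        return batch

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_extractor()  # keep the convolutional feature encoder frozen

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-indonesian",  # illustrative output path
    per_device_train_batch_size=16,                 # illustrative hyperparameters
    num_train_epochs=30,
    learning_rate=3e-4,
    warmup_steps=500,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(processor=processor),
    train_dataset=train_dataset,
    tokenizer=processor.feature_extractor,
)
trainer.train()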

Troubleshooting

If you encounter issues while implementing the model, keep the following tips in mind:

  • Ensure your audio files are sampled at 16kHz (a quick way to check is sketched after this list).
  • Verify that all necessary packages (e.g., torch, torchaudio, transformers) are properly installed.
  • Check that you are mapping your datasets correctly.
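For the first point, recent torchaudio versions can report a file's sample rate before you feed it to the model. A quick check, with sample.wav as a placeholder path:

import torchaudio

info = torchaudio.info("sample.wav")  # hypothetical file path
print(info.sample_rate)
# Resample on the fly if the file is not already 16 kHz.
if info.sample_rate != 16_000:
    waveform, sr = torchaudio.load("sample.wav")
    waveform = torchaudio.transforms.Resample(sr, 16_000)(waveform)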

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With tools like the Wav2Vec2-Large-XLSR-Indonesian model, we can push the boundaries of what speech recognition technologies can achieve in different languages. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
