How to Fine-Tune Wav2Vec2 for Automatic Speech Recognition in Thai

Advances in automatic speech recognition (ASR) have opened up numerous possibilities in the field of natural language processing. One key player in this domain is the Wav2Vec2 model, in particular the variant fine-tuned for the Thai language, XLSR Wav2Vec2 Large Thai by Sakares. In this article, we will walk through the process of using this model for speech recognition tasks. Let’s dive in!

Setting Up the Model

Before you get started, ensure you have Python and the necessary libraries installed, namely torch, torchaudio, datasets, transformers, and pythainlp. The goal here is to load the Common Voice dataset in Thai and make predictions on it with our model.
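
If you are starting from a clean environment, the dependencies can be installed with pip. The command below is a typical invocation rather than a pinned, tested requirements list, so adjust versions to suit your setup.

python
# Run this in your shell before executing the Python snippets below:
# pip install torch torchaudio datasets transformers pythainlp

With the environment ready, load the dataset, processor, and model: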

python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pythainlp.tokenize import word_tokenize  # used later to segment Thai text for WER

# Load a small slice of the Thai Common Voice test split
test_dataset = load_dataset("common_voice", "th", split="test[:2%]")

# Load the fine-tuned processor and model
processor = Wav2Vec2Processor.from_pretrained("sakares/wav2vec2-large-xlsr-thai-demo")
model = Wav2Vec2ForCTC.from_pretrained("sakares/wav2vec2-large-xlsr-thai-demo")

# Common Voice clips are recorded at 48 kHz; the model expects 16 kHz audio
resampler = torchaudio.transforms.Resample(48_000, 16_000)

In the code above, you are like a conductor orchestrating a symphony of data: each library plays its part to bring the music of speech recognition to life. The dataset is the musical score, the processor prepares that score for our instrument (the model), and the resampler adjusts the tempo (sampling rate) to ensure that we play harmoniously.

Preprocessing the Data

Now that we have our setup ready, it’s time to preprocess the audio files so the model can work with them. The following function does just that:

python
def speech_file_to_array_fn(batch):
    # Read each clip from disk and resample it from 48 kHz to the 16 kHz the model expects
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

Think of preprocessing as tuning your instrument before the performance; without it, the final output would be out of sync and displeasing to the ears. We load the audio files, resample them, and convert them into a format our model can transcribe.
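
Before moving on, you can run a quick, optional sanity check on the first example; the speech and sentence fields come from the mapping step above and the Common Voice schema.

python
# Optional sanity check on the first preprocessed example
sample = test_dataset[0]
print(len(sample["speech"]))  # number of 16 kHz samples in the clip
print(sample["sentence"])     # the reference transcription for that clip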

Making Predictions

Once you have preprocessed your data, you’re ready to make predictions. Let’s see how to do this!

python
# Pad the first two clips into a single batch of model inputs
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy CTC decoding: pick the most likely token at each frame, then collapse to text
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

Evaluating the Model

After making predictions, it’s time to evaluate the model’s performance using the Word Error Rate (WER) metric.
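
The snippet further below relies on an evaluate helper that fills in a pred_strings column, which is not shown elsewhere in this article; a minimal sketch is given here. The th_tokenize helper is an illustrative name: it segments the Thai reference sentences with PyThaiNLP’s word_tokenize (the import from earlier), since Thai is written without spaces and WER needs word boundaries.

python
# Thai has no spaces between words, so segment the references before computing WER.
# th_tokenize is an illustrative helper name; engine="newmm" is PyThaiNLP's default tokenizer.
def th_tokenize(batch):
    batch["sentence"] = " ".join(word_tokenize(batch["sentence"], engine="newmm"))
    return batch

test_dataset = test_dataset.map(th_tokenize)

# Transcribe a batch of preprocessed clips and store the decoded text for scoring
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

With these helpers in place, the WER computation itself is short: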

python
wer = load_metric("wer")

# Transcribe the whole test slice in batches, then score the predictions (lower is better)
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}%".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Troubleshooting Tips

If you encounter any issues, consider the following troubleshooting ideas:

  • Check your audio input sampling rate; it must be 16 kHz for the model to work effectively (see the sketch after this list).
  • Make sure all necessary libraries are installed and updated to the latest versions.
  • If you receive unexpected results, verify that the data preprocessing steps are correctly implemented.
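
As a quick check on the first point, a recent version of torchaudio can report a clip’s sampling rate directly; the path below is a placeholder.

python
# Inspect a clip's metadata; resample if sample_rate is not 16000
info = torchaudio.info("path/to/clip.wav")  # placeholder path
print(info.sample_rate)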

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following these steps, you’ve successfully run and evaluated a speech recognition model fine-tuned specifically for the Thai language! With continuous practice and exploration of ASR models, you’ll be able to contribute even further to the world of natural language processing.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
