How to Fine-tune the Wav2Vec2-Large-XLSR-53 Model for Vietnamese Speech Recognition

In this guide, we’re going to explore how to use a Wav2Vec2-Large-XLSR-53 model fine-tuned for Vietnamese to perform automatic speech recognition (ASR). We will break the process down into manageable steps and include some troubleshooting tips. Let’s get started!

What You Need

  • Python installed on your machine
  • The following Python libraries: torch, torchaudio, datasets, and transformers
  • Audio input sampled at 16kHz

Setting Up Your Environment

Before diving into the code, make sure that you have all the necessary libraries installed. You can easily install these using pip:

pip install torch torchaudio datasets transformers
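
To confirm the setup, you can print the installed versions. This is just a quick sanity check; any reasonably recent releases of these libraries should work:

import torch, torchaudio, datasets, transformers
# Print versions to confirm every dependency imports cleanly
print('torch', torch.__version__, '| torchaudio', torchaudio.__version__)
print('datasets', datasets.__version__, '| transformers', transformers.__version__)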

Running the Model

Now, let’s get down to business and execute some code!

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the test dataset
test_dataset = load_dataset('common_voice', 'vi', split='test')

# Load the Wav2Vec2 processor and model
processor = Wav2Vec2Processor.from_pretrained('not-tanh/wav2vec2-large-xlsr-53-vietnamese')
model = Wav2Vec2ForCTC.from_pretrained('not-tanh/wav2vec2-large-xlsr-53-vietnamese')

# Common Voice audio is 48kHz; resample to the 16kHz the model expects
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the audio files
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Prepare inputs for the model
inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)

# Get predictions from the model
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset['sentence'][:2])

Understanding the Code: An Analogy

Think of the code like a recipe for cooking a delicious Vietnamese dish:

  • The ingredients (like torch, torchaudio, etc.) need to be gathered beforehand.
  • You first prepare (load) your ingredients (test dataset and processor/model) before cooking up your dish (making predictions).
  • The resampler is similar to ensuring all your vegetables are chopped uniformly; this helps in cooking evenly (handling input audio).
  • Finally, the output from the model corresponds to how well the dish turned out, and comparing it with the recipe (the reference) helps evaluate the quality of the cooking!

Evaluating the Model

Once you’ve made some predictions, you may want to gauge how well your model is performing. The standard metric for ASR is the word error rate (WER); note that the wer metric in datasets additionally requires the jiwer package (pip install jiwer). To compute it over the whole test set, run the following code:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

# Load the test dataset and WER metric
test_dataset = load_dataset('common_voice', 'vi', split='test')
wer = load_metric('wer')

# Load processor and model
processor = Wav2Vec2Processor.from_pretrained('not-tanh/wav2vec2-large-xlsr-53-vietnamese')
model = Wav2Vec2ForCTC.from_pretrained('not-tanh/wav2vec2-large-xlsr-53-vietnamese')
model.to('cuda')

# Punctuation and stray symbols to strip from the reference transcripts
# (the hyphen is escaped so it is matched literally rather than forming a character range)
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to preprocess audio files
def speech_file_to_array_fn(batch):
    batch['sentence'] = re.sub(chars_to_ignore_regex, '', batch['sentence']).lower()
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluate the model
def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to('cuda'), attention_mask=inputs.attention_mask.to('cuda')).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch['pred_strings'] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))

Troubleshooting Tips

If you encounter issues while running this model, consider the following troubleshooting ideas:

  • Ensure all libraries are correctly installed. Missing libraries can lead to errors.
  • Verify that your audio input is sampled at the required 16kHz.
  • If you face GPU memory issues, reduce the batch size or fall back to the CPU (see the sketch after this list).
  • For additional help or feedback, you can connect with others working in AI at fxis.ai.
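
For the memory issue in particular, here is a minimal sketch of a device-agnostic variant of the evaluation step: it uses the GPU only when one is available and runs with a smaller batch size. It assumes the processor, model, and test_dataset from the evaluation script are already loaded.

import torch

# Pick the GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
    batch['pred_strings'] = processor.batch_decode(torch.argmax(logits, dim=-1))
    return batch

# A smaller batch size reduces peak memory at the cost of a slower pass
result = test_dataset.map(evaluate, batched=True, batch_size=4)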

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Conclusion

Now you are equipped to run and evaluate the fine-tuned Wav2Vec2-Large-XLSR-53 model for automatic speech recognition in Vietnamese! Happy coding, and enjoy exploring the exciting world of speech recognition.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
