Welcome to our guide on leveraging artificial intelligence for speech recognition in Portuguese! In this blog post, we will explore how to fine-tune the Wav2Vec2 model using the Common Voice dataset together with data augmentation techniques. This powerful model has shown promising results in automatic speech recognition (ASR), and you can easily use it in your own projects.
Getting Started
To kick things off, you need to set up your environment by ensuring you have the necessary libraries. Here’s what you need:
- Python
- Transformers library
- PyTorch
- Torchaudio
Once you have these prerequisites, you can begin with fine-tuning the Wav2vec2 model.
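Assuming the standard PyPI package names (the post itself does not list exact versions), a single command covers all four prerequisites:

```shell
pip install torch torchaudio transformers datasets
```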
Code Implementation
Below is a step-by-step guide to loading the model, processing your dataset, and testing the results. Think of the code as a recipe. Each line is an ingredient building toward a delicious dish of speech recognition.
Start by importing the required libraries:
import re

import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoTokenizer, Wav2Vec2ForCTC
This is similar to gathering your pots and pans before you start cooking. Now, let’s load the tokenizer and the model:
tokenizer = AutoTokenizer.from_pretrained('Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-portuguese')
model = Wav2Vec2ForCTC.from_pretrained('Edresson/wav2vec2-large-100k-voxpopuli-ft-Common_Voice_plus_TTS-Dataset_plus_Data_Augmentation-portuguese')
These two lines set the stage for our cooking—now we have the ingredients ready to mix.
Processing the Dataset
After preparing the model, it’s time to process the Common Voice dataset. This is akin to chopping veggies for your dish; the finer they are, the better they mix in.
dataset = load_dataset('common_voice', 'pt', split='test', data_dir='cv-corpus-7.0-2021-07-21')
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)
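The `Resample` transform converts Common Voice's 48 kHz audio down to the 16 kHz the model expects. As a toy illustration of what that rate change means (purely illustrative; torchaudio applies proper low-pass filtering rather than this naive decimation):

```python
def naive_decimate(samples, orig_freq=48_000, new_freq=16_000):
    # Keep every (orig_freq // new_freq)-th sample: 48 kHz -> 16 kHz keeps 1 in 3.
    # Real resamplers (like torchaudio's Resample) low-pass filter first
    # to avoid aliasing; this sketch only shows the rate arithmetic.
    factor = orig_freq // new_freq
    return samples[::factor]

one_second_of_48k_audio = list(range(48_000))
print(len(naive_decimate(one_second_of_48k_audio)))  # 16000
```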
Next, you’ll want to map the audio files in the dataset. Here’s how to do it:
# Punctuation to strip before scoring; adjust this set to match your corpus.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def map_to_array(batch):
    speech, _ = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech.squeeze(0)).numpy()
    batch['sampling_rate'] = resampler.new_freq
    batch['sentence'] = re.sub(chars_to_ignore_regex, '', batch['sentence']).lower().replace('’', '')
    return batch

ds = dataset.map(map_to_array)
This function is your kitchen assistant, helping you transform your ingredients into an easy-to-handle format.
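The text-normalization step is worth seeing in isolation. Here is a small sketch using a hypothetical punctuation set (Common Voice evaluation scripts typically define something similar, but match it to your own corpus):

```python
import re

# Hypothetical punctuation set for illustration only.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def normalize_sentence(sentence):
    # Mirror map_to_array: strip punctuation, lowercase, drop curly apostrophes.
    return re.sub(chars_to_ignore_regex, '', sentence).lower().replace('\u2019', '')

print(normalize_sentence('Olá, tudo bem?'))  # olá tudo bem
```

Normalizing references and predictions the same way keeps the WER comparison fair: otherwise a stray comma counts as a word error.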
Testing the Results
Once your data is prepared, it’s essential to test the model and validate its effectiveness. After all, every chef must taste their dish!
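The `map_to_pred` function used in the next snippet is not shown above. Here is a hedged sketch of the usual pattern: it assumes the `tokenizer` and `model` loaded earlier are in scope, and that the tokenizer accepts raw speech arrays (in practice, `Wav2Vec2Processor` is the more common entry point for preparing audio inputs):

```python
import torch

def map_to_pred(batch):
    # Sketch only: assumes `tokenizer` and `model` from earlier are in scope.
    inputs = tokenizer(batch['speech'], sampling_rate=16_000,
                       return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch['predicted'] = tokenizer.batch_decode(predicted_ids)
    batch['target'] = batch['sentence']
    return batch
```

With this in place, mapping it over the dataset yields the `predicted` and `target` columns the evaluation needs.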
from datasets import load_metric

wer = load_metric('wer')  # the standard word-error-rate metric from the datasets library
result = ds.map(map_to_pred, batched=True, batch_size=1, remove_columns=list(ds.features.keys()))
print(wer.compute(predictions=result['predicted'], references=result['target']))
The print statement reports the Word Error Rate (WER), the standard measure of ASR quality. If it lands around 20.20%, the figure reported for this model on the Common Voice 7.0 test set, you're doing great!
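To see what WER actually measures, here is a minimal pure-Python sketch: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the reference length. The `wer` metric used above computes the same quantity with more care:

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words, computed with a rolling DP row.
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (or match)
            prev = cur
    return d[-1] / len(ref)

# One substitution in a three-word reference -> WER of 1/3.
print(word_error_rate('o gato preto', 'o gato branco'))
```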
Troubleshooting
If you encounter issues, here are a few troubleshooting steps:
- Ensure all libraries are properly installed and up-to-date.
- Check dataset paths and availability.
- Validate the data format to avoid mismatches during processing.
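For that last point, a small guard in your preprocessing can catch the most common format mismatch, audio at the wrong sampling rate. This is a hypothetical helper, not part of the original pipeline:

```python
def assert_sampling_rate(rate, expected=16_000):
    # Wav2Vec2 models are trained on 16 kHz audio; feeding another rate
    # silently degrades transcriptions instead of raising an error.
    if rate != expected:
        raise ValueError(f'Expected {expected} Hz audio, got {rate} Hz')
    return rate

print(assert_sampling_rate(16_000))  # 16000
```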
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Congratulations on fine-tuning the Wav2vec2 model for Portuguese! Your understanding of speech recognition technology is now at a whole new level.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

