In an increasingly digital world, the ability to transcribe spoken language automatically is becoming essential. Today, we will dive into a Wav2Vec2 model fine-tuned for Ukrainian speech recognition on the Common Voice dataset, covering how to run it on audio samples and how to measure its accuracy. Whether you are a seasoned developer or a novice, this guide will walk you through the process step by step.
Getting Started
Before we jump in, make sure you have the required libraries installed in your Python environment:
- torch
- torchaudio
- datasets
- transformers
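Before running the longer snippets, it can help to confirm that these packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of any of the libraries above.

```python
from importlib.util import find_spec

# The four packages listed above.
REQUIRED = ["torch", "torchaudio", "datasets", "transformers"]

def missing_packages(names):
    """Return the names that cannot be located for import (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any package reported as missing can then be installed with your usual package manager before continuing.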
Understanding the Code: An Analogy
Think of the process of fine-tuning the Wav2Vec2 model like training a new chef (the AI model) to cook a special dish (Ukrainian speech recognition) using specific ingredients (Common Voice dataset). Here is how it works:
- Gather Ingredients: We first gather our ingredients by loading the Common Voice dataset suitable for the Ukrainian language.
- Prepare the Kitchen: Next, we set up our kitchen by defining the tools (models and processors) we will need.
- Mixing the Recipe: We then process our audio samples, preparing them for cooking.
- The Cooking Process: Finally, we feed these ingredients into our chef (the model) to produce a delicious final dish (the transcribed text).
Using the Fine-Tuned Model
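Once the libraries are installed, the model can be used directly, without an additional language model. The snippet below transcribes two test samples and prints the predictions next to the reference sentences: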
python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
test_dataset = load_dataset('common_voice', 'uk', split='test[:2%]')
processor = Wav2Vec2Processor.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model = Wav2Vec2ForCTC.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets.
def speech_file_to_array_fn(batch):
    # Load the audio file and resample it from 48 kHz to the 16 kHz the model expects.
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset['sentence'][:2])
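The argmax followed by batch_decode in the snippet above amounts to greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary and the choice of id 0 as the blank are assumptions made for illustration; the real vocabulary ships with the processor, and the library's decoder handles further details such as word delimiters.

```python
from itertools import groupby

def ctc_greedy_decode(ids, id_to_char, blank_id=0):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    collapsed = [key for key, _ in groupby(ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# Hypothetical toy vocabulary (not the model's real one).
toy_vocab = {0: "<pad>", 1: "т", 2: "а", 3: "к"}

# One predicted id per audio frame, as torch.argmax(logits, dim=-1) would give.
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # так
```

The blank token is what lets the model emit a genuinely doubled letter: the frame sequence 2, 0, 2 decodes to two characters, while 2, 2 decodes to one.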
Evaluating Your Model
To gauge how well your model performs, you can evaluate it using the following steps:
python
import re
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
test_dataset = load_dataset('common_voice', 'uk', split='test')
wer = load_metric('wer')
processor = Wav2Vec2Processor.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model = Wav2Vec2ForCTC.from_pretrained('mrm8488/wav2vec2-large-xlsr-53-ukrainian')
model.to('cuda')
# The hyphen is escaped so it is matched literally instead of defining a range.
chars_to_ignore_regex = r'[,?.!\-;:“%‘”]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets.
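def speech_file_to_array_fn(batch):
    # Strip punctuation from the reference sentence and lowercase it.
    batch['sentence'] = re.sub(chars_to_ignore_regex, '', batch['sentence']).lower()
    # Load the audio file and resample it from 48 kHz to 16 kHz.
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch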
test_dataset = test_dataset.map(speech_file_to_array_fn)
def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to('cuda'), attention_mask=inputs.attention_mask.to('cuda')).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch['pred_strings'] = processor.batch_decode(pred_ids)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER:", "{:.2f}".format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))
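The evaluation strips punctuation from the reference sentences and lowercases them before scoring, so that the model is not penalized for characters it never outputs. One detail worth knowing: inside a regex character class, an unescaped hyphen between two characters defines a range. The sketch below shows the normalization step on its own, with the hyphen escaped; the sample sentences are made up for illustration.

```python
import re

# Punctuation to drop before scoring; the hyphen is escaped so it is literal.
CHARS_TO_IGNORE = r'[,?.!\-;:“%‘”]'

def normalize(sentence):
    """Remove ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()

print(normalize("Привіт, світе!"))  # привіт світе

# Pitfall: '!-;' is a range that also covers the digits 0-9.
print(re.sub('[!-;]', '', 'у 2021 році'))  # digits are removed as well
```

If your reference sentences contain digits that should be kept, the escaped form is the safer choice.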
Test Results
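The model's performance is measured as Word Error Rate (WER), the share of reference words that were transcribed incorrectly (lower is better). On the Common Voice Ukrainian test split, the reported value is: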
Test Result: 41.82 %
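To make the metric concrete, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, pooled over all sentence pairs; the function name and the sample sentences are hypothetical, and the library metric may differ in details.

```python
def word_error_rate(references, predictions):
    """Total word-level edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        r, h = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only the previous row.
        prev = list(range(len(h) + 1))
        for i, ref_word in enumerate(r, start=1):
            curr = [i]
            for j, hyp_word in enumerate(h, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # match or substitution
            prev = curr
        total_edits += prev[-1]
        total_words += len(r)
    return total_edits / total_words

refs = ["це тест розпізнавання мови"]
preds = ["це тест мови"]
print("WER: {:.2f}".format(100 * word_error_rate(refs, preds)))  # WER: 25.00
```

Here one of the four reference words is missing from the prediction, which gives 25 %.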
Troubleshooting Tips
If you encounter issues while setting up your environment or running the code, consider the following troubleshooting ideas:
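- Ensure that your audio input is sampled at 16 kHz, as the model requires it. The Resample(48_000, 16_000) transform in the snippets above assumes 48 kHz source audio, so adjust it if your files use a different rate.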
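- Verify that all necessary libraries are correctly installed. Note that load_metric has been deprecated in newer releases of datasets in favor of the separate evaluate library, so the evaluation snippet may need an older datasets version to run unchanged.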
- Check the paths to your audio files to ensure they are correct.
- If errors persist, restarting your Python environment might resolve temporary conflicts.
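For the path check in particular, a quick scan before the expensive preprocessing step can save time. This is a small sketch with a hypothetical helper, missing_audio_files; it only tests whether each path points to an existing file and does not validate the audio itself.

```python
from pathlib import Path

def missing_audio_files(paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [p for p in paths if not Path(p).is_file()]

# Example with a path that should not exist on any machine.
print(missing_audio_files(["/definitely/not/here.mp3"]))
```

With the dataset from the snippets above, the same helper could be applied to the list of file paths in the path column before calling map.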