In this guide, we will explore how to work with Wav2Vec2-Large-XLSR-53-Swedish, a Wav2Vec2 model fine-tuned on the Common Voice Swedish dataset for automatic speech recognition (ASR). We will load test data, preprocess the audio, run predictions, evaluate the results, and look at how training data is filtered for fine-tuning. Let's go through it step by step.
Prerequisites
Before getting started, ensure you have the following:
- Python installed (preferably Python 3.6 or newer).
- The torch and torchaudio libraries.
- The datasets library.
- The transformers library.
Loading the Dataset
First, we will load the Common Voice dataset:
from datasets import load_dataset
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")
In this snippet, we instruct the system to load the test part of the Common Voice dataset for Swedish, taking only 2% of it for our quick tests.
Preprocessing Audio Input
Next, we will ensure that our audio data is correctly formatted and preprocessed:
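import torchaudio
# The clips are assumed to be recorded at 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    # Load the waveform and its native sampling rate from disk.
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    # Resample, drop the channel dimension, and store as a NumPy array.
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)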
This function loads each audio file, resamples it from 48 kHz to the 16 kHz the model expects, and stores the resulting array in the batch under the "speech" key for later use.
Making Predictions
Now it’s time to make predictions using the fine-tuned model:
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-swedish")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-swedish")
# Pad the first two clips to the same length and convert them to tensors.
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Pick the most likely token id at every time step.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
In this part, we pad and tensorize the first two audio clips, run them through the model without tracking gradients, pick the most likely token at each time step, and decode those token ids into text.
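To make the decoding step more concrete, here is a minimal sketch of greedy CTC decoding in plain Python: collapse consecutive repeated ids, drop the blank token, and map what remains to characters. The toy vocabulary, the blank id, and the function name greedy_ctc_decode are hypothetical and chosen only for illustration; the real processor works with the model's own tokenizer vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j", 5: "d", 6: "å"}
BLANK_ID = 0
WORD_DELIMITER = "|"


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    # Step 1: merge runs of identical ids (one token held over several frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only separates repeated characters.
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    # Step 3: look up characters and turn the word delimiter into a space.
    text = "".join(vocab[token_id] for token_id in kept)
    return text.replace(WORD_DELIMITER, " ").strip()


# Per-frame ids of the kind an argmax over the logits would yield for one clip.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 0, 1, 5, 5, 0, 6, 6]
decoded = greedy_ctc_decode(frame_ids)
print(decoded)
```

Note that a blank between two identical ids keeps both characters, which is how CTC can represent doubled letters.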
Evaluation Metrics
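To evaluate how well our model is performing, we compare its predictions against the reference transcripts. The snippet below computes the WER (Word Error Rate); the CER (Character Error Rate) is defined the same way at the character level: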
from datasets import load_metric
wer_metric = load_metric("wer")
# Assumes 'result' holds the predictions ("pred_strings") and references ("sentence").
# compute() returns a fraction, so multiply by 100 to report a percentage.
wer = 100 * wer_metric.compute(predictions=result["pred_strings"], references=result["sentence"])
print(f"WER: {wer:.2f} %")
The lower the error rates, the better our model performs. For our model, the recorded test WER is around 14.70%!
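To illustrate what these two numbers measure, here is a small self-contained sketch that computes WER and CER from the edit distance between reference and prediction. It is not the implementation behind the metric library, only the general definition (total edits divided by total reference units); the helper names and the toy sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(references, predictions, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(references, predictions):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    return 100.0 * total_edits / total_units


# Hypothetical toy data, not taken from Common Voice.
references = ["jag heter anna", "hur mår du"]
predictions = ["jag heter anna", "hur mor du"]

wer = error_rate(references, predictions, unit="word")
cer = error_rate(references, predictions, unit="char")
print(f"WER: {wer:.2f} %")
print(f"CER: {cer:.2f} %")
```

A single wrong character costs one full word in WER but only one character in CER, which is why CER is usually the lower of the two.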
Training the Model
To enhance the model accuracy, we fine-tune it using the training dataset:
mask = [(5 < len(x.split()) < 20) for x in dataset["transcript"].tolist()]
# Fine-tune your model
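This mask keeps only the samples whose transcript has more than 5 and fewer than 20 words, which removes very short and very long utterances and helps balance the training process. The snippet assumes that dataset is a table-like object whose "transcript" column supports .tolist(), for example a pandas DataFrame. After enough training steps, the model becomes better adapted to recognizing Swedish speech.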
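The word-count filter can be tried out on its own. The following sketch applies the same condition to a plain Python list instead of a dataset column; the function name length_mask and the example transcripts are hypothetical.

```python
def length_mask(transcripts, min_words=5, max_words=20):
    """Return one boolean per transcript: True if min_words < word count < max_words."""
    return [min_words < len(text.split()) < max_words for text in transcripts]


# Hypothetical transcripts standing in for the "transcript" column.
transcripts = [
    "hej",                              # 1 word: too short
    "jag bor i stockholm nu",           # 5 words: excluded, the bound is strict
    "det här är en lagom lång mening",  # 7 words: kept
    " ".join(["ord"] * 19),             # 19 words: kept
    " ".join(["ord"] * 20),             # 20 words: excluded, the bound is strict
]

mask = length_mask(transcripts)
kept = [text for text, keep in zip(transcripts, mask) if keep]
print(mask)
print(len(kept), "of", len(transcripts), "transcripts kept")
```

Because both comparisons are strict, transcripts with exactly 5 or exactly 20 words are dropped.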
Troubleshooting
While implementing this process, you may run into a few issues. Here are some tips:
- If you run into errors concerning model loading, double-check the model ID and ensure it exists on the Hugging Face platform.
- If audio preprocessing issues arise, ensure you're pointing to valid audio file paths in your dataset.
- If you hit sample-rate mismatches, make sure all audio is resampled to 16 kHz before it reaches the processor.
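For the sample-rate tip, a quick check of a file's actual rate can save some guessing. The sketch below uses only the standard-library wave module, so it works for WAV files only; for other formats, the sampling rate returned alongside the waveform by torchaudio.load serves the same purpose. The helper names and the demo file are hypothetical.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model in this guide expects


def wav_sample_rate(path):
    """Read the sampling rate from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


# Demo: write one second of 16-bit mono silence at 48 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_48k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav_file.setframerate(48_000)
    wav_file.writeframes(b"\x00\x00" * 48_000)

print(wav_sample_rate(demo_path), needs_resampling(demo_path))
```

Running a check like this over a few files before preprocessing shows whether a fixed 48 kHz to 16 kHz resampler is actually appropriate for your data.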
Conclusion
Fine-tuning the Wav2Vec2 model for the Swedish ASR task can significantly enhance speech recognition capability, making it a potent tool in speech technology.