Automatic Speech Recognition (ASR) technology has come a long way, providing tools that can understand human speech and convert it into text. In this blog, we will learn how to use the wav2vec2-large-xls-r-300m-hsb-v3 model, a fine-tuned ASR model for the Upper Sorbian language (hsb). Let’s dive into the details of evaluation, training, and usage to build our own ASR system.
Understanding the Model
The model we are using is a fine-tuned version of facebook/wav2vec2-xls-r-300m, trained on the Mozilla Foundation’s Common Voice 8.0 dataset. It leverages powerful self-supervised speech representations to convert spoken language into text efficiently, achieving the following metrics on the test split:
- Test WER (Word Error Rate): 0.476
- Test CER (Character Error Rate): 0.112
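Both metrics are just normalized edit distances: WER counts word-level substitutions, insertions, and deletions against the reference word count, while CER does the same at the character level. A minimal sketch of how they are computed (the function names here are illustrative, not taken from eval.py):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j] here.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
```

In practice, evaluation scripts usually lean on a library such as jiwer for this, but the arithmetic is exactly the above.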
Evaluation of the Model
To evaluate the model, you can use the following commands:
- To evaluate on the Common Voice 8.0 dataset:
```bash
python eval.py --model_id DrishtiSharma/wav2vec2-large-xls-r-300m-hsb-v3 --dataset mozilla-foundation/common_voice_8_0 --config hsb --split test --log_outputs
```
- Note: evaluation on the speech-recognition-community-v2/dev_data dataset is not possible, because Upper Sorbian (hsb) is not included in that dataset.
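Under the hood, wav2vec2 is a CTC model: for each audio frame it emits a logit vector over a character vocabulary, and decoding collapses consecutive repeats and drops the blank token. A toy greedy decode makes the mechanism concrete (the five-symbol vocabulary and one-hot "logits" below are entirely made up for illustration, not the model’s real vocabulary):

```python
import numpy as np

# Hypothetical vocabulary: index 0 is the CTC blank token.
VOCAB = ["<blank>", "h", "s", "b", " "]

def greedy_ctc_decode(logits):
    """Take the argmax per frame, collapse repeats, then drop blanks."""
    ids = np.argmax(logits, axis=-1)
    collapsed = [i for n, i in enumerate(ids) if n == 0 or i != ids[n - 1]]
    return "".join(VOCAB[i] for i in collapsed if i != 0)

# Fake per-frame predictions: h, h, <blank>, s, b, b  ->  "hsb"
frames = [1, 1, 0, 2, 3, 3]
logits = np.eye(len(VOCAB))[frames]  # one-hot stand-ins for real logits
print(greedy_ctc_decode(logits))     # hsb
```

The blank token is what lets CTC represent genuinely doubled characters: a blank between two identical argmax frames prevents them from being collapsed into one.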
Training the Model
When it comes to training, several hyperparameters are tuned to achieve optimal performance, like:
- Learning Rate: 0.00045
- Batch Sizes: 16 (train), 8 (eval)
- Number of Epochs: 50
- Optimizer: Adam
- Gradient Accumulation Steps: 2
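One detail worth spelling out: with gradient accumulation, the effective batch size is the per-device train batch multiplied by the accumulation steps. A quick sanity check of the settings above (the training-set size used here is a made-up placeholder, only to show the step-count arithmetic):

```python
import math

train_batch_size = 16   # per-device train batch, from the list above
grad_accum_steps = 2    # gradient accumulation steps, from the list above
num_epochs = 50

# Gradients are accumulated over 2 steps before each optimizer update.
effective_batch = train_batch_size * grad_accum_steps
print(effective_batch)  # 32

# Hypothetical dataset size, purely to illustrate the formula.
num_examples = 8_000
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 250 12500
```

This is why accumulation is handy on small GPUs: you get the optimization behavior of a batch of 32 while only ever holding 16 examples in memory.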
This may sound a bit technical, but let’s use an analogy to simplify it:
Imagine you are a chef in a kitchen mastering the art of making the perfect soufflé. The learning rate is your attention to detail – too fast and the soufflé won’t rise, too slow and it takes forever to cook. The batch sizes represent how many soufflés you attempt at once – too many and they burn; too few and you can’t serve your guests quickly. The total number of epochs is akin to the number of times you practice the recipe until it’s flawless!
Training Results
After rigorous training, the model displays its performance over epochs, showcasing how the training loss decreases and validation metrics, such as WER, improve. Here’s a brief look at the results:
- Epoch 1 – Training Loss: 8.8951, WER: 1.000
- Epoch 5 – Training Loss: 0.7994, WER: 0.7529
- Epoch 50 – Training Loss: 0.6549, WER: 0.4827
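One way to read those numbers is as a relative error reduction: the final WER compared against the first-epoch baseline (figures copied from the results above):

```python
# Validation WER by epoch, taken from the training results listed above.
wer_by_epoch = {1: 1.000, 5: 0.7529, 50: 0.4827}

baseline, final = wer_by_epoch[1], wer_by_epoch[50]
relative_reduction = (baseline - final) / baseline
print(f"{relative_reduction:.1%}")  # 51.7%
```

In other words, training roughly halves the word error rate relative to the untrained starting point, even though the absolute WER of 0.48 still leaves room for improvement.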
Troubleshooting
If you encounter issues during evaluation or training, here are some ideas:
- Check the compatibility of your Python environment with Torch and Transformers versions.
- Ensure you have the necessary datasets downloaded and accessible.
- If errors arise, consider examining the model loading code and configuration parameters.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In this post, we’ve explored how to leverage the wav2vec2 model for Upper Sorbian ASR, along with evaluation and training techniques. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

