Automatic Speech Recognition (ASR) has become a pivotal technology in our interactions with machines. This guide will walk you through the process of setting up an ASR system using the Whisper model fine-tuned on the CommonVoice dataset specifically for the Hindi language.
What You Will Need
- A computer with Python installed
- Basic knowledge of command-line interaction
- Audio files in Hindi for transcription
Step 1: Install Required Libraries
To get started, you’ll need to install the necessary packages. Open your command line and run the following command:
pip install speechbrain transformers==4.28.0
This command installs both the SpeechBrain toolkit and the transformers library required for our ASR setup.
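To confirm the installation succeeded, you can query the installed versions from Python itself. This is a minimal sketch using only the standard library; the package names are simply the two installed above.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Check both dependencies before moving on
for pkg in ("speechbrain", "transformers"):
    version = installed_version(pkg)
    print(pkg, version if version else "NOT INSTALLED")
```

If either package prints NOT INSTALLED, re-run the pip command before continuing.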
Step 2: Transcribing Your Audio Files
With the libraries in place, you’re ready to transcribe your audio files. Use the following Python code:
from speechbrain.inference.ASR import WhisperASR
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-large-v2-commonvoice-hi", savedir="pretrained_models/asr-whisper-large-v2-commonvoice-hi")
print(asr_model.transcribe_file("speechbrain/asr-whisper-large-v2-commonvoice-hi/example-hi.wav"))
This code loads the fine-tuned model, transcribes the bundled sample file example-hi.wav, and prints the resulting Hindi text.
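If you have more than one recording, the same call can be applied across a whole folder. The helper below is a hedged sketch: `transcribe` stands for any transcription callable (such as `asr_model.transcribe_file` from the snippet above), and the folder layout is an assumption for illustration.

```python
from pathlib import Path

def transcribe_folder(transcribe, folder):
    """Run a transcription callable (e.g. asr_model.transcribe_file)
    over every .wav file in a folder, returning {filename: text}."""
    results = {}
    for wav in sorted(Path(folder).glob("*.wav")):
        results[wav.name] = transcribe(str(wav))
    return results

# Usage with the model loaded above (folder name is hypothetical):
# hindi_texts = transcribe_folder(asr_model.transcribe_file, "my_hindi_audio/")
```

Sorting the paths keeps the output order deterministic across runs.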
Step 3: Running Inference on a GPU
If you have a compatible GPU, you can speed up the inference process. Simply modify the previous code slightly:
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-large-v2-commonvoice-hi", savedir="pretrained_models/asr-whisper-large-v2-commonvoice-hi", run_opts={"device": "cuda"})
This addition specifies that the computation should occur on a GPU rather than the CPU.
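Not every machine has a CUDA GPU, so it is safer to detect one at runtime and fall back to the CPU. The sketch below assumes PyTorch may or may not be importable in your environment.

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is visible to PyTorch, else "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # PyTorch not installed: run on CPU
        return "cpu"

# Usage with the loader from Step 2:
# asr_model = WhisperASR.from_hparams(
#     source="speechbrain/asr-whisper-large-v2-commonvoice-hi",
#     savedir="pretrained_models/asr-whisper-large-v2-commonvoice-hi",
#     run_opts={"device": pick_device()},
# )
```

This way the same script runs unchanged on a laptop and on a GPU server.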
Understanding the Code: An Analogy
Think of the Whisper model as a high-speed train designed to travel on a track laid with audio data (our audio files). Here’s how the elements of our ASR system come together:
- Pretrained Whisper Encoder: The train engine, designed to process sound waves and convert them into a form the system can understand.
- Whisper Tokenizer: The conductor, who ensures each sound is categorized correctly as it passes through the system.
- Greedy Decoder: The terminal station, where the processed audio is released as readable text; at each step it simply emits the single most likely token.
In essence, our audio files take the form of passengers boarding this high-speed train, and as they travel through the various components, they’re transformed into comprehensible language by the time they reach their destination.
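To make the "station" concrete: a greedy decoder takes the highest-scoring token at every step, with no lookahead or beam search. The toy example below uses made-up scores and a made-up four-symbol vocabulary purely for illustration; it is not the model's real tokenizer.

```python
def greedy_decode(scores_per_step, vocab):
    """At each step, emit the vocabulary entry with the highest score
    (greedy decoding = per-step argmax, no beam search)."""
    tokens = []
    for scores in scores_per_step:
        best = max(range(len(scores)), key=lambda i: scores[i])
        tokens.append(vocab[best])
    return "".join(tokens)

# Toy run: 3 decoding steps over a 4-symbol vocabulary (scores invented)
vocab = ["न", "म", "स", "ते"]
steps = [
    [0.7, 0.1, 0.1, 0.1],  # "न" has the highest score
    [0.2, 0.5, 0.2, 0.1],  # "म"
    [0.1, 0.1, 0.2, 0.6],  # "ते"
]
print(greedy_decode(steps, vocab))  # prints "नमते"
```

The real decoder works over Whisper's learned token scores, but the per-step argmax idea is the same.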
Troubleshooting Tips
If you encounter any issues, consider the following troubleshooting ideas:
- Ensure that all the dependent libraries are correctly installed and updated.
- If your transcription seems inaccurate, check the quality of the audio file and ensure it’s in the format the model expects (16 kHz sample rate, mono channel).
- If you experience problems running inference on your GPU, verify that your drivers are up to date and compatible with the PyTorch version.
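You can verify the 16 kHz mono requirement with the standard library's wave module before sending a file to the model. A minimal sketch; it assumes uncompressed PCM .wav input.

```python
import wave

def check_wav(path, expected_rate=16000, expected_channels=1):
    """Report a .wav file's sample rate and channel count, and whether
    they match what the model expects."""
    with wave.open(path, "rb") as w:
        rate, channels = w.getframerate(), w.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        "ok": rate == expected_rate and channels == expected_channels,
    }

# Usage (path is hypothetical): check_wav("my_hindi_audio/clip1.wav")["ok"]
```

If "ok" is False, resample or downmix the file (for example with ffmpeg or sox) before transcribing.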
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Next Steps for Training Your Own Model
If you’re interested in training your own speech recognition model from scratch, here’s a quick overview of the steps:
- Clone the SpeechBrain repository:
git clone https://github.com/speechbrain/speechbrain
- Navigate to the directory and install the necessary requirements:
cd speechbrain
pip install -r requirements.txt
pip install -e .
- Run the training script on your dataset:
cd recipes/CommonVoice/ASR/transformer
python train_with_whisper.py hparams/train_hi_hf_whisper.yaml --data_folder=your_data_folder
Final Thoughts
With the right setup and understanding, deploying an Automatic Speech Recognition model can significantly enhance user interaction capabilities. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
This concludes our guide on setting up an ASR system using Whisper. Happy transcribing!

