Automatic Speech Recognition (ASR) has become a pivotal technology in our interactions with machines. This guide will walk you through the process of setting up an ASR system using the Whisper model fine-tuned on the CommonVoice dataset specifically for the Hindi language.
What You Will Need
- A computer with Python installed
- Basic knowledge of command-line interaction
- Audio files in Hindi for transcription
Step 1: Install Required Libraries
To get started, you’ll need to install the necessary packages. Open your command line and run the following command:
pip install speechbrain transformers==4.28.0
This command installs both the SpeechBrain toolkit and the transformers library required for our ASR setup.
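To confirm the installation succeeded, you can query the installed versions from Python itself. This is a minimal sketch using only the standard library; the package names are simply the two installed above.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Check both dependencies before moving on
for pkg in ("speechbrain", "transformers"):
    version = installed_version(pkg)
    print(pkg, version if version else "NOT INSTALLED")
```

If either package prints NOT INSTALLED, re-run the pip command before continuing.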
Step 2: Transcribing Your Audio Files
With the libraries in place, you’re ready to transcribe your audio files. Use the following Python code:
from speechbrain.inference.ASR import WhisperASR
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-large-v2-commonvoice-hi", savedir="pretrained_models/asr-whisper-large-v2-commonvoice-hi")
print(asr_model.transcribe_file("speechbrain/asr-whisper-large-v2-commonvoice-hi/example-hi.wav"))
This code loads the fine-tuned model, transcribes the bundled sample file example-hi.wav, and prints the resulting Hindi text.
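If you have more than one recording, the same call can be applied across a whole folder. The helper below is a hedged sketch: `transcribe` stands for any transcription callable (such as `asr_model.transcribe_file` from the snippet above), and the folder layout is an assumption for illustration.

```python
from pathlib import Path

def transcribe_folder(transcribe, folder):
    """Run a transcription callable (e.g. asr_model.transcribe_file)
    over every .wav file in a folder, returning {filename: text}."""
    results = {}
    for wav in sorted(Path(folder).glob("*.wav")):
        results[wav.name] = transcribe(str(wav))
    return results

# Usage with the model loaded above (folder name is hypothetical):
# hindi_texts = transcribe_folder(asr_model.transcribe_file, "my_hindi_audio/")
```

Sorting the paths keeps the output order deterministic across runs.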
Step 3: Running Inference on a GPU
If you have a compatible GPU, you can speed up the inference process. Simply modify the previous code slightly:
asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-large-v2-commonvoice-hi", savedir="pretrained_models/asr-whisper-large-v2-commonvoice-hi", run_opts={"device": "cuda"})
This addition specifies that the computation should occur on a GPU rather than the CPU.
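Not every machine has a CUDA GPU, so it is safer to detect one at runtime and fall back to the CPU. The sketch below assumes PyTorch may or may not be importable in your environment.

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is visible to PyTorch, else "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # PyTorch not installed: run on CPU
        return "cpu"

# Usage with the loader from Step 2:
# asr_model = WhisperASR.from_hparams(
#     source="speechbrain/asr-whisper-large-v2-commonvoice-hi",
#     savedir="pretrained_models/asr-whisper-large-v2-commonvoice-hi",
#     run_opts={"device": pick_device()},
# )
```

This way the same script runs unchanged on a laptop and on a GPU server.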
Understanding the Code: An Analogy
Think of the Whisper model as a high-speed train designed to travel on a track laid with audio data (our audio files). Here’s how the elements of our ASR system come together:
- Pretrained Whisper Encoder: The train engine, designed to process sound waves and convert them into a form the system can understand.
- Whisper Tokenizer: The conductor, who ensures each sound is categorized correctly as it passes through the system.
- Greedy Decoder: The terminal station, where the processed audio is released as readable text; at each step it simply emits the single most likely token.
In essence, our audio files take the form of passengers boarding this high-speed train, and as they travel through the various components, they’re transformed into comprehensible language by the time they reach their destination.
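To make the "station" concrete: a greedy decoder takes the highest-scoring token at every step, with no lookahead or beam search. The toy example below uses made-up scores and a made-up four-symbol vocabulary purely for illustration; it is not the model's real tokenizer.

```python
def greedy_decode(scores_per_step, vocab):
    """At each step, emit the vocabulary entry with the highest score
    (greedy decoding = per-step argmax, no beam search)."""
    tokens = []
    for scores in scores_per_step:
        best = max(range(len(scores)), key=lambda i: scores[i])
        tokens.append(vocab[best])
    return "".join(tokens)

# Toy run: 3 decoding steps over a 4-symbol vocabulary (scores invented)
vocab = ["न", "म", "स", "ते"]
steps = [
    [0.7, 0.1, 0.1, 0.1],  # "न" has the highest score
    [0.2, 0.5, 0.2, 0.1],  # "म"
    [0.1, 0.1, 0.2, 0.6],  # "ते"
]
print(greedy_decode(steps, vocab))  # prints "नमते"
```

The real decoder works over Whisper's learned token scores, but the per-step argmax idea is the same.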
Troubleshooting Tips
If you encounter any issues, consider the following troubleshooting ideas:
- Ensure that all the dependent libraries are correctly installed and updated.
- If your transcription seems inaccurate, check the quality of the audio file and ensure it’s in the format the model expects (16 kHz sample rate, mono channel).
- If you experience problems running inference on your GPU, verify that your drivers are up to date and compatible with the PyTorch version.
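You can verify the 16 kHz mono requirement with the standard library's wave module before sending a file to the model. A minimal sketch; it assumes uncompressed PCM .wav input.

```python
import wave

def check_wav(path, expected_rate=16000, expected_channels=1):
    """Report a .wav file's sample rate and channel count, and whether
    they match what the model expects."""
    with wave.open(path, "rb") as w:
        rate, channels = w.getframerate(), w.getnchannels()
    return {
        "sample_rate": rate,
        "channels": channels,
        "ok": rate == expected_rate and channels == expected_channels,
    }

# Usage (path is hypothetical): check_wav("my_hindi_audio/clip1.wav")["ok"]
```

If "ok" is False, resample or downmix the file (for example with ffmpeg or sox) before transcribing.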
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Next Steps for Training Your Own Model
If you’re interested in training your own speech recognition model from scratch, here’s a quick overview of the steps:
- Clone the SpeechBrain repository:
git clone https://github.com/speechbrain/speechbrain
- Navigate to the directory and install the necessary requirements:
cd speechbrain
pip install -r requirements.txt
pip install -e .
- Run the training script on your dataset:
cd recipes/CommonVoice/ASR/transformer
python train_with_whisper.py hparams/train_hi_hf_whisper.yaml --data_folder=your_data_folder
Final Thoughts
With the right setup and understanding, deploying an Automatic Speech Recognition model can significantly enhance user interaction capabilities. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
This concludes our guide on setting up an ASR system using Whisper. Happy transcribing!

