In the realm of voice technology, automatic speech recognition (ASR) models transform audio data into readable text. Whisper, an influential ASR model family, supports many languages, including Cantonese. This blog will guide you through using a Cantonese-fine-tuned Whisper model effectively in Python.
Step-by-Step Setup for ASR
Let’s dive into a straightforward example to help you grasp the full extent of its capabilities:
- Install necessary libraries: Ensure you have installed the required libraries, namely `torch`, `librosa`, and the `transformers` library.
- Import the necessary classes: Begin by importing the essential modules from your libraries:
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperTokenizer, WhisperForConditionalGeneration
```
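If any of these imports fail, the corresponding package is missing from your environment. A quick sanity check (a small convenience sketch, not part of the original workflow) is to look up each dependency before running the rest of the tutorial:

```python
import importlib.util

def missing_dependencies(names=("torch", "librosa", "transformers")):
    """Return the required packages that cannot be found in the environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_dependencies()
if missing:
    print(f"Please install: {', '.join(missing)}")
```

Anything reported here can be installed with pip before proceeding.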
Configuration and Model Loading
Next, let’s set up the model. Imagine this process like setting the stage for a choir performance: you want everything just right before the voices (or data) are brought in.
```python
# Setup
model_name_or_path = "Oblivion208/whisper-tiny-cantonese"
task = "transcribe"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model, tokenizer, and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path).to(device)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, task=task)
```
Feature Extraction and Inference
Once the stage is set, it’s time to tune in the choir! Extract features and generate the transcription from the audio file, as follows:
```python
# Load the audio file, resampled to the 16 kHz mono input Whisper expects
filepath = "test.wav"
audio, sr = librosa.load(filepath, sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(device)

# Perform inference
with torch.inference_mode():
    generated_tokens = model.generate(
        input_features=inputs.input_features,
        return_dict_in_generate=True,
        max_new_tokens=255,
    )
transcription = tokenizer.batch_decode(generated_tokens.sequences, skip_special_tokens=True)
print(transcription)
```
Understanding Performance Evaluation
After your transcription is generated, you need to evaluate model performance. This can be likened to assessing the quality of the choir’s performance after the show. The key metric here is the character error rate (CER); here are the approximate metrics for several fine-tuned models:
| Model name | Parameters | Finetune Steps | Time Spent | Training Loss | Validation Loss | CER % |
|---|---|---|---|---|---|---|
| whisper-tiny-cantonese | 39 M | 3200 | 4h 34m | 0.0485 | 0.771 | 11.10 |
| whisper-base-cantonese | 74 M | 7200 | 13h 32m | 0.0186 | 0.477 | 7.66 |
| whisper-small-cantonese | 244 M | 3600 | 6h 38m | 0.0266 | 0.137 | 6.16 |
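The CER column above is the character error rate: the Levenshtein (edit) distance between the reference transcript and the model’s hypothesis, divided by the length of the reference. As a rough sketch of how this metric is computed (in practice you would typically use a library such as jiwer or evaluate):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Row-by-row dynamic-programming computation of Levenshtein distance
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(hyp)] / len(ref)

print(cer("你好世界", "你好世堺"))  # one substitution over four characters → 0.25
```

Character-level scoring suits Cantonese well, since word boundaries are not marked in written Chinese.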
Troubleshooting Tips
Should you encounter any hiccups during your ASR journey, here are some helpful troubleshooting ideas:
- If you experience issues with model loading, ensure your model name is correctly specified and that you have a stable internet connection to download the model weights.
- In case of a runtime error, double-check that you have all necessary libraries installed and that your GPU is properly set up for inference. Try switching to CPU if the GPU fails.
- For performance-related issues, evaluate the model parameters and selected features to ensure they align with your audio data characteristics.
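For the GPU fallback mentioned above, one pattern is to wrap the whole inference call and retry on CPU when a CUDA-related RuntimeError occurs. A minimal sketch (the `transcribe` function is hypothetical, standing in for the load-and-generate code from earlier sections):

```python
def run_with_fallback(fn, devices=("cuda:0", "cpu")):
    """Call fn(device) on each device in order, falling back on RuntimeError."""
    last_err = None
    for device in devices:
        try:
            return fn(device)
        except RuntimeError as err:  # e.g. CUDA out of memory or driver issues
            last_err = err
    raise last_err

# Usage sketch: transcribe(device) would load the model on `device` and run generate()
# result = run_with_fallback(transcribe)
```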
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

