In the realm of voice technology, automatic speech recognition (ASR) models transform audio data into readable text. Whisper, an influential ASR model family, supports many languages, including Cantonese. This blog will guide you through using a Cantonese-fine-tuned Whisper model effectively in Python.
Step-by-Step Setup for ASR
Let’s dive into a straightforward example to help you grasp the full extent of its capabilities:
- Install necessary libraries: Ensure you have installed the required libraries, namely `torch`, `librosa`, and the `transformers` library.
- Import the necessary classes: Begin by importing the essential modules from your libraries:
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperTokenizer, WhisperForConditionalGeneration
```
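If any of these imports fail, the corresponding package is missing from your environment. A quick sanity check (a small convenience sketch, not part of the original workflow) is to look up each dependency before running the rest of the tutorial:

```python
import importlib.util

def missing_dependencies(names=("torch", "librosa", "transformers")):
    """Return the required packages that cannot be found in the environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_dependencies()
if missing:
    print(f"Please install: {', '.join(missing)}")
```

Anything reported here can be installed with pip before proceeding.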
Configuration and Model Loading
Next, let’s set up the model. Imagine this process like setting the stage for a choir performance: you want everything just right before the voices (or data) are brought in.
```python
# Setup
model_name_or_path = "Oblivion208/whisper-tiny-cantonese"
task = "transcribe"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model, tokenizer, and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path).to(device)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, task=task)
```
Feature Extraction and Inference
Once the stage is set, it’s time to tune in the choir! Extract features and generate the transcription from the audio file, as follows:
```python
# Load the audio file, resampled to the 16 kHz mono input Whisper expects
filepath = "test.wav"
audio, sr = librosa.load(filepath, sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(device)

# Perform inference
with torch.inference_mode():
    generated_tokens = model.generate(
        input_features=inputs.input_features,
        return_dict_in_generate=True,
        max_new_tokens=255,
    )
transcription = tokenizer.batch_decode(generated_tokens.sequences, skip_special_tokens=True)
print(transcription)
```
Understanding Performance Evaluation
After your transcription is generated, you need to evaluate model performance. This can be likened to assessing the quality of the choir’s performance after the show. The key metric here is the character error rate (CER); here are the approximate metrics for several fine-tuned models:
| Model name | Parameters | Finetune Steps | Time Spent | Training Loss | Validation Loss | CER % |
|---|---|---|---|---|---|---|
| whisper-tiny-cantonese | 39 M | 3200 | 4h 34m | 0.0485 | 0.771 | 11.10 |
| whisper-base-cantonese | 74 M | 7200 | 13h 32m | 0.0186 | 0.477 | 7.66 |
| whisper-small-cantonese | 244 M | 3600 | 6h 38m | 0.0266 | 0.137 | 6.16 |
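The CER column above is the character error rate: the Levenshtein (edit) distance between the reference transcript and the model’s hypothesis, divided by the length of the reference. As a rough sketch of how this metric is computed (in practice you would typically use a library such as jiwer or evaluate):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Row-by-row dynamic-programming computation of Levenshtein distance
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(hyp)] / len(ref)

print(cer("你好世界", "你好世堺"))  # one substitution over four characters → 0.25
```

Character-level scoring suits Cantonese well, since word boundaries are not marked in written Chinese.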
Troubleshooting Tips
Should you encounter any hiccups during your ASR journey, here are some helpful troubleshooting ideas:
- If you experience issues with model loading, ensure your model name is correctly specified and that you have a stable internet connection to download the model weights.
- In case of a runtime error, double-check that you have all necessary libraries installed and that your GPU is properly set up for inference. Try switching to CPU if the GPU fails.
- For performance-related issues, evaluate the model parameters and selected features to ensure they align with your audio data characteristics.
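For the GPU fallback mentioned above, one pattern is to wrap the whole inference call and retry on CPU when a CUDA-related RuntimeError occurs. A minimal sketch (the `transcribe` function is hypothetical, standing in for the load-and-generate code from earlier sections):

```python
def run_with_fallback(fn, devices=("cuda:0", "cpu")):
    """Call fn(device) on each device in order, falling back on RuntimeError."""
    last_err = None
    for device in devices:
        try:
            return fn(device)
        except RuntimeError as err:  # e.g. CUDA out of memory or driver issues
            last_err = err
    raise last_err

# Usage sketch: transcribe(device) would load the model on `device` and run generate()
# result = run_with_fallback(transcribe)
```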
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

