Mastering Whisper: A Guide to Automatic Speech Recognition

Jun 13, 2024 | Educational

Welcome to the fascinating world of Automatic Speech Recognition (ASR) with Whisper! In this guide, we will explore how to effectively use the Whisper model for ASR and even dive into troubleshooting. We’ll break it down step-by-step, so whether you’re a seasoned developer or just starting your journey, you’ll find it user-friendly and insightful.

What is Whisper?

Whisper is a pre-trained model for automatic speech recognition and speech translation. The original release was trained on 680k hours of labeled audio, and it generalizes across many datasets and domains without the need for fine-tuning. Think of Whisper as a multilingual interpreter, capable of bridging language gaps through audio!

Getting Started with Whisper

Before we dive into the technical details, imagine Whisper as a talented musician hosting a concert. Each song (or audio input) needs to be understood and played perfectly. Here’s how to set the stage for success.

Step 1: Installation

To get started with Whisper, follow these installation steps. We recommend using Hugging Face’s Transformers library. Here’s how to set it up:


pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate "datasets[audio]"
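
Before moving on, it can help to confirm that the packages are importable from the Python environment you plan to use. The following is a minimal sketch; `missing_packages` and `REQUIRED` are hypothetical names chosen for illustration, and the list assumes the script below imports `torch`, `transformers`, and `datasets` directly.

```python
import importlib.util

# Top-level import names used by the transcription script in this guide.
REQUIRED = ("torch", "transformers", "accelerate", "datasets")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, rerun the installation commands inside the same environment.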

Step 2: Short-Form Transcription

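Whisper natively handles short audio clips (up to 30 seconds) in a single pass. The pipeline below also sets `chunk_length_s=30`, so longer recordings, such as the sample loaded here, are split into 30-second chunks and transcribed in batches: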


import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use the GPU with half precision when available; otherwise fall back to CPU and float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
# Load the model weights and the processor (tokenizer + feature extractor).
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)

print(result["text"])

Understanding the Code

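Using the code above is like preparing a meal: you gather ingredients (data), follow a recipe (code), and create a delicious dish (output). The processor supplies two components: a feature extractor, which converts raw audio into the log-Mel spectrogram features the model expects, and a tokenizer, which decodes the model’s output tokens into text. The pipeline ties these together and handles chunking and batching for you.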
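
To build some intuition for what `chunk_length_s=30` does, here is a simplified sketch of how a long recording could be cut into overlapping windows. It is an illustration only, not the pipeline’s actual implementation: `chunk_windows` is a hypothetical helper, and the 5-second overlap on each side is an assumed value.

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) windows, in seconds, that cover the whole recording.

    Neighbouring windows overlap by 2 * stride_s so that words cut at a
    window edge appear intact in the adjacent window.
    """
    if duration_s <= 0:
        return []
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must be larger than 2 * stride_s")

    step = chunk_length_s - 2 * stride_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step
    return windows


# A 70-second recording becomes three overlapping 30-second windows.
print(chunk_windows(70))
```

Each window is transcribed separately, which is why `batch_size` matters: several windows can be processed at once.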

Step 3: Transcribing Local Audio Files

Transcribing a local audio file is as simple as changing the input. Replace `sample` with your audio file path (decoding a file from a path requires ffmpeg to be installed on your system):


result = pipe("audio.mp3")
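
If you have a whole folder of recordings, the same call can be wrapped in a small loop. The sketch below assumes `pipe` is the pipeline object created earlier; `transcribe_folder` and `AUDIO_EXTENSIONS` are hypothetical names, and the extension list is an assumption you should adapt to your files.

```python
from pathlib import Path

# Assumed set of audio file extensions; adjust to match your data.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}


def transcribe_folder(pipe, folder, extensions=AUDIO_EXTENSIONS):
    """Transcribe every audio file in `folder` and return {file name: text}.

    `pipe` is any callable that takes a file path and returns a dict with a
    "text" key, such as the pipeline built earlier in this guide.
    """
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in extensions:
            results[path.name] = pipe(str(path))["text"]
    return results


# Usage, with the pipeline from Step 2 (the folder name is a placeholder):
# transcripts = transcribe_folder(pipe, "recordings")
# for name, text in transcripts.items():
#     print(name, "->", text)
```

Files are processed one at a time here to keep the example simple.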

Advanced Features

Speech Translation

If you want Whisper to translate speech from another language into English, set the task to `"translate"` in `generate_kwargs`:


result = pipe(sample, generate_kwargs={"task": "translate"})

This is akin to having a tour guide who not only tells you about a place but also translates the local dialect!
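
When you switch between transcription and translation often, a small helper can keep the options tidy. This is a minimal sketch: `build_generate_kwargs` and `VALID_TASKS` are hypothetical names, and the optional `language` entry assumes you already know the language spoken in the audio (Whisper detects it automatically when it is left out).

```python
VALID_TASKS = ("transcribe", "translate")


def build_generate_kwargs(task="transcribe", language=None):
    """Build the `generate_kwargs` dict passed to the pipeline call.

    task: "transcribe" keeps the spoken language, "translate" outputs English.
    language: optional source language of the audio, for example "french".
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


# Usage, with the pipeline and sample from Step 2:
# result = pipe(sample, generate_kwargs=build_generate_kwargs("translate", language="french"))
print(build_generate_kwargs("translate", language="french"))
```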

Adding Timestamps

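To receive timestamps in your output, use the `return_timestamps` parameter. Setting `return_timestamps=True` gives segment-level (roughly sentence-level) timestamps, while `return_timestamps="word"` gives word-level timestamps: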


result = pipe(sample, return_timestamps=True)
print(result["chunks"])

Just like a director calls the timing for scenes in a movie, these timestamps help you understand exactly when each line was spoken.
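
One common use for these timestamps is generating subtitles. The sketch below turns the entries of `result["chunks"]` into SubRip (SRT) text, assuming each entry is a dict with a `"timestamp"` pair of seconds and a `"text"` string. `format_timestamp` and `chunks_to_srt` are hypothetical helpers, and treating a missing end time as equal to the start time is an assumption made for illustration.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (the SRT timestamp style)."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Convert a list of {"timestamp": (start, end), "text": str} to SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: if the final segment has no end time, reuse the start.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Usage, with the result from the timestamp example:
# with open("audio.srt", "w", encoding="utf-8") as handle:
#     handle.write(chunks_to_srt(result["chunks"]))
```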

Troubleshooting

Even seasoned developers encounter bumps along the way. Here are some common troubleshooting tips:

– Ensure you have the latest libraries installed. Sometimes, outdated packages can lead to compatibility issues.
– Check your audio format. The pipeline decodes audio files through ffmpeg, so most common formats work; if you’re encountering issues, make sure ffmpeg is installed and that your audio file is not corrupted.
– Memory issues? If you run into memory errors, consider using a smaller model (for example, `openai/whisper-small`) or lowering `batch_size` in the pipeline.
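
For the memory tip, one possible approach is to retry with a smaller batch size whenever an out-of-memory error occurs. The following is a minimal sketch rather than a definitive recipe: `transcribe_with_backoff` is a hypothetical helper, `pipe` is assumed to be the pipeline created earlier, and detecting the error by looking for "out of memory" in the message is an assumption about how the error is reported.

```python
def transcribe_with_backoff(pipe, audio, batch_size=16, min_batch_size=1):
    """Call `pipe`, halving `batch_size` after each out-of-memory error.

    Returns a tuple of (result, batch size that finally worked).
    Errors unrelated to memory are re-raised unchanged.
    """
    while True:
        try:
            return pipe(audio, batch_size=batch_size), batch_size
        except RuntimeError as error:
            out_of_memory = "out of memory" in str(error).lower()
            if not out_of_memory or batch_size <= min_batch_size:
                raise
            # With PyTorch on a GPU, calling torch.cuda.empty_cache() here
            # may also help before retrying.
            batch_size = max(min_batch_size, batch_size // 2)


# Usage, with the pipeline and sample from Step 2:
# result, used_batch_size = transcribe_with_backoff(pipe, sample)
# print(used_batch_size, result["text"])
```

If even a batch size of 1 fails, switching to a smaller model is the next thing to try.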

For more troubleshooting questions/issues, contact our fxis.ai data scientist expert team.

Conclusion

Whisper is a remarkable model that not only simplifies the process of speech recognition but also opens avenues for innovation in multilingual communication. By following the steps outlined in this guide, you can transcribe and translate audio files efficiently.

Remember, practice makes perfect! The more you experiment with Whisper, the better you’ll understand its capabilities and nuances. Happy transcribing!
