Understanding Distil-Whisper: A High-Performance Speech Recognition Model

Jun 8, 2024 | Educational

The world of automatic speech recognition has taken a leap with the introduction of Distil-Whisper, a distilled version of OpenAI’s Whisper that is significantly faster and smaller while staying close to the original model’s accuracy. Today, we’ll explore how to use Distil-Whisper effectively for your transcription needs and address common troubleshooting questions that might come up along the way.

Transformers Usage

To get started with Distil-Whisper, install the latest versions of the Hugging Face 🤗 Transformers library along with accelerate and datasets:

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets
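
To confirm the upgrade took effect, a quick sanity check (purely illustrative, not part of the official setup) is to print the installed versions:

import transformers
import accelerate
import datasets

# Print the installed versions to confirm the upgrade took effect
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("datasets:", datasets.__version__)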

Short-Form Transcription

Transcribing audio files shorter than 30 seconds can be done using the pipeline class:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"

# Load the model in the chosen precision and move it to the target device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Build an ASR pipeline that wraps the model, tokenizer, and feature extractor
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Long-Form Transcription

For audio files longer than 30 seconds, the same pipeline can be reused for sequential long-form transcription, where the audio is processed in consecutive 30-second segments:

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Library Integrations

Distil-Whisper also integrates with several third-party libraries. Here are two popular integrations:

Faster-Whisper

Faster-Whisper reimplements Whisper on top of the CTranslate2 inference engine for faster inference. Install it with:

pip install --upgrade pip
pip install --upgrade git+https://github.com/SYSTRAN/faster-whisper datasets
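
A minimal transcription sketch with the Faster-Whisper API follows; the audio path is a placeholder, float16 assumes a CUDA GPU, and it assumes the distil-large-v3 checkpoint is available under that name in faster-whisper:

from faster_whisper import WhisperModel

# Load the CTranslate2 conversion of Distil-Whisper
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

# transcribe() returns a generator of segments plus metadata about the audio
segments, info = model.transcribe("path/to/your_audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")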

Transformers.js

Use Distil-Whisper directly in the browser with Transformers.js:

npm i @xenova/transformers

Model Details

Distil-Whisper follows Whisper’s encoder-decoder architecture: the encoder maps the audio into a sequence of hidden states, and the decoder autoregressively predicts text tokens conditioned on those states. Distillation keeps the full encoder but shrinks the decoder to only a few layers, which is where most of the speed-up comes from while accuracy is largely preserved.
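
To see this asymmetry in the released checkpoint, you can inspect the model configuration. This is just an illustrative check; for distil-large-v3 you should see far more encoder layers than decoder layers:

from transformers import AutoConfig

# Compare encoder depth vs. decoder depth for the distilled checkpoint
config = AutoConfig.from_pretrained("distil-whisper/distil-large-v3")
print("encoder layers:", config.encoder_layers)
print("decoder layers:", config.decoder_layers)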

Troubleshooting Tips

Here are some common issues and solutions:

  • Model Not Found Error: Verify the model ID against the Hugging Face Model Hub; a quick programmatic check is sketched after this list.
  • Out of Memory Error: Reduce the batch size, load the model in float16, or use a GPU with more memory.
  • Slow Transcription: For long audio files, pass the chunk_length_s (and batch_size) parameters to the pipeline, as shown in the Long-Form Transcription section above.
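
For the first point, one way to verify a model ID programmatically (a small sketch using huggingface_hub, which is installed alongside transformers) is:

from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

# Confirm the repository exists on the Hub before trying to load it
try:
    info = model_info("distil-whisper/distil-large-v3")
    print("Found model:", info.id)
except RepositoryNotFoundError:
    print("Model ID not found on the Hugging Face Hub")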

For further insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
