Distil-Whisper is a distilled version of OpenAI's Whisper speech recognition model, designed to run faster while keeping accuracy close to the original. This guide shows how to use Distil-Whisper for transcription and how to resolve common problems along the way.
Transformers Usage
To get started with Distil-Whisper, ensure you have the latest version of the Hugging Face 🤗 Transformers library:
pip install --upgrade pip
pip install --upgrade transformers accelerate "datasets[audio]"
Short-Form Transcription
Transcribing audio files shorter than 30 seconds can be done using the pipeline class:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
# Use the GPU with half precision when available, otherwise CPU with float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
# The processor bundles the tokenizer and the audio feature extractor
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Long-Form Transcription
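For audio files longer than 30 seconds, you can use the sequential long-form transcription algorithm. The example reuses the pipe object created in the short-form section and only swaps in a longer audio sample: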
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
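Both sections hinge on the 30-second boundary, so it can help to check a clip's duration before picking a path. The following is a minimal sketch, not part of Distil-Whisper or Transformers: the helper names wav_duration_seconds and needs_long_form are hypothetical, and the code assumes an uncompressed WAV file that Python's standard wave module can read.

```python
import wave

# Boundary between short-form and long-form audio, as used in this guide
SHORT_FORM_LIMIT_S = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def needs_long_form(path: str) -> bool:
    """True when the clip is longer than the 30-second short-form limit."""
    return wav_duration_seconds(path) > SHORT_FORM_LIMIT_S
```

Other formats such as MP3 or FLAC would need a different reader, so treat this only as a quick check for WAV input.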
Library Integrations
Distil-Whisper is also supported by several other libraries. Here are a few popular integrations:
Faster-Whisper
Install Faster-Whisper to leverage faster inference:
pip install --upgrade pip
pip install --upgrade git+https://github.com/SYSTRAN/faster-whisper "datasets[audio]"
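Once installed, transcription could look roughly like the sketch below. It assumes the installed Faster-Whisper release recognizes the model name "distil-large-v3" and that the audio is in English. The function names transcribe_with_faster_whisper and format_segment are hypothetical helpers introduced for illustration, and the int8 compute type on CPU is an assumed choice, not a recommendation from the source.

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one transcribed segment as '[start -> end] text'."""
    return "[%.2fs -> %.2fs] %s" % (start, end, text.strip())


def transcribe_with_faster_whisper(audio_path: str, use_gpu: bool = True) -> list[str]:
    """Transcribe a file and return one formatted line per segment (sketch)."""
    # Imported lazily so format_segment works without faster-whisper installed
    from faster_whisper import WhisperModel

    model = WhisperModel(
        "distil-large-v3",  # assumed to be a model name known to faster-whisper
        device="cuda" if use_gpu else "cpu",
        compute_type="float16" if use_gpu else "int8",
    )
    # transcribe() returns a lazy generator of segments plus an info object
    segments, _info = model.transcribe(
        audio_path,
        beam_size=1,
        language="en",
        condition_on_previous_text=False,
    )
    return [format_segment(s.start, s.end, s.text) for s in segments]
```

Because the segments are produced lazily, the actual decoding happens while the list comprehension iterates over them.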
Transformers.js
Use Distil-Whisper directly in the browser with Transformers.js:
npm i @xenova/transformers
Model Details
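Distil-Whisper follows Whisper's encoder-decoder architecture: the encoder converts audio into hidden states, and the decoder generates text tokens one at a time from those states. The encoder runs once per audio input, while the decoder runs once for every generated token, so most of the speed-up comes from using a much smaller decoder while keeping the encoder intact.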
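To make the encoder-decoder split more concrete, here is a rough back-of-the-envelope sketch that counts layer calls: the encoder stack runs once per input, and the decoder stack runs once per generated token. The layer counts are assumptions stated for illustration (32 encoder and 32 decoder layers for Whisper large-v3, 32 encoder and 2 decoder layers for distil-large-v3), and layer_invocations is a hypothetical helper. This is a count of layer calls, not a timing model, since encoder and decoder layers do different amounts of work.

```python
def layer_invocations(encoder_layers: int, decoder_layers: int, generated_tokens: int) -> int:
    """Count layer calls: encoder once per input, decoder once per token."""
    return encoder_layers + decoder_layers * generated_tokens


# Assumed layer counts, for illustration only
WHISPER_LARGE_V3 = {"encoder_layers": 32, "decoder_layers": 32}
DISTIL_LARGE_V3 = {"encoder_layers": 32, "decoder_layers": 2}

if __name__ == "__main__":
    tokens = 100  # hypothetical transcript length in tokens
    print("whisper:", layer_invocations(generated_tokens=tokens, **WHISPER_LARGE_V3))
    print("distil: ", layer_invocations(generated_tokens=tokens, **DISTIL_LARGE_V3))
```

The longer the transcript, the more the decoder term dominates, which is why shrinking the decoder matters more than shrinking the encoder in this simplified view.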
Troubleshooting Tips
Here are some common issues and solutions:
- Model Not Found Error: Ensure that you have the correct model ID by verifying with the Hugging Face Model Hub.
- Out of Memory Error: Reduce the batch size or use a GPU with more memory.
- Slow Transcription: For long audio files where speed is crucial, pass the chunk_length_s parameter to the pipeline so that the audio is split into chunks that can be transcribed in batches.
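The last two tips pull in opposite directions, since larger batches are faster but use more memory. As a minimal sketch, the settings could be gathered in one place as shown below. The helper names asr_pipeline_kwargs and build_pipeline are hypothetical, and the values 25 (chunk length in seconds) and 16 (batch size) are assumed starting points to tune for your hardware, not fixed requirements.

```python
def asr_pipeline_kwargs(long_audio: bool, has_gpu: bool) -> dict:
    """Build extra keyword arguments for the ASR pipeline (hypothetical helper)."""
    kwargs = {"max_new_tokens": 128}
    if long_audio:
        # Enable chunked transcription; 25 seconds is an assumed starting value
        kwargs["chunk_length_s"] = 25
        # Lower the batch size if you hit out-of-memory errors
        kwargs["batch_size"] = 16 if has_gpu else 1
    return kwargs


def build_pipeline(model_id: str = "distil-whisper/distil-large-v3", long_audio: bool = True):
    """Create the pipeline with settings chosen by asr_pipeline_kwargs (sketch)."""
    # Imported lazily so the kwargs helper can be used without torch installed
    import torch
    from transformers import pipeline

    has_gpu = torch.cuda.is_available()
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.float16 if has_gpu else torch.float32,
        device="cuda:0" if has_gpu else "cpu",
        **asr_pipeline_kwargs(long_audio, has_gpu),
    )
```

Calling build_pipeline() downloads the model on first use, so the model ID must match an existing entry on the Hugging Face Model Hub.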

