WhisperX: Enhancing Automatic Speech Recognition


[Figure: WhisperX architectural diagram]

What is WhisperX?

WhisperX is a tool for automatic speech recognition (ASR) that adds accurate word-level timestamps and speaker diarization on top of OpenAI's Whisper. Using batched inference with the large-v2 model, it can transcribe audio at up to 70x real-time speed. It combines forced phoneme alignment and voice activity detection (VAD) to improve timestamp quality, making it a versatile solution for a range of audio processing needs.
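To put the 70x real-time figure (the project's claim for large-v2 with batched inference) in perspective, a quick back-of-envelope calculation:

```python
# Estimated transcription time for a one-hour recording
# at a claimed 70x real-time speedup.
audio_seconds = 60 * 60  # one hour of audio
speedup = 70
print(f"{audio_seconds / speedup:.0f} seconds")  # about 51 seconds
```

In other words, an hour of speech transcribes in under a minute, which is what makes batch-processing large archives practical.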


Setup

To make the most out of WhisperX, follow these setup instructions:

  • Tested with PyTorch 2.0 and Python 3.10. GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x.
  • Step 1: Create and activate a Python environment
    conda create --name whisperx python=3.10
    conda activate whisperx
  • Step 2: Install PyTorch for your system
    conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
  • Step 3: Install WhisperX from the GitHub repository
    pip install git+https://github.com/m-bain/whisperX.git

Example Usage

Running WhisperX for English Speech Transcription

Run WhisperX on a sample audio file to transcribe speech and visualize word timings:

whisperx examples/sample01.wav --highlight_words True

This yields better-aligned word timestamps than the default Whisper model. For higher accuracy, use a larger ASR model together with a larger alignment model:

whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
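If you prefer to drive the CLI from a script, a small helper can assemble the same invocation. The helper name `build_cmd` is our own illustration; only the flags come from the examples above:

```python
import shlex


def build_cmd(audio, model="large-v2", batch_size=4, highlight_words=True):
    # Assemble a whisperx command line using the flags shown above.
    cmd = ["whisperx", audio, "--model", model, "--batch_size", str(batch_size)]
    if highlight_words:
        cmd += ["--highlight_words", "True"]
    return cmd


print(shlex.join(build_cmd("sample01.wav")))
# whisperx sample01.wav --model large-v2 --batch_size 4 --highlight_words True
```

The resulting list can be passed directly to subprocess.run, avoiding shell-quoting pitfalls.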

Python Usage

For a more programmatic approach, here’s how you can use WhisperX in Python:


import gc

import torch
import whisperx

device = 'cuda'
audio_file = 'audio.mp3'
batch_size = 16  # reduce if you run low on GPU memory
compute_type = 'float16'  # switch to 'int8' on low-memory GPUs

# 1. Load the model
model = whisperx.load_model('large-v2', device, compute_type=compute_type)

# 2. Load the audio and transcribe
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result['segments'])  # segment-level transcription, before alignment

# 3. Align the output for accurate word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result['language'], device=device)
result = whisperx.align(result['segments'], model_a, metadata, audio, device)
print(result['segments'])  # now with word-level timestamps

# 4. Free GPU resources when done (delete models before emptying the cache)
del model, model_a
gc.collect()
torch.cuda.empty_cache()
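Speaker diarization, the other headline feature, follows the same pattern. The sketch below is based on the diarization API described in the WhisperX README; it needs a Hugging Face access token (the placeholder YOUR_HF_TOKEN here) and a GPU, so treat it as a sketch rather than a drop-in script:

```python
import whisperx

# Assumes `audio`, `result`, and `device` from the transcription example above.
diarize_model = whisperx.DiarizationPipeline(use_auth_token='YOUR_HF_TOKEN', device=device)
diarize_segments = diarize_model(audio)

# Attach speaker labels to the aligned transcript.
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result['segments'])  # each segment now carries a speaker label
```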

Limitations

  • Transcripts may include words without alignment when those words have no phoneme representation, e.g. numbers, currency amounts, or dates.
  • Overlapping speech is not handled well.
  • Diarization accuracy is imperfect, and a language-specific alignment model is required for each language.
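The first limitation can be anticipated before running alignment: tokens containing digits or currency symbols generally cannot be mapped to phonemes. A small self-contained filter (our own illustration, not WhisperX code) shows the idea:

```python
import re


def is_alignable(word: str) -> bool:
    # Purely alphabetic words (plus apostrophes) can be mapped to phonemes;
    # tokens like "$5" or "2024" typically cannot and are skipped.
    return re.fullmatch(r"[A-Za-z']+", word) is not None


words = ["hello", "$5", "2024", "it's", "world"]
print([w for w in words if is_alignable(w)])  # ['hello', "it's", 'world']
```

Normalizing such tokens to spelled-out words before transcription is one way to recover timestamps for them.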

Troubleshooting Tips

Here are some common issues and how to resolve them:

  • Slow Performance: If you encounter slow processing speeds, consider reducing the batch size or switching to a smaller ASR model. This may impact output quality, so find a balance that works for you.
  • Dependency Conflicts: If you face issues with speaker diarization, check for conflicts in your library dependencies, especially regarding pyannote-audio.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
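One way to act on the batch-size tip automatically is to retry with a smaller batch whenever memory runs out. The sketch below uses a stand-in fake_transcribe function so it runs anywhere; in practice the inner call would be model.transcribe(audio, batch_size=batch_size):

```python
def transcribe_with_fallback(transcribe, batch_size=16):
    # Halve the batch size until the call fits in memory.
    while batch_size >= 1:
        try:
            return transcribe(batch_size)
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("transcription failed even at batch_size=1")


# Stand-in for model.transcribe: pretend anything above batch size 4 runs out of memory.
def fake_transcribe(batch_size):
    if batch_size > 4:
        raise MemoryError
    return f"ok at batch_size={batch_size}"


print(transcribe_with_fallback(fake_transcribe))  # ok at batch_size=4
```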

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Contact Support

If you have queries or need assistance, please reach out to maxhbain@gmail.com.
