WhisperX: Enhancing Automatic Speech Recognition


[Figure: WhisperX architectural diagram]

What is WhisperX?

WhisperX is a tool for automatic speech recognition (ASR) that adds accurate word-level timestamps and speaker diarization on top of OpenAI's Whisper. Using batched inference with the large-v2 model, it can transcribe audio at up to 70x real-time speed. It combines forced phoneme alignment and voice activity detection (VAD) to improve timestamp quality, making it a versatile solution for a range of audio processing needs.
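To put the 70x real-time figure (the project's claim for large-v2 with batched inference) in perspective, a quick back-of-envelope calculation:

```python
# Estimated transcription time for a one-hour recording
# at a claimed 70x real-time speedup.
audio_seconds = 60 * 60  # one hour of audio
speedup = 70
print(f"{audio_seconds / speedup:.0f} seconds")  # about 51 seconds
```

In other words, an hour of speech transcribes in under a minute, which is what makes batch-processing large archives practical.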


Setup

To make the most out of WhisperX, follow these setup instructions:

  • Tested with PyTorch 2.0 and Python 3.10. GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x.
  • Step 1: Create and activate a Python environment
    conda create --name whisperx python=3.10
    conda activate whisperx
  • Step 2: Install PyTorch for your system
    conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
  • Step 3: Install WhisperX from the GitHub repository
    pip install git+https://github.com/m-bain/whisperX.git

Example Usage

Running WhisperX for English Speech Transcription

Run WhisperX on a sample audio file to transcribe speech and visualize word timings:

whisperx examples/sample01.wav --highlight_words True

This yields better-aligned word timestamps than the default Whisper model. For higher accuracy, use a larger ASR model together with a larger alignment model:

whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
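If you prefer to drive the CLI from a script, a small helper can assemble the same invocation. The helper name `build_cmd` is our own illustration; only the flags come from the examples above:

```python
import shlex


def build_cmd(audio, model="large-v2", batch_size=4, highlight_words=True):
    # Assemble a whisperx command line using the flags shown above.
    cmd = ["whisperx", audio, "--model", model, "--batch_size", str(batch_size)]
    if highlight_words:
        cmd += ["--highlight_words", "True"]
    return cmd


print(shlex.join(build_cmd("sample01.wav")))
# whisperx sample01.wav --model large-v2 --batch_size 4 --highlight_words True
```

The resulting list can be passed directly to subprocess.run, avoiding shell-quoting pitfalls.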

Python Usage

For a more programmatic approach, here’s how you can use WhisperX in Python:


import gc

import torch
import whisperx

device = 'cuda'
audio_file = 'audio.mp3'
batch_size = 16  # reduce if you run low on GPU memory
compute_type = 'float16'  # switch to 'int8' on low-memory GPUs

# 1. Load the model
model = whisperx.load_model('large-v2', device, compute_type=compute_type)

# 2. Load the audio and transcribe
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result['segments'])  # segment-level transcription, before alignment

# 3. Align the output for accurate word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result['language'], device=device)
result = whisperx.align(result['segments'], model_a, metadata, audio, device)
print(result['segments'])  # now with word-level timestamps

# 4. Free GPU resources when done (delete models before emptying the cache)
del model, model_a
gc.collect()
torch.cuda.empty_cache()
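Speaker diarization, the other headline feature, follows the same pattern. The sketch below is based on the diarization API described in the WhisperX README; it needs a Hugging Face access token (the placeholder YOUR_HF_TOKEN here) and a GPU, so treat it as a sketch rather than a drop-in script:

```python
import whisperx

# Assumes `audio`, `result`, and `device` from the transcription example above.
diarize_model = whisperx.DiarizationPipeline(use_auth_token='YOUR_HF_TOKEN', device=device)
diarize_segments = diarize_model(audio)

# Attach speaker labels to the aligned transcript.
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result['segments'])  # each segment now carries a speaker label
```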

Limitations

  • Transcripts may include words without alignment when those words have no phoneme representation, e.g. numbers, currency amounts, or dates.
  • Overlapping speech is not handled well.
  • Diarization accuracy is imperfect, and a language-specific alignment model is required for each language.
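The first limitation can be anticipated before running alignment: tokens containing digits or currency symbols generally cannot be mapped to phonemes. A small self-contained filter (our own illustration, not WhisperX code) shows the idea:

```python
import re


def is_alignable(word: str) -> bool:
    # Purely alphabetic words (plus apostrophes) can be mapped to phonemes;
    # tokens like "$5" or "2024" typically cannot and are skipped.
    return re.fullmatch(r"[A-Za-z']+", word) is not None


words = ["hello", "$5", "2024", "it's", "world"]
print([w for w in words if is_alignable(w)])  # ['hello', "it's", 'world']
```

Normalizing such tokens to spelled-out words before transcription is one way to recover timestamps for them.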

Troubleshooting Tips

Here are some common issues and how to resolve them:

  • Slow Performance: If you encounter slow processing speeds, consider reducing the batch size or switching to a smaller ASR model. This may impact output quality, so find a balance that works for you.
  • Dependency Conflicts: If you face issues with speaker diarization, check for conflicts in your library dependencies, especially regarding pyannote-audio.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
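One way to act on the batch-size tip automatically is to retry with a smaller batch whenever memory runs out. The sketch below uses a stand-in fake_transcribe function so it runs anywhere; in practice the inner call would be model.transcribe(audio, batch_size=batch_size):

```python
def transcribe_with_fallback(transcribe, batch_size=16):
    # Halve the batch size until the call fits in memory.
    while batch_size >= 1:
        try:
            return transcribe(batch_size)
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("transcription failed even at batch_size=1")


# Stand-in for model.transcribe: pretend anything above batch size 4 runs out of memory.
def fake_transcribe(batch_size):
    if batch_size > 4:
        raise MemoryError
    return f"ok at batch_size={batch_size}"


print(transcribe_with_fallback(fake_transcribe))  # ok at batch_size=4
```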

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Contact Support

If you have queries or need assistance, please reach out to maxhbain@gmail.com.
