How to Implement Whisper-Timestamped Multilingual Automatic Speech Recognition

Nov 6, 2020 | Data Science

In the world of artificial intelligence, voice recognition technology has made leaps and bounds, especially with models like Whisper by OpenAI. However, what if we want even more precise results, such as word-level timestamps and confidence scores? Enter whisper-timestamped, an extension designed to enhance the Whisper model’s capabilities. This blog will guide you step-by-step through the installation and usage of this powerful tool.

Installation

Before we dive into the implementation, we need to set up the environment correctly. Here’s how to get started:

First Installation

  • Ensure you have Python 3.7+, with a recommendation for version 3.9 or higher.
  • Install ffmpeg. You can find installation instructions in the Whisper repository.

You can install whisper-timestamped via pip:

pip3 install whisper-timestamped

Alternatively, clone the repository and install from source:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped
python3 setup.py install

Additional Packages that Might Be Needed

  • To plot alignment between audio timestamps and words, install matplotlib:

pip3 install matplotlib

  • For Voice Activity Detection (VAD), install onnxruntime and torchaudio:

pip3 install onnxruntime torchaudio

  • If using fine-tuned Whisper models from Hugging Face, install transformers:

pip3 install transformers
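Before moving on, you can check which of these optional dependencies are already present in your environment. The snippet below is a small convenience check using only the standard library; it is not part of whisper-timestamped itself:

```python
from importlib.util import find_spec

# Optional dependencies and the features they unlock
optional_deps = {
    "matplotlib": "plotting word alignment",
    "onnxruntime": "Voice Activity Detection (VAD)",
    "torchaudio": "Voice Activity Detection (VAD)",
    "transformers": "fine-tuned Hugging Face models",
}

for module, feature in optional_deps.items():
    status = "installed" if find_spec(module) else "missing"
    print(f"{module:<12} {status:<10} ({feature})")
```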

Docker Installation

You can build a Docker image for whisper-timestamped:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped
docker build -t whisper_timestamped:latest .

Light Installation for CPU

If you’re working on a CPU without GPU, you can install a lighter version of Torch:

pip3 install torch==1.13.1+cpu torchaudio==0.13.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Upgrade to the Latest Version

To keep your installation up to date:

pip3 install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped

Usage

Now that we’re all set up, let’s see how to use whisper-timestamped effectively.

Python Usage

Using the whisper_timestamped Python function to transcribe audio is straightforward:

import whisper_timestamped as whisper

audio = whisper.load_audio("AUDIO.wav")
model = whisper.load_model("tiny", device="cpu")
result = whisper.transcribe(model, audio, language="fr")
print(result)

This snippet transcribes the audio and returns a dictionary containing the full transcript along with word-level timestamps and confidence scores.
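The returned dictionary can be post-processed however you like. As an illustration (a standard-library-only sketch, not a built-in whisper-timestamped feature), here is one way to turn the segments into an SRT subtitle file:

```python
def format_srt_time(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(result: dict) -> str:
    # Build one numbered SRT block per segment
    lines = []
    for i, seg in enumerate(result["segments"], start=1):
        lines.append(str(i))
        lines.append(f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)

# Minimal stand-in with the same shape as a whisper-timestamped result
result = {"segments": [{"start": 0.5, "end": 1.2, "text": "Bonjour!"}]}
print(segments_to_srt(result))
```

You could write the returned string to a .srt file and load it in any media player that supports subtitles.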

Command Line Usage

You can also use whisper-timestamped in the command line:

whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .

Plot of Word Alignment

To visualize the alignment of words in an audio segment, use:

whisper_timestamped AUDIO.wav --model tiny --plot

This will generate a visual representation of the word alignment based on your audio file.

Example Output

The output of the transcribe function will look like this:

{
  "text": "Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": "Bonjour!",
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    }
  ]
}
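Once you have this structure, flagging uncertain words is a simple traversal. The sketch below uses only the standard library and the example output above; the 0.6 threshold is an arbitrary cutoff chosen for illustration:

```python
import json

# The example output from above, as a JSON string
raw = '''{
  "text": "Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {"id": 0, "seek": 0, "start": 0.5, "end": 1.2, "text": "Bonjour!",
     "confidence": 0.51,
     "words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2, "confidence": 0.51}]}
  ]
}'''

result = json.loads(raw)

THRESHOLD = 0.6  # arbitrary cutoff for this illustration
low_confidence = [
    (w["text"], w["start"], w["end"], w["confidence"])
    for seg in result["segments"]
    for w in seg["words"]
    if w["confidence"] < THRESHOLD
]

for text, start, end, conf in low_confidence:
    print(f"{start:6.2f}-{end:6.2f}  {text}  (confidence {conf:.2f})")
```

Words flagged this way are good candidates for manual review.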

Options to Improve Results

There are several options you can enable to enhance transcription quality:

Accurate Whisper Transcription

Use the --accurate option, a shortcut for more careful decoding settings, to get better results at the cost of slower inference:

whisper_timestamped --accurate ...

Voice Activity Detection (VAD)

VAD filters out portions of the audio that contain no speech before transcription, which helps avoid hallucinated text. To enable it:

whisper_timestamped --vad True ...

Detecting Disfluencies

To better manage speech irregularities, you can enable disfluency detection:

whisper_timestamped --detect_disfluencies True ...
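When disfluency detection is enabled, whisper-timestamped marks hesitations and filler sounds in the output with a special "[*]" word token. If you want a clean word list downstream, a standard-library sketch like the following can strip those markers (the helper itself is illustrative, not part of the library):

```python
DISFLUENCY_TOKEN = "[*]"  # token whisper-timestamped uses for detected disfluencies

def strip_disfluencies(words):
    """Return the word list without disfluency marker tokens."""
    return [w for w in words if w["text"] != DISFLUENCY_TOKEN]

# Hypothetical word list in the same shape as the transcribe output
words = [
    {"text": "so", "start": 0.2, "end": 0.4},
    {"text": "[*]", "start": 0.4, "end": 0.9},   # a detected hesitation
    {"text": "hello", "start": 0.9, "end": 1.3},
]
print([w["text"] for w in strip_disfluencies(words)])
```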

Troubleshooting

If you run into issues, consider these tips:

  • Ensure all dependencies are correctly installed and up to date.
  • Check your audio file format; compatible formats are critical for successful transcription.
  • Adjust the model settings if you notice discrepancies in transcription quality.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
