In the world of artificial intelligence, speech recognition has advanced by leaps and bounds, especially with models like OpenAI's Whisper. But what if you need more granular results, such as word-level timestamps and confidence scores? Enter whisper-timestamped, an extension designed to enhance the Whisper model's capabilities. This blog will guide you step by step through installing and using this powerful tool.
Installation
Before we dive into the implementation, we need to set up the environment correctly. Here’s how to get started:
First Installation
- Ensure you have Python 3.7+, with a recommendation for version 3.9 or higher.
- Install ffmpeg. You can find installation instructions in the Whisper repository.
You can install whisper-timestamped via pip or by cloning the repository:
pip3 install whisper-timestamped
git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped
python3 setup.py install
Additional Packages that Might Be Needed
- To plot the alignment between audio timestamps and words, install matplotlib:
pip3 install matplotlib
- To use the voice activity detection (VAD) option, install onnxruntime and torchaudio:
pip3 install onnxruntime torchaudio
- To load fine-tuned Whisper models from the Hugging Face Hub, install transformers:
pip3 install transformers
Docker Installation
You can build a Docker image for whisper-timestamped:
git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped
docker build -t whisper_timestamped:latest .
Light Installation for CPU
If you're working on a machine without a GPU, you can install a lighter, CPU-only build of Torch:
pip3 install torch==1.13.1+cpu torchaudio==0.13.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
Upgrade to the Latest Version
To keep your installation up to date:
pip3 install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped
Usage
Now that we’re all set up, let’s see how to use whisper-timestamped effectively.
Python Usage
Using the whisper_timestamped Python function to transcribe audio is straightforward:
import whisper_timestamped as whisper
audio = whisper.load_audio("AUDIO.wav")
model = whisper.load_model("tiny", device="cpu")
result = whisper.transcribe(model, audio, language="fr")
print(result)
This snippet transcribes the audio and returns a result dictionary that includes, for every segment, word-level timestamps and confidence scores.
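Because the result carries start and end times for every word, it can be turned into word-level subtitles with a few lines of code. The sketch below is illustrative, not part of the whisper-timestamped API: the `to_srt_time` helper and the hand-written sample `result` dict are both assumptions standing in for a real `transcribe()` result.

```python
# Minimal sketch: convert word-level timestamps from a transcribe() result
# into SRT subtitle entries. `result` here is a hand-written sample; in
# practice it would come from whisper.transcribe(...).
result = {
    "segments": [
        {"words": [
            {"text": "Bonjour!", "start": 0.5, "end": 1.2, "confidence": 0.51},
        ]},
    ],
}

def to_srt_time(seconds):
    # Format seconds as the SRT timestamp "HH:MM:SS,mmm".
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

lines = []
index = 1
for segment in result["segments"]:
    for word in segment["words"]:
        lines.append(
            f"{index}\n"
            f"{to_srt_time(word['start'])} --> {to_srt_time(word['end'])}\n"
            f"{word['text']}\n"
        )
        index += 1

srt_text = "\n".join(lines)
print(srt_text)
```

Writing `srt_text` to a `.srt` file yields subtitles that highlight one word at a time.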
Command Line Usage
You can also use whisper-timestamped in the command line:
whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .
Plot of Word Alignment
To visualize the alignment of words in an audio segment, use:
whisper_timestamped AUDIO.wav --plot
This will generate a visual representation of the word alignment based on your audio file.
Example Output
The output of the transcribe function will look like this:
{
  "text": "Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": "Bonjour!",
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    }
  ]
}
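Since every word carries its own confidence score, the output lends itself to simple post-processing. The snippet below is a sketch against a sample dict shaped like the output above: it collects words whose confidence falls under a threshold (the 0.6 cutoff is an arbitrary choice for illustration).

```python
# Sketch: scan a transcribe() result for low-confidence words.
# `result` mirrors the example output above; the 0.6 threshold is arbitrary.
result = {
    "text": "Bonjour! Est-ce que vous allez bien?",
    "segments": [
        {
            "id": 0,
            "start": 0.5,
            "end": 1.2,
            "text": "Bonjour!",
            "confidence": 0.51,
            "words": [
                {"text": "Bonjour!", "start": 0.5, "end": 1.2, "confidence": 0.51},
            ],
        },
    ],
}

THRESHOLD = 0.6

low_confidence = [
    (word["text"], word["start"], word["confidence"])
    for segment in result["segments"]
    for word in segment["words"]
    if word["confidence"] < THRESHOLD
]

for text, start, confidence in low_confidence:
    print(f"{start:6.2f}s  {text}  (confidence {confidence:.2f})")
```

Flagged words are good candidates for manual review or a second transcription pass with a larger model.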
Options to Improve Results
There are several options you can enable to enhance transcription quality:
Accurate Whisper Transcription
Use the following command to get better results:
whisper_timestamped --accurate ...
Voice Activity Detection (VAD)
This helps avoid hallucinations in transcriptions. To enable it:
whisper_timestamped --vad True ...
Detecting Disfluencies
To better manage speech irregularities, you can enable disfluency detection:
whisper_timestamped --detect_disfluencies True ...
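According to the project's README, detected disfluencies show up in the result as word entries whose text is the special marker "[*]". If you want clean prose, you can strip those entries afterwards; the sketch below uses a hand-written sample `result` dict for illustration.

```python
# Sketch: drop disfluency markers from a transcribe() result.
# whisper-timestamped emits detected disfluencies as words with text "[*]";
# the sample dict below is hand-written for illustration.
DISFLUENCY = "[*]"

result = {
    "segments": [
        {"words": [
            {"text": "So", "start": 0.1, "end": 0.3, "confidence": 0.9},
            {"text": DISFLUENCY, "start": 0.3, "end": 0.6, "confidence": 0.4},
            {"text": "anyway", "start": 0.6, "end": 1.0, "confidence": 0.85},
        ]},
    ],
}

# Remove every word entry that is just a disfluency marker.
for segment in result["segments"]:
    segment["words"] = [w for w in segment["words"] if w["text"] != DISFLUENCY]

clean_words = [w["text"] for s in result["segments"] for w in s["words"]]
print(clean_words)
```

Keeping the markers can also be useful: their timestamps tell you where hesitations occurred in the audio.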
Troubleshooting
If you run into issues, consider these tips:
- Ensure all dependencies are correctly installed and up to date.
- Check your audio file format; whisper-timestamped decodes audio through ffmpeg, so use a format ffmpeg supports (e.g. WAV, FLAC, MP3).
- Adjust the model settings if you notice discrepancies in transcription quality.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.