How to Use Whisper-Large-V3-French for Automatic Speech Recognition

Feb 9, 2024 | Educational

Welcome to a guide designed to help you harness the power of the Whisper-Large-V3-French model for Automatic Speech Recognition (ASR). This model is fine-tuned for French and well suited to tasks such as transcribing audio into text. Let’s delve into the details of how to set up and use this cutting-edge technology.

Table of Contents

  • Performance
  • Usage
  • Training Details
  • Acknowledgements
  • Troubleshooting

Performance

The Whisper-Large-V3-French model has been rigorously evaluated on various datasets, yielding impressive results. For instance, it achieved a Word Error Rate (WER) of 3.98% on the Multilingual LibriSpeech dataset. Results like these indicate that the model is both robust and accurate at recognizing French speech.

Usage

There are several ways to utilize the Whisper-Large-V3-French model. Below, we explore various approaches, each with its unique advantages.

Hugging Face Pipeline

The easiest way to use the model is through the Hugging Face pipeline. This method facilitates seamless audio transcription:

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])

Analogy

Think of using the Hugging Face pipeline as ordering a meal at a restaurant. You look at the menu (the model configuration) and place your order (the input audio). The chef (the processing backend) prepares your meal (transcribes the audio), and when it’s ready, the waiter (the pipeline) serves it to you. So simple, right?

Hugging Face Low-level APIs

For more control over the transcription process, you can use the low-level APIs. This approach allows you to customize input handling:

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract features
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# Generate tokens
predicted_ids = model.generate(input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
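
The same low-level path also handles batches: the Whisper feature extractor pads or truncates every input to 30 seconds, so multiple samples can be stacked together. A quick sketch reusing the objects loaded above (assuming 16 kHz audio):

# Batch several audio arrays into one forward pass
samples = [dataset[i]["audio"] for i in range(2)]
input_features = processor(
    [s["array"] for s in samples], sampling_rate=16000, return_tensors="pt"
).input_features

predicted_ids = model.generate(input_features.to(device, dtype=torch_dtype), max_new_tokens=128)
transcriptions = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcriptions)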

Speculative Decoding

Speculative decoding speeds up generation by pairing the main model with a smaller draft model: the draft proposes several tokens at a time and the main model verifies them in a single pass, so you get faster responses with the same output as standard decoding. A sketch of how this looks with the Hugging Face APIs is below.
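
With Hugging Face Transformers, speculative decoding amounts to passing the draft model as assistant_model to generate. A minimal sketch reusing the model and input_features from the previous section; the distilled checkpoint name below is an assumption, so check the author’s Hub page for the actual distilled variants:

from transformers import AutoModelForSpeechSeq2Seq

# Draft model: a distilled Whisper with fewer decoder layers
# (checkpoint name assumed; verify it exists on the Hub)
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bofenghuang/whisper-large-v3-french-distil-dec2", torch_dtype=torch_dtype
)
assistant_model.to(device)

# The main model verifies the draft's proposals, so the output
# matches standard decoding, only faster
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    assistant_model=assistant_model,
    max_new_tokens=128,
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)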

OpenAI Whisper

If you prefer the original OpenAI Whisper implementation, it uses a sequential (sliding-window) decoding algorithm that is well suited to long audio files. First, install the necessary package:

pip install -U openai-whisper

Then download the original-format checkpoint from the Hugging Face Hub:

python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='bofenghuang/whisper-large-v3-french', filename='original_model.pt', local_dir='./models/whisper-large-v3-french')"

Faster Whisper

Faster Whisper reimplements Whisper on top of the CTranslate2 inference engine, accelerating transcription while reducing memory consumption. Install it with pip:

pip install faster-whisper
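
faster-whisper expects weights in CTranslate2 format. A minimal sketch, assuming a converted copy of the checkpoint sits at ./models/whisper-large-v3-french-ct2 (you can produce one with CTranslate2’s ct2-transformers-converter tool if the repository does not ship a conversion):

from faster_whisper import WhisperModel

# Path to a CTranslate2 conversion of the checkpoint (assumed location)
model = WhisperModel("./models/whisper-large-v3-french-ct2", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.wav", language="fr")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")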

Whisper.cpp

For those keen on a lower-level implementation, Whisper.cpp runs the model in plain C/C++ with no heavyweight dependencies. You need to build the repository first, then run its CLI on GGML-format weights.
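
A rough sketch of the workflow; the GGML weight file name is an assumption, so adapt it to whatever the model page actually ships:

# Build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Transcribe a 16 kHz WAV file (GGML weight file name assumed)
./main -m ./models/whisper-large-v3-french-ggml.bin -l fr -f audio.wav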

Candle

Candle, Hugging Face’s lightweight Rust ML framework, can also run Whisper. Clone the Candle repository and follow its whisper example for audio transcription.
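
A rough sketch under the assumption that Candle’s bundled whisper example covers your use case (flags can change between versions, so check the example’s README):

git clone https://github.com/huggingface/candle
cd candle
# Run the whisper example; see candle-examples/examples/whisper for options
cargo run --example whisper --release -- --input audio.wav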

MLX

The MLX implementation is great for Apple silicon users. Clone the mlx-examples repository (https://github.com/ml-explore/mlx-examples) and use its whisper example to transcribe audio efficiently.
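
A rough sketch of the setup; the transcription call follows the whisper example’s README, which also documents any model-conversion step:

git clone https://github.com/ml-explore/mlx-examples
cd mlx-examples/whisper
pip install -r requirements.txt
# Then, from this directory, per the example's README:
# python -c "import whisper; print(whisper.transcribe('audio.wav')['text'])"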

Training Details

The model was trained on a diverse corpus of over 2,500 hours of French speech, drawn from datasets including Common Voice 13.0 and Multilingual LibriSpeech, among others. This breadth of training data underpins the robustness and accuracy reported above.

Acknowledgements

  • OpenAI for creating and open-sourcing the Whisper model
  • Hugging Face for integrating the model and providing the necessary training framework within the Transformers repository
  • Genci for their incredible contribution of GPU hours

Troubleshooting

If you encounter any issues while using the Whisper-Large-V3-French model, here are some troubleshooting tips:

  • Check if all necessary libraries are installed and up to date.
  • Ensure that your audio files are in a supported format and sampled at 16 kHz; unsupported input can lead to errors (see the resampling sketch after this list).
  • If you experience performance issues, try optimizing your hardware configuration or consider using a more powerful GPU.
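
For the audio-format issue in particular, resampling to 16 kHz mono usually resolves it. A minimal sketch using librosa (any resampling library works), fed into the pipe object from the pipeline section:

import librosa

# Whisper expects 16 kHz mono; librosa resamples and downmixes on load
audio_array, sampling_rate = librosa.load("audio.wav", sr=16000, mono=True)
result = pipe({"raw": audio_array, "sampling_rate": sampling_rate})
print(result["text"])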

For detailed insights or collaboration opportunities, don’t hesitate to connect with **[fxis.ai](https://fxis.ai)**.

At **[fxis.ai](https://fxis.ai)**, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
