In this blog post, we’ll explore how to use a fine-tuned Whisper-Medium model for Automatic Speech Recognition (ASR) in German. Fine-tuned on the Mozilla Common Voice 11.0 dataset, the model converts spoken German audio into text effectively.
Getting Started
Before diving into the code, make sure the necessary libraries are installed: torch, torchaudio, transformers, and datasets.
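Assuming a standard Python environment, an install along these lines should cover everything the examples below import (exact versions are up to you):

```shell
pip install torch torchaudio transformers datasets
```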
Model Overview
The Whisper-Medium model is trained to predict casing and punctuation, making it adept at transcribing spoken language accurately. It achieves a Word Error Rate (WER) of 7.05 on the Common Voice 11.0 dataset. Think of it as a highly-trained assistant at a transcription service, skilled in capturing the nuances of spoken German just as a human would.
Performance
Below are the WER scores of several Whisper models (lower is better):
Common Voice 9.0 WERs:
- Whisper Small: 13.0
- Whisper Medium: 8.5
- Whisper Large V2: 6.4
Common Voice 11.0 WERs:
- Whisper Small CV11 German: 11.35
- Whisper Medium CV11 German: 7.05
- Whisper Large V2 CV11 German: 5.76
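For context on these numbers, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. Here is a minimal, self-contained sketch of the metric; note this is not the exact scoring script behind the table above, which typically also normalizes casing and punctuation first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("das ist ein test", "das ist kein test"))  # 0.25 (1 substitution / 4 words)
```

A WER of 7.05 therefore means roughly 7 word errors per 100 reference words.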
Usage Instructions
There are two primary methods to utilize the model for transcribing audio: via the Hugging Face 🤗 Pipeline and the low-level API. Here’s how to do both:
Using 🤗 Pipeline
Here’s a step-by-step guide:
import torch
from datasets import load_dataset
from transformers import pipeline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-german", device=device)
# Load dataset
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
audio = test_segment["audio"]  # dict with "array" and "sampling_rate", accepted directly by the pipeline
# Configure generation options: up to 225 new tokens, decoded with beam search
pipe.model.config.max_length = 225 + 1
pipe.model.config.num_beams = 5
# Run transcription
generated_sentences = pipe(audio)["text"]
Using 🤗 Low-Level API
This method gives you more granular control:
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-german", language="german", task="transcribe")
# Load dataset
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()  # Common Voice arrays are float64; cast to float32 for torchaudio
sample_rate = test_segment["audio"]["sampling_rate"]
# Resample if necessary
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)
# Prepare inputs and generate
inputs = processor(waveform, sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt")
input_features = inputs.input_features.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
generated_ids = model.generate(inputs=input_features, forced_decoder_ids=forced_decoder_ids, max_new_tokens=225)
# Decode the generated token ids into text
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
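Since the model predicts casing and punctuation, transcriptions are usually normalized before computing evaluation metrics such as WER. The transformers library ships a thorough Whisper normalizer; as an illustration of the idea, here is a minimal pure-Python sketch (lowercase, strip punctuation, collapse whitespace):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(normalize("Hallo, Welt! Wie geht's?"))  # hallo welt wie gehts
```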
Troubleshooting
If you encounter issues while running the model, consider the following troubleshooting ideas:
- Check that your audio is sampled at 16 kHz, the input rate Whisper expects; resample it if not.
- Ensure all libraries are correctly installed and up to date.
- Verify device compatibility; make sure your CUDA setup is functioning properly.
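To check the sampling rate of a local WAV file (the file path here is a placeholder), Python's standard-library wave module is enough:

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sampling rate of a WAV file in Hz."""
    with wave.open(path, "rb") as f:
        return f.getframerate()

# Anything other than 16000 Hz should be resampled before
# feeding the audio to the model (e.g. with torchaudio, as shown earlier).
```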
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With the Whisper-Medium model at your disposal, transcribing German speech becomes seamless and efficient. This model embodies the future of ASR technology, making it an invaluable tool for developers and researchers alike.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

