In this blog post, we’ll explore how to use a fine-tuned Whisper-Medium model for Automatic Speech Recognition (ASR) in German. Fine-tuned on the Mozilla Common Voice 11.0 dataset, the model converts spoken German audio into text effectively.
Getting Started
Before diving into the code, make sure the necessary libraries are installed: torch, torchaudio, transformers, and datasets.
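Assuming a standard Python environment, an install along these lines should cover everything the examples below import (exact versions are up to you):

```shell
pip install torch torchaudio transformers datasets
```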
Model Overview
The Whisper-Medium model is trained to predict casing and punctuation, making it adept at transcribing spoken language accurately. It achieves a Word Error Rate (WER) of 7.05 on the Common Voice 11.0 dataset. Think of it as a highly-trained assistant at a transcription service, skilled in capturing the nuances of spoken German just as a human would.
Performance
Below are the WER scores of several Whisper models (lower is better):
Common Voice 9.0 WERs:
- Whisper Small: 13.0
- Whisper Medium: 8.5
- Whisper Large V2: 6.4
Common Voice 11.0 WERs:
- Whisper Small CV11 German: 11.35
- Whisper Medium CV11 German: 7.05
- Whisper Large V2 CV11 German: 5.76
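For context on these numbers, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. Here is a minimal, self-contained sketch of the metric; note this is not the exact scoring script behind the table above, which typically also normalizes casing and punctuation first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("das ist ein test", "das ist kein test"))  # 0.25 (1 substitution / 4 words)
```

A WER of 7.05 therefore means roughly 7 word errors per 100 reference words.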
Usage Instructions
There are two primary methods to utilize the model for transcribing audio: via the Hugging Face 🤗 Pipeline and the low-level API. Here’s how to do both:
Using 🤗 Pipeline
Here’s a step-by-step guide:
import torch
from datasets import load_dataset
from transformers import pipeline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-german", device=device)
# Load dataset
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
audio = test_segment["audio"]  # dict with "array" and "sampling_rate", accepted directly by the pipeline
# Configure generation options: up to 225 new tokens, decoded with beam search
pipe.model.config.max_length = 225 + 1
pipe.model.config.num_beams = 5
# Run transcription
generated_sentences = pipe(audio)["text"]
Using 🤗 Low-Level API
This method gives you more granular control:
import torch
import torchaudio
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-german").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-german", language="german", task="transcribe")
# Load dataset
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "de", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()  # Common Voice arrays are float64; cast to float32 for torchaudio
sample_rate = test_segment["audio"]["sampling_rate"]
# Resample if necessary
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)
# Prepare inputs and generate
inputs = processor(waveform, sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt")
input_features = inputs.input_features.to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
generated_ids = model.generate(inputs=input_features, forced_decoder_ids=forced_decoder_ids, max_new_tokens=225)
# Decode the generated token ids into text
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
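Since the model predicts casing and punctuation, transcriptions are usually normalized before computing evaluation metrics such as WER. The transformers library ships a thorough Whisper normalizer; as an illustration of the idea, here is a minimal pure-Python sketch (lowercase, strip punctuation, collapse whitespace):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(normalize("Hallo, Welt! Wie geht's?"))  # hallo welt wie gehts
```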
Troubleshooting
If you encounter issues while running the model, consider the following troubleshooting ideas:
- Check that your audio is sampled at 16 kHz, the input rate Whisper expects; resample it if not.
- Ensure all libraries are correctly installed and up to date.
- Verify device compatibility; make sure your CUDA setup is functioning properly.
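To check the sampling rate of a local WAV file (the file path here is a placeholder), Python's standard-library wave module is enough:

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sampling rate of a WAV file in Hz."""
    with wave.open(path, "rb") as f:
        return f.getframerate()

# Anything other than 16000 Hz should be resampled before
# feeding the audio to the model (e.g. with torchaudio, as shown earlier).
```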
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With the Whisper-Medium model at your disposal, transcribing German speech becomes seamless and efficient. This model embodies the future of ASR technology, making it an invaluable tool for developers and researchers alike.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

