Creating a Whisper Medium Romanian Model: A Step-by-Step Guide

Sep 16, 2023 | Educational

If you’re looking to harness the power of AI for automatic speech recognition in the Romanian language, you’re in for a treat with the Whisper Medium model! This blog will walk you through the process of using this fine-tuned model, provide insights on its architecture, and address some common issues you might encounter.

Understanding the Whisper Medium Romanian Model

The Whisper Medium Romanian model is based on the openai/whisper-medium architecture. It has been fine-tuned on two key datasets: the Common Voice 11.0 dataset and the Romanian speech synthesis corpus. The standout result is the model's low Word Error Rate (WER): just 4.73 on the evaluation set!
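As a quick refresher, WER is the word-level edit distance (substitutions, insertions, and deletions) between the model's output and the reference transcript, divided by the number of reference words. The tiny `wer` function below is an illustrative sketch of that definition, not part of the model card:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four -> WER of 0.25
print(wer("buna ziua tuturor prieteni", "buna ziua tuturor prietenii"))  # 0.25
```

In practice you would use a maintained implementation (for example the `evaluate` or `jiwer` packages), but the arithmetic is exactly this.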

Setting Up the Environment

Before diving into the code, ensure you have the following frameworks and versions set up:

  • Transformers: 4.26.0.dev0
  • PyTorch: 1.13.0+cu117
  • Datasets: 2.7.1.dev0
  • Tokenizers: 0.13.2
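Exact versions matter here, since the model card pins development builds of Transformers and Datasets. A small stdlib-only helper (illustrative, not required by the model) can tell you what is currently installed before you run anything:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

print(installed_versions(["transformers", "torch", "datasets", "tokenizers"]))
```

Any `None` in the output means the package still needs to be installed.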

Using the Whisper Medium Romanian Model

Now, let’s proceed to the code, which acts like a recipe for a delicious dish—each ingredient (line of code) contributes to the final outcome!

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset
import torch

# Load the fine-tuned model and its processor from the Hugging Face Hub
processor = WhisperProcessor.from_pretrained('gigant/whisper-medium-romanian')
model = WhisperForConditionalGeneration.from_pretrained('gigant/whisper-medium-romanian')

# Stream the Romanian test split of Common Voice and resample the audio to 16 kHz
ds = load_dataset('common_voice', 'ro', split='test', streaming=True)
ds = ds.cast_column('audio', Audio(sampling_rate=16_000))
input_speech = next(iter(ds))['audio']['array']

# Force Romanian transcription (rather than translation or language detection)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language='ro', task='transcribe')

# Convert raw audio to log-Mel input features, then generate token IDs
input_features = processor(input_speech, return_tensors='pt', sampling_rate=16_000).input_features
predicted_ids = model.generate(input_features, max_length=448)

# Decode the token IDs back into text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```

In the analogy of cooking, think of the imports as your kitchen tools, essential in every recipe. The processor and model act like the chef and the sous-chef, respectively, preparing your input speech. The dataset is like fresh ingredients from the market, collected to create a delightful dish; in this case, the transcription! The final transcription is the meal that completes your culinary project.

Training Procedure

While the Whisper model is powerful out-of-the-box, knowing its training hyperparameters can help tailor it to your specific needs. Here are the vital stats:

  • Learning Rate: 1e-05
  • Training Batch Size: 32
  • Evaluation Batch Size: 32
  • Seed: 42
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • LR Scheduler Type: linear
  • LR Scheduler Warmup Steps: 500
  • Training Steps: 5000
  • Mixed Precision Training: Native AMP
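To make the scheduler settings concrete, here is a sketch of the learning-rate curve those numbers describe: a linear warmup over the first 500 steps up to the peak of 1e-05, followed by a linear decay to zero over the remaining 4,500 of the 5,000 training steps. The function is illustrative, not the actual training script:

```python
LEARNING_RATE = 1e-5    # peak learning rate from the model card
WARMUP_STEPS = 500      # LR scheduler warmup steps
TRAINING_STEPS = 5000   # total training steps

def linear_warmup_lr(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    # Decay linearly from the peak at WARMUP_STEPS to zero at TRAINING_STEPS
    return LEARNING_RATE * (TRAINING_STEPS - step) / (TRAINING_STEPS - WARMUP_STEPS)

print(linear_warmup_lr(250))   # halfway through warmup: 5e-06
print(linear_warmup_lr(500))   # peak: 1e-05
print(linear_warmup_lr(5000))  # end of training: 0.0
```

In the Transformers library, this shape corresponds to the linear scheduler that `Seq2SeqTrainingArguments` configures via `warmup_steps` and `max_steps`.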

Troubleshooting Common Issues

Even the best recipes can go awry! Here are some common issues and how to fix them:

  • Problem: Model not loading correctly.
    Solution: Ensure you have the correct model names and check your internet connection during the download.
  • Problem: Poor transcription quality.
    Solution: Experiment with different datasets or adjust training hyperparameters for better accuracy.
  • Problem: Runtime errors during execution.
    Solution: Make sure library versions are compatible with your code. Check for updates if necessary.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox