How to Use the Fine-tuned Whisper Model for Russian ASR

Jul 6, 2024 | Educational

If you’re looking to improve your Automatic Speech Recognition (ASR) results for Russian, you’re in the right place. This guide walks through how to use a version of OpenAI’s Whisper model that has been fine-tuned specifically for Russian speech.

Model Overview

The model is a fine-tuned version of openai/whisper-large-v3, trained on the Russian subset of the Common Voice 17.0 dataset (over 200,000 rows). Fine-tuning took more than 60 hours on two Tesla A100 GPUs and lowered the Word Error Rate (WER) from 9.84 to 6.39.
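
To put those numbers in context, here is a minimal sketch of how WER is usually computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function names are our own, and this is not the evaluation script used for the model; real evaluations typically also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (baseline - improved) / baseline * 100


# The WER figures quoted above correspond to roughly a 35% relative reduction.
print(round(relative_reduction(9.84, 6.39), 1))
```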

Usage Instructions

Proper preprocessing is key to getting the best results from the ASR model, so take care of it before running any code. Here’s how to go about it:

1. Preprocess Your Audio Records

Start by normalizing the volume of your audio recordings. You can do this using SoX (Sound eXchange) with the command shown below:

bash
sox record.wav -r 16k record-normalized.wav norm -0.5 compand 0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2
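
If you have many recordings, you may prefer to drive SoX from Python. The sketch below wraps the same command in a small helper and applies it to every WAV file in a folder. It assumes the sox binary is on your PATH; the function names and the "-normalized" output naming are our own choices, not part of the model’s documentation.

```python
import subprocess
from pathlib import Path


def build_sox_command(src, dst):
    """Return the SoX command from this guide as an argument list."""
    return [
        "sox", str(src), "-r", "16k", str(dst),
        "norm", "-0.5",
        "compand", "0.3,1", "-90,-90,-70,-70,-60,-20,0,0", "-5", "0", "0.2",
    ]


def normalize_directory(in_dir, out_dir):
    """Normalize every .wav file in in_dir and write the results to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob("*.wav")):
        dst = out / f"{src.stem}-normalized.wav"
        # check=True raises CalledProcessError if SoX exits with an error
        subprocess.run(build_sox_command(src, dst), check=True)
        written.append(dst)
    return written


# Example (hypothetical folder names):
# normalize_directory("recordings", "recordings-normalized")
```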

2. Set Up Your ASR Pipeline

Next, set up your ASR code using the following Python script:

python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16  # set your preferred type here
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
    # monkey patch: make torch.distributed report that it is not initialized
    setattr(torch.distributed, "is_initialized", lambda: False)

device = torch.device(device)

whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Read the WAV file as raw bytes. The pipeline also accepts a file path or a
# NumPy array; decoding raw bytes requires ffmpeg to be installed.
with open("record-normalized.wav", "rb") as f:
    wav = f.read()

# get the transcription
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256, "return_timestamps": False})
print(asr["text"])
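
The pipeline above is created with return_timestamps=True. When timestamps are enabled, the result dictionary is expected to contain a "chunks" list next to "text", where each chunk has a "timestamp" pair and its own "text". The helper below is a small sketch (the name format_chunks is ours) that turns such a result into readable lines. It returns an empty list when no chunks are present, and it tolerates a missing end time on the final chunk.

```python
def format_chunks(result):
    """Format timestamped chunks from an ASR result as '[start - end] text' lines."""

    def fmt(seconds):
        # The end time of the last chunk may be None
        return "?" if seconds is None else f"{seconds:.2f}s"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {chunk['text'].strip()}")
    return lines


# Hand-written example in the shape described above (not real model output)
example = {
    "text": " привет мир",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " привет"},
        {"timestamp": (2.5, None), "text": " мир"},
    ],
}
for line in format_chunks(example):
    print(line)
```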

Understanding the Code Through an Analogy

Imagine that you’re a chef preparing a complicated dish. Each ingredient needs to be measured and mixed perfectly to get the desired outcome. In our case, the ingredients represent the various steps you follow in the code:

  • Gathering Ingredients: Importing libraries like torch and transformers is akin to gathering your spices and condiments.
  • Preparation: The preprocessing step is like chopping your vegetables; it’s essential before tossing everything into the pot.
  • Cooking: Setting up the ASR pipeline can be compared to setting the right temperature for cooking; you need to adjust to get optimal results!
  • Serving: Finally, reading the WAV file and getting the transcription is like plating your dish for presentation.

Troubleshooting

While working with the ASR model, you might encounter some issues. Here are a few troubleshooting tips:

  • Poor or empty transcriptions: Double-check the preprocessing step and confirm that the audio was normalized and resampled to 16 kHz.
  • Memory errors: Reduce batch_size in the pipeline setup.
  • GPU not used: Confirm that PyTorch detects your GPU (torch.cuda.is_available() should return True). If it does not, the script above falls back to MPS or the CPU.
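
For the memory tip, one option is to retry automatically with a smaller batch. The sketch below is an assumption-laden illustration rather than part of the model’s documentation: transcribe_with_backoff is a hypothetical helper, it assumes the callable accepts a batch_size keyword at call time (Transformers pipelines generally do), and it detects out-of-memory failures by looking for "out of memory" in a RuntimeError message.

```python
def transcribe_with_backoff(transcribe, audio, batch_size=16, min_batch_size=1):
    """Call transcribe(audio, batch_size=...), halving the batch size on OOM errors."""
    while True:
        try:
            return transcribe(audio, batch_size=batch_size)
        except RuntimeError as err:
            out_of_memory = "out of memory" in str(err).lower()
            if not out_of_memory or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to reduce
            batch_size = max(min_batch_size, batch_size // 2)


# Example with the pipeline from this guide (not executed here):
# asr = transcribe_with_backoff(asr_pipeline, wav, batch_size=16)
```

If the error persists even at a batch size of 1, a smaller dtype or running on the CPU may be the remaining options.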

Work in Progress

This model is still in its development phase, aimed at refining its ability to handle phone call speech recognition effectively. If you’re passionate about this project and have a suitable dataset, your contributions will be greatly appreciated!

Conclusion

By following this guide, you can effectively harness the power of the Whisper model tailored for Russian language ASR. As you continue to experiment and innovate, remember that at fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
