How to Use a Fine-Tuned Whisper Model for Automatic Speech Recognition in Portuguese

May 21, 2024 | Educational

Are you ready to unlock the full potential of Automatic Speech Recognition (ASR) with the power of OpenAI’s Whisper model? In this guide, we will walk you through loading a Whisper model that has already been fine-tuned, via PEFT adapters, for judicial contexts in the Portuguese language, and putting it to work transcribing audio. Get ready to dive into the world of ASR, where your audio inputs transform into transcribed text seamlessly!

Prerequisites: Setting Up Your Environment

Before we embark on this journey, it’s essential to have the required libraries in place. You can set them up with a series of simple commands:

!pip install transformers
!pip install einops accelerate bitsandbytes
!pip install sentence_transformers
!pip install pydub
!pip install git+https://github.com/huggingface/peft.git

Loading and Configuring the Model

Now that we have all dependencies installed, let’s load and configure our Whisper model for fine-tuning. Think of this step as preparing the canvas before painting; the cleaner and more detailed your setup is, the better the results will be.

from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, BitsAndBytesConfig

# Step 1: Define your task and language
task = "transcribe"
language = "portuguese"

# Step 2: Configure 8-bit quantization and locate the PEFT adapter
quant_config = BitsAndBytesConfig(load_in_8bit=True)
peft_model_id = "rhaymison/legal-whisper-portuguese-peft"
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Step 3: Load the quantized base model, then attach the fine-tuned adapter
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    quantization_config=quant_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
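
As a quick optional sanity check (our addition, not part of the original recipe), you can switch the model to inference mode and count its parameters to confirm everything loaded and the adapter attached:

# Optional sanity check: inference mode plus a simple parameter count
model.eval()
total_params = sum(p.numel() for p in model.parameters())
print(f"Loaded model with {total_params:,} parameters")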

Explaining the Code: An Analogy

Let’s break down the above code with an analogy. Imagine you’re a chef (the model) preparing a gourmet dish (transcribing audio). First, you gather your ingredients (dependencies) and get your kitchen organized (loading and configuring the model). You need to set the right ambiance (task and language) before you can begin cooking (transcribing audio). As you carefully select each ingredient per the recipe outlined (the various configuration steps), you ensure your dish will be a culinary masterpiece!

Loading the Processor and Preparing Audio

Once your model is set up, it’s time to configure the processor to handle input audio as well as prepare your audio file for processing. This step is essential to ensure your model can ‘understand’ the audio coming in.

from transformers import WhisperProcessor

# Load the audio processor
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)

# Whisper expects 16 kHz audio, so resample the file and save the converted copy
from pydub import AudioSegment

audio = AudioSegment.from_wav("content/audio.wav")
audio = audio.set_frame_rate(16000)
audio.export("content/z.wav", format="wav")
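
If your source recording is not already a mono WAV file, pydub can normalize it first. The snippet below is a hedged variation on the step above (the mp3 file name is a placeholder): it loads any format ffmpeg understands and collapses stereo to mono, which matches what Whisper's feature extractor expects.

from pydub import AudioSegment

# Illustrative only: "content/audio.mp3" is a placeholder input file
audio = AudioSegment.from_file("content/audio.mp3")
audio = audio.set_channels(1)        # collapse stereo to mono
audio = audio.set_frame_rate(16000)  # resample to 16 kHz for Whisper
audio.export("content/z.wav", format="wav")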

Creating the Pipeline

The next crucial step is to create a pipeline for automatic speech recognition. You can think of this as setting up the machinery in a factory; each piece must function together to produce the desired output efficiently.

import torch
from transformers import pipeline

# The model was loaded with device_map="auto", so it already sits on the GPU
# when one is available; the pipeline needs no explicit device argument here.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch.float16
)

Performing Transcription

Finally, it’s time for the moment of truth! You can now transcribe your audio content into text. This is where all your efforts come together as the model listens and understands the audio it receives.

transcription = pipe("content/z.wav", generate_kwargs={"language": "portuguese"})
print(transcription)
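
Because the pipeline was built with return_timestamps=True, the result is a dictionary rather than a plain string. Here is a minimal sketch of how to unpack it (the "text" and "chunks" keys follow the transformers ASR pipeline convention):

# The full transcription lives under "text"; timestamped segments under "chunks"
print(transcription["text"])

for chunk in transcription.get("chunks", []):
    start, end = chunk["timestamp"]
    print(f"[{start} - {end}] {chunk['text']}")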

Troubleshooting Tips

If you encounter any issues while following these steps, here are some troubleshooting ideas:

  • Ensure all libraries are correctly installed. Missing dependencies can lead to import errors.
  • Check that your audio file path is correct and the audio format is supported.
  • If the model performs poorly, try switching between 4-bit and 8-bit quantization configurations to see which yields better results (a sketch of a 4-bit setup follows this list).
  • Monitor your GPU and CPU usage; heavy processing could slow down your system.
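
For the quantization tip above, a 4-bit setup could look like the sketch below. This is our suggestion rather than part of the original walkthrough; the options shown are standard BitsAndBytesConfig parameters, and model_4bit is a hypothetical name.

import torch
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# A possible 4-bit alternative to the 8-bit configuration used earlier
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normalized float-4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)

model_4bit = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    quantization_config=nf4_config,
    device_map="auto",
)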

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
