How to Use the Whisper-Large-V2-Japanese Model for Transcription

Mar 3, 2023 | Educational

The Whisper-Large-V2-Japanese-5k-steps model is built for Japanese speech transcription and was trained on the CommonVoice dataset. This guide walks you through setup and usage for efficient transcription, with troubleshooting tips to help you along the way.

Understanding the Model

This particular model is a fine-tuned version of openai/whisper-large-v2 trained for 5,000 steps on Japanese data. Note that because of this limited training, the transcriptions may not always be fully satisfactory.

Model and Data Specifications

  • License: Apache-2.0
  • Training data: CommonVoice version 11 train split
  • Validation data: CommonVoice version 11 validation split
  • Test data: CommonVoice version 11 test split
  • Reported loss: 0.4200
  • Word error rate (WER): 0.7449

How to Run the Model

To perform transcription using the model, you will need to follow these steps:

  1. Set up your environment with the necessary libraries (e.g., Transformers, Datasets); a typical install command is shown just after this list.
  2. Load the Whisper model and associated processing tools.
  3. Prepare your dataset for transcription.
  4. Generate the transcription and evaluate its performance.
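
If you are starting from a clean environment, the imports used in the examples below suggest an install along these lines. This is only a sketch: exact packages and versions depend on your setup, and librosa/soundfile are assumed here as the audio-decoding backends for the Datasets library.

pip install torch transformers datasets evaluate
pip install librosa soundfile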

Transcription Code Example

Below is a simple example of how to set up the model and create a transcription:

from datasets import load_dataset, Audio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps").to(device)

# Load a sample from the CommonVoice Japanese validation split (streaming, so nothing
# is downloaded up front). Note: this dataset is gated on the Hugging Face Hub, so you
# may need to accept its terms and log in first.
commonvoice_eval = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="validation", streaming=True)
commonvoice_eval = commonvoice_eval.cast_column("audio", Audio(sampling_rate=16000))  # resample to Whisper's 16 kHz
sample = next(iter(commonvoice_eval))["audio"]

# Extract log-Mel input features and generate token ids
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device))

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
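
The example above pulls a sample from CommonVoice, but in practice you will often want to transcribe your own recordings. Below is a minimal sketch, assuming a local file at speech.wav (a hypothetical path) and that librosa is installed for loading and resampling. It also pins the language and task via the standard Whisper decoder-prompt API in Transformers, which can help if output comes back in the wrong language:

import librosa

# Load a local recording, resampled to the 16 kHz rate Whisper expects
speech, sr = librosa.load("speech.wav", sr=16000)

# Optionally pin the language and task; the fine-tuned checkpoint usually
# defaults to Japanese, so this is a safeguard rather than a requirement
forced_decoder_ids = processor.get_decoder_prompt_ids(language="japanese", task="transcribe")

input_features = processor(speech, sampling_rate=sr, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device), forced_decoder_ids=forced_decoder_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])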

Evaluating the Model

To evaluate the transcription output, you can use the word error rate (WER) metric. It measures the number of errors in the transcribed text compared to the actual text, which is crucial for assessing model performance.
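
WER is the number of word-level substitutions, deletions, and insertions divided by the number of words in the reference. As a quick illustration of the mechanics with the evaluate library (toy English strings; note that Japanese is written without spaces, so WER values there depend on how the text is segmented):

import evaluate

wer_metric = evaluate.load("wer")

# One substitution ("slept" -> "sat") in a three-word reference -> WER = 1/3
score = wer_metric.compute(predictions=["the cat sat"], references=["the cat slept"])
print(score)  # ~0.333

The full evaluation over the CommonVoice test split then looks like this: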

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from datasets import load_dataset, Audio
import evaluate
import torch

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model and processor (same checkpoint as above)
processor = WhisperProcessor.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps")
model = WhisperForConditionalGeneration.from_pretrained("clu-ling/whisper-large-v2-japanese-5k-steps").to(device)

# Metric and a basic text normalizer, applied to both references and predictions
wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()

# Load the test dataset, resampled to Whisper's 16 kHz
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# For debugging on a tiny slice
# dataset = dataset.shard(num_shards=7000, index=0)

def map_wer(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"], return_tensors="pt").input_features
    with torch.no_grad():
        generated_ids = model.generate(inputs.to(device))
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    batch["predicted_text"] = normalizer(transcription)
    batch["gold_text"] = normalizer(batch["sentence"])  # CommonVoice keeps the reference text in the "sentence" column
    return batch

# Transcribe every example, then compute the corpus-level WER
predicted = dataset.map(map_wer)
wer = wer_metric.compute(references=predicted["gold_text"], predictions=predicted["predicted_text"])
wer = round(100 * wer, 2)
print("WER:", wer)

Troubleshooting

If you encounter issues while using the Whisper model, consider the following troubleshooting tips:

  • Ensure that your environment has all the necessary libraries installed; the pip command near the top of this guide covers the packages used here.
  • Check your device settings. Make sure the code is using your GPU when one is available, or runs on CPU otherwise; a quick diagnostic is shown after this list.
  • If an error occurs while loading data, verify your paths and make sure you have access to the CommonVoice dataset (it is gated on the Hugging Face Hub).
  • Revisit your batch size; lowering it, or processing clips one at a time as in the examples above, often resolves out-of-memory errors and performance bottlenecks.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
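
A quick diagnostic to confirm which device PyTorch will use:

import torch

print(torch.cuda.is_available())           # True if a CUDA-capable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the first GPU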

Conclusion

The Whisper-Large-V2 model for Japanese transcription is a robust tool for anyone working with the CommonVoice dataset. As advancements continue, understanding the model's capabilities and limitations will enhance your AI projects. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
