Automatic speech recognition (ASR) is a rapidly growing area of artificial intelligence that enables machines to convert spoken language into written text. One powerful model for ASR is Facebook's Wav2Vec2; the facebook/wav2vec2-base-960h checkpoint has been pretrained and fine-tuned on 960 hours of LibriSpeech audio. This blog will guide you through using the model to transcribe audio files and evaluate its performance.
Setting Up Wav2Vec2
Before diving into the transcribing process, ensure you have the necessary libraries installed. You will need the transformers and datasets libraries from Hugging Face, along with torch for running inference and jiwer for computing the Word Error Rate.
pip install transformers datasets torch jiwer
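To confirm the installation worked, you can check that all four libraries import cleanly:

python -c "import transformers, datasets, torch, jiwer; print('all imports OK')"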
Transcribing Audio Files
Think of Wav2Vec2 as a skilled interpreter fluent in both speech and writing: it listens to the audio, picks up on its nuances, and converts it into the transcript you are waiting for. Here's how to use it:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# Load the pretrained model and its processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load a small example dataset (already sampled at 16 kHz)
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Convert the raw waveform into model inputs
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# Run inference and pick the most likely token at each time step
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# Decode the predicted token IDs back into text
transcription = processor.batch_decode(predicted_ids)
print(transcription)
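If you want to transcribe your own recording instead of the demo dataset, you can load it with librosa, which resamples to the required 16 kHz on the fly. This is a minimal sketch reusing the processor and model from above; the file name my_audio.wav is a placeholder for your own file:

import librosa

# librosa resamples the file to 16 kHz while loading (my_audio.wav is hypothetical)
speech, rate = librosa.load("my_audio.wav", sr=16_000)

input_values = processor(speech, sampling_rate=16_000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1)))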
Evaluating the Model’s Performance
After transcribing, it's essential to evaluate the quality of your results. The Word Error Rate (WER) is the standard metric for this purpose: it counts the minimum number of word substitutions (S), deletions (D), and insertions (I) needed to turn the model's transcription into the reference text, divided by the number of words (N) in the reference, i.e. WER = (S + D + I) / N. A low WER indicates a high-quality transcription.
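As a quick sanity check, here is how jiwer computes WER on a toy pair of sentences (the example strings are made up for illustration):

from jiwer import wer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"

# One substitution ("a" for "the") out of six reference words gives a WER of about 0.167
print(wer(reference, hypothesis))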
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# Load the LibriSpeech "clean" test set (swap "clean" for "other" to evaluate the harder split)
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    audio = batch["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; keep the single transcription for this example
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

# Transcribe every example in the test set, one at a time
result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
Understanding the Results
Running the evaluation script on the "clean" configuration, and again after swapping in the "other" configuration, gives you one WER figure for each test set. For facebook/wav2vec2-base-960h, the reported results are:
- Clean WER: 3.4%
- Other WER: 8.6%
A lower WER signifies a more accurate transcription, providing confidence in the model’s performance.
Troubleshooting Tips
If you encounter any issues while using Wav2Vec2, here are some troubleshooting ideas:
- Ensure that the audio input is sampled at 16 kHz, since the model was trained on 16 kHz speech; a resampling sketch follows this list.
- Check your environment setup and confirm that all required libraries are properly installed.
- If using a GPU, verify that your CUDA environment is correctly configured.
- In case of unexpected results or errors, refer to the official documentation or community forums for solutions.
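If your audio is stored at a different sampling rate, the datasets library can resample it on the fly when decoding. A minimal sketch, assuming a dataset with an "audio" column (shown here with the dummy set used earlier):

from datasets import load_dataset, Audio

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Re-cast the audio column so every example is decoded at 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
print(ds[0]["audio"]["sampling_rate"])  # 16000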
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
