How to Use SWRA (SWARA) for Automatic Speech Recognition

Aug 16, 2024 | Educational

Are you ready to dive into the exciting world of Automatic Speech Recognition (ASR) using the SWRA (SWARA) model? In this user-friendly guide, we’ll walk you through the steps to implement SWRA for transcribing speech into text. The model is trained on LibriSpeech, a widely used corpus of read English speech that serves as a standard benchmark for ASR systems.

What is SWRA?

SWRA (SWARA) is an end-to-end sequence-to-sequence transformer model designed for ASR: it takes audio features as input and directly generates the corresponding text, with no separate acoustic and language models to stitch together. Think of it as a highly skilled transcriber who not only hears what you say but also writes it down for you!

Getting Started

To get started with SWRA, here’s what you need to do:

  • Install Required Packages: You will need the torchaudio and sentencepiece packages to process audio features and tokenize your input. You can install them either as extra dependencies of transformers or separately (both options are shown in Step 1).
  • Load the Required Libraries: Import the necessary libraries to set up the model.

Step 1: Installing Packages

Start by installing the required packages. Run the following command in your terminal:

pip install "transformers[speech,sentencepiece]"
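
If you would rather install the dependencies separately, as mentioned above, the following equivalent command should also work:

pip install transformers torchaudio sentencepiece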

Step 2: Importing Required Libraries

Next, import the necessary libraries in your Python environment:

import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

Step 3: Loading the Model and Processor

Now, you’ll load the SWRA model and its processor:

model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swara-swara")
processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swara-swara")
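
Optionally, if you have a CUDA-capable GPU, you can move the model onto it for faster inference. This is a minimal sketch using standard PyTorch calls, not anything specific to SWRA; if you do this, remember to move the input features to the same device before calling generate:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # disable dropout; we only run inference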

Step 4: Preparing the Dataset and Transcribing

Load your dataset, prepare your audio input, and generate the transcript:

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
input_features = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_features  

generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
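
The example above transcribes a dummy LibriSpeech sample. To transcribe your own recording instead, you can load and resample it with torchaudio; this is a minimal sketch, and the file name below is just a placeholder:

import torchaudio

# Load a local audio file (placeholder path; assumes a mono recording)
waveform, sample_rate = torchaudio.load("my_recording.wav")

# The processor expects 16 kHz audio, so resample if necessary
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)

input_features = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt").input_features
generated_ids = model.generate(input_features=input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])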

Evaluating the Model

To evaluate the SWRA model on the LibriSpeech test data, you can run the model over the full test split and compute the Word Error Rate (WER) with the evaluate library:

from evaluate import load

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")  # change to "other" for the other test set
wer = load("wer")

model = Speech2TextForConditionalGeneration.from_pretrained("binarybardakshat/swara-swara").to("cuda")
processor = Speech2TextProcessor.from_pretrained("binarybardakshat/swara-swara", do_upper_case=True)

def map_to_pred(batch):
    # Extract log-mel input features and an attention mask from the raw audio
    features = processor(batch["audio"]["array"], sampling_rate=16000, padding=True, return_tensors="pt")
    input_features = features.input_features.to("cuda")
    attention_mask = features.attention_mask.to("cuda")

    # Generate token ids and decode them into the predicted transcript
    gen_tokens = model.generate(input_features=input_features, attention_mask=attention_mask)
    batch["transcription"] = processor.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer.compute(predictions=result["transcription"], references=result["text"]))

Troubleshooting

If you encounter issues while setting up or running the model, here are some troubleshooting tips:

  • Installation Problems: Ensure that you have Python and pip installed. Confirm that you run the installation commands in the correct environment (virtualenv or conda).
  • CUDA Issues: Make sure that your system supports CUDA if you’re using GPU acceleration, and check that your PyTorch build was compiled with CUDA support (a quick check is shown after this list).
  • Feature Extraction Errors: If the processor fails to extract audio features, double-check the audio format and make sure the audio is sampled at 16 kHz (resample it if necessary, as shown in Step 4).
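
A minimal GPU check, using standard PyTorch calls:

import torch

# True if PyTorch can see a CUDA-capable GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # which GPU will be used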

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With just a few steps, you can unleash the power of the SWRA model for automatic speech recognition. This scalable technology is transforming how we process spoken language, opening doors to new possibilities in various applications.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
