How to Use Wav2Vec2-Large-XLSR-53 for Dhivehi Speech Recognition

Aug 24, 2021 | Educational

Wav2Vec2-Large-XLSR-53 is a multilingual speech model that can be fine-tuned for automatic speech recognition (ASR) in a specific language. This post uses a checkpoint trained for Dhivehi with the Common Voice dataset and walks through setting it up, transcribing a few test clips, and evaluating the results.

Pre-requisites

Ensure your audio input is sampled at 16kHz.
Install required libraries: torch, torchaudio, datasets, and transformers.
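
Before running the model, it can help to confirm the sample rate of your own recordings. The sketch below is a minimal, standard-library-only check that reads the rate from a WAV header. It assumes uncompressed PCM WAV input (it does not handle MP3, the format Common Voice clips are distributed in), and the helper names wav_sample_rate, needs_resampling and write_silent_wav are hypothetical, introduced only for this illustration.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # the rate the pre-requisites ask for


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


def write_silent_wav(path, rate, n_frames=160):
    """Write a short mono 16-bit silent clip, only so there is a file to inspect."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


# Demo on two throwaway files: one at 48 kHz, one already at 16 kHz.
with tempfile.TemporaryDirectory() as tmp_dir:
    clip_48k = os.path.join(tmp_dir, "clip_48k.wav")
    clip_16k = os.path.join(tmp_dir, "clip_16k.wav")
    write_silent_wav(clip_48k, 48_000)
    write_silent_wav(clip_16k, 16_000)
    rate_48k = wav_sample_rate(clip_48k)
    resample_48k = needs_resampling(clip_48k)
    resample_16k = needs_resampling(clip_16k)

print(rate_48k, resample_48k, resample_16k)  # 48000 True False
```

A file that reports anything other than 16000 should be resampled before it is passed to the processor.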

How to Use the Model

To begin using the Wav2Vec2 model for speech recognition, follow these detailed steps:

1. Import Necessary Libraries

First, import the libraries listed in the pre-requisites:


import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

2. Load the Test Dataset

The next step is to load a small slice (the first 2%) of the Common Voice test split for Dhivehi:


test_dataset = load_dataset('common_voice', 'dv', split='test[:2%]')

3. Prepare the Processor and Model

Now, let’s prepare the processor and load the Wav2Vec2 model:


processor = Wav2Vec2Processor.from_pretrained('shahukareem/wav2vec2-large-xlsr-53-dhivehi-v2')
model = Wav2Vec2ForCTC.from_pretrained('shahukareem/wav2vec2-large-xlsr-53-dhivehi-v2')

4. Preprocess the Audio Files

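In this step, we create a resampler and define a function that loads each audio file and resamples it to 16kHz:


# The clips are assumed to be recorded at 48 kHz; adjust the first argument if yours differ.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    # Load one clip from disk and store it as a 1-D NumPy array at 16 kHz.
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)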

5. Make Predictions

Using the model, we will make predictions based on the processed inputs:


inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)

# Output the predictions
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset['sentence'][:2])
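
The argmax-then-decode step above can be pictured as greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, drop the blank token, and turn the word delimiter into spaces. The sketch below is a simplified illustration of that idea, not the library's implementation. The toy vocabulary, the frame ids, and the function name ctc_greedy_decode are assumptions made up for this example; the real tokenizer has its own vocabulary and special-token handling.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token ids into text using greedy CTC rules."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Rule 1: consecutive repeats of the same id count as one emission.
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    # Rule 2: the blank token separates emissions but produces no text.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Rule 3: the word delimiter token becomes a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary: id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}

# Pretend these are the argmax ids for 12 audio frames.
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0, 0, 2]

decoded = ctc_greedy_decode(frame_ids, toy_vocab)
print(decoded)  # aab ba
```

Note how the blank between the first two "a" frames is what allows a doubled letter to survive the repeat-merging step.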

Evaluation of the Model

You can evaluate the model’s performance on the Dhivehi test data using the following code:


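# load_metric ships with older datasets releases; newer setups use the separate evaluate package.
from datasets import load_dataset, load_metric

wer = load_metric('wer')
test_dataset = load_dataset('common_voice', 'dv', split='test')
# (additional preprocessing steps similar to above...)
# NOTE: evaluate is not defined in this post. It must run the model on each batch
# and store the decoded transcriptions in batch['pred_strings'].
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER:", '{:.2f}'.format(100 * wer.compute(predictions=result['pred_strings'], references=result['sentence'])))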
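
To make the reported number easier to interpret, here is a minimal pure-Python sketch of word error rate (WER): the word-level edit distance (substitutions, deletions, insertions) summed over all sentences, divided by the total number of reference words. It is an illustration of the metric's definition, not the code the wer metric runs, and the function names and toy sentences are assumptions made up for this example.

```python
def word_edit_distance(reference_words, hypothesis_words):
    """Levenshtein distance between two word lists (row-by-row dynamic programming)."""
    previous_row = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Total word errors over all sentence pairs divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        reference_words = reference.split()
        total_errors += word_edit_distance(reference_words, prediction.split())
        total_words += len(reference_words)
    return total_errors / total_words


# Toy data: one substituted word and one missing word across 5 reference words.
references = ["the cat sat", "hello world"]
predictions = ["the cat sit", "hello"]

toy_wer = word_error_rate(references, predictions)
print("WER:", "{:.2f}".format(100 * toy_wer))  # WER: 40.00
```

A lower value is better; 0 means every predicted word matched the reference.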

Troubleshooting

In case you encounter issues, here are some handy troubleshooting ideas:

Ensure all libraries are installed and up to date.
Check that your audio input is sampled at, or has been resampled to, 16kHz.
If you get an error regarding tensor sizes, verify that padding is applied correctly.
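
To see why padding matters for tensor sizes, the sketch below pads clips of different lengths to a common length and builds the matching attention mask (1 for real samples, 0 for padding). It is a plain-Python illustration of what padding=True asks the processor to do conceptually, not the processor's actual code; the function name pad_batch and the sample values are assumptions made up for this example.

```python
def pad_batch(sequences, pad_value=0.0):
    """Right-pad every sequence to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_values = []
    attention_mask = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_values.append(list(seq) + [pad_value] * n_pad)
        # 1 marks real audio samples, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_values, attention_mask


# Two toy "clips" of different lengths.
clips = [[0.1, 0.2, 0.3], [0.5]]
padded_values, padded_mask = pad_batch(clips)

print(padded_values)  # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
print(padded_mask)    # [[1, 1, 1], [1, 0, 0]]
```

Without padding, the two clips could not be stacked into one rectangular tensor, which is a common source of size-mismatch errors.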

Conclusion

By following the steps in this post, you can load the Wav2Vec2-Large-XLSR-53 model for Dhivehi, transcribe audio clips with it, and measure its word error rate on the Common Voice test split. The same workflow applies to XLSR-53 checkpoints fine-tuned for other languages.
