How to Use Wav2Vec2-Large-XLSR-53 for Automatic Speech Recognition in Turkish

Want to explore the world of automatic speech recognition? Using the Wav2Vec2-Large-XLSR-53 model for Turkish language processing is a fantastic way to delve into this exciting technology. Below, we’ll take you through the steps necessary to set up and utilize this powerful deep-learning tool.

Understanding the Model Architecture

Before we jump into the how-to, let’s understand what the Wav2Vec2 model does. Think of it like a seasoned barista (the model) who is trained to take your order (your speech) and prepare it perfectly (convert it into text). Just as the barista needs to recognize orders in many forms (different pronunciations and accents), the Wav2Vec2 model is pretrained on a vast multilingual dataset and fine-tuned to transcribe spoken Turkish accurately.
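
To make the analogy concrete, here is the pipeline in miniature: raw audio goes in, the model produces per-time-step character logits, and greedy CTC decoding turns those into text. This is a minimal sketch (the one-second silent waveform is just a stand-in for real speech); the full, runnable walkthrough follows below.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained('ozcangundes/wav2vec2-large-xlsr-53-turkish')
model = Wav2Vec2ForCTC.from_pretrained('ozcangundes/wav2vec2-large-xlsr-53-turkish')

# A dummy one-second clip at 16 kHz stands in for real speech
waveform = torch.zeros(16_000)
inputs = processor(waveform, sampling_rate=16_000, return_tensors='pt')

with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, time_steps, vocab_size)

# Greedy CTC decoding: take the most likely character at each step;
# batch_decode collapses repeats and blank tokens into the final string
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))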

Setting Up Your Environment

To get started, make sure you have the required libraries installed. Here’s how to set up your environment (a quick sanity check follows the list):

  • Install PyTorch: pip install torch
  • Install Torchaudio: pip install torchaudio
  • Install the Hugging Face Transformers and Datasets: pip install transformers datasets
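
Once everything is installed, a short sanity check confirms that the packages import cleanly and shows which versions you are running:

import torch
import torchaudio
import transformers
import datasets

# Print versions to confirm the environment is ready
print('torch:', torch.__version__)
print('torchaudio:', torchaudio.__version__)
print('transformers:', transformers.__version__)
print('datasets:', datasets.__version__)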

Using the Model

Now that your environment is ready, let’s use the model to process some speech input. Here’s the code snippet you can use:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice (2%) of the Turkish Common Voice test split
test_dataset = load_dataset('common_voice', 'tr', split='test[:2%]')
processor = Wav2Vec2Processor.from_pretrained('ozcangundes/wav2vec2-large-xlsr-53-turkish')
model = Wav2Vec2ForCTC.from_pretrained('ozcangundes/wav2vec2-large-xlsr-53-turkish')

# Common Voice audio is recorded at 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: read each audio file and resample it to a 16 kHz array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset['speech'][:2], sampling_rate=16_000, return_tensors='pt', padding=True)

# Run inference without gradient tracking, then take the most likely token per time step
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset['sentence'][:2])

Evaluating the Model

Once the speech has been processed, it’s time to evaluate the model. Below is a handy code snippet to help you assess its performance, measured as word error rate (WER), on the Turkish test data from the Common Voice dataset:

import re
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load dataset and metric
test_dataset = load_dataset('common_voice', 'tr', split='test')
wer = load_metric('wer')

processor = Wav2Vec2Processor.from_pretrained('ozcangundes/wav2vec2-large-xlsr-53-turkish')
model = Wav2Vec2ForCTC.from_pretrained('ozcangundes/wav2vec2-large-xlsr-53-turkish')
model.to('cuda')  # evaluating the full test split is impractical on CPU

# Punctuation to strip from references so WER reflects words, not symbols
chars_to_ignore_regex = '[,?.!-;:“%‘”’]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Normalize the reference text (strip punctuation, lowercase) and load/resample the audio
def speech_file_to_array_fn(batch):
    batch['sentence'] = re.sub(chars_to_ignore_regex, '', batch['sentence']).lower()
    speech_array, sampling_rate = torchaudio.load(batch['path'])
    batch['speech'] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Transcribe a batch of clips and store the predictions for WER scoring
def evaluate(batch):
    inputs = processor(batch['speech'], sampling_rate=16_000, return_tensors='pt', padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to('cuda'), attention_mask=inputs.attention_mask.to('cuda')).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch['pred_strings'] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER:", 100 * wer.compute(predictions=result['pred_strings'], references=result['sentence']))

Training the Model

For those interested in how this model was trained: it was fine-tuned on the Common Voice Turkish train and validation splits. You can view the script used for training here.
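
Loading that training data uses the same datasets API as above. The following is a sketch of the standard split syntax, not the author’s exact training script:

from datasets import load_dataset

# Combine the Turkish train and validation splits, as used for fine-tuning
train_dataset = load_dataset('common_voice', 'tr', split='train+validation')
print(train_dataset)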

Troubleshooting Tips

If you encounter any issues during your setup or execution of the code, here are a few troubleshooting tips:

  • Model Not Found: Double-check that the model path is correct in your code. Sometimes a simple typo can lead to this error.
  • Audio Quality Issues: Ensure that the audio input is of good quality and sampled at 16 kHz, since that is the format the model expects (see the quick check after this list).
  • Dependencies Missing: If you receive errors regarding missing libraries, ensure you’ve installed all required dependencies as mentioned above.
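
If you are unsure what rate a clip was recorded at, a quick check like the one below can help. This is an illustrative sketch; 'audio.mp3' is a placeholder path, not a file from the dataset:

import torchaudio

speech_array, sampling_rate = torchaudio.load('audio.mp3')  # placeholder path
print('original sample rate:', sampling_rate)

# Resample to the 16 kHz the model expects, if necessary
if sampling_rate != 16_000:
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)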

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Leveraging the capabilities of Wav2Vec2-Large-XLSR-53 for Turkish speech recognition opens up a world of possibilities in natural language processing. With the instructions provided above, you’re now equipped to start experimenting with automatic speech recognition.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
