How to Utilize the wav2vec2-xls-r-300m-cv6-turkish ASR Model

Mar 26, 2022 | Educational

Automatic Speech Recognition (ASR) technology has come a long way in enhancing speech understanding in various languages. In this article, we will explore how to leverage the wav2vec2-xls-r-300m-cv6-turkish model, a fine-tuned ASR model specifically designed for the Turkish language. We’ll break down the steps for training, evaluation, and troubleshooting to ensure a user-friendly experience.

Model Description

The wav2vec2-xls-r-300m-cv6-turkish model is based on the foundational wav2vec2 architecture developed by Facebook. It has been fine-tuned on the Turkish language, providing remarkable accuracy in understanding and transcribing spoken Turkish.

Training and Evaluation Data

This model was fine-tuned using two primary datasets:

Common Voice 6.1 TR: All validated splits except the test split were utilized for training.
MediaSpeech.

Training Procedure

Custom pre-processing and loading steps were executed to handle the distinct characteristics of both datasets effectively. The GitHub repository wav2vec2-turkish served as a resource for this purpose.

Training Hyperparameters

Below are the hyperparameters used during the fine-tuning process:

Learning Rate: 2e-4
Number of Training Epochs: 10
Warmup Steps: 500
Freeze Feature Extractor: Yes
Mask Time Probability: 0.1
Mask Feature Probability: 0.1
Feature Projection Dropout: 0.05
Attention Dropout: 0.05
Final Dropout: 0.1
Activation Dropout: 0.05
Batch Size (Train): 8
Batch Size (Eval): 8
Gradient Accumulation Steps: 8

Framework Versions

Transformers: 4.17.0.dev0
Pytorch: 1.10.1
Datasets: 1.18.3
Tokenizers: 0.10.3

Language Model

An N-gram language model was created using Turkish Wikipedia articles, utilizing KenLM. The ngram-lm-wiki repository assisted in generating the ARPA Language Model and converting it into a binary format.

Evaluation Commands

Before running evaluations, ensure to install the unicode_tr package, which is essential for Turkish text processing.

Follow these commands to evaluate the model:

bash python eval.py --model_id mpoyrazwav2vec2-xls-r-300m-cv6-turkish --dataset common_voice --config tr --split test

bash python eval.py --model_id mpoyrazwav2vec2-xls-r-300m-cv6-turkish --dataset speech-recognition-community-v2dev_data --config tr --split validation --chunk_length_s 5.0 --stride_length_s 1.0

Evaluation Results

The following metrics highlight the model’s performance:

Dataset	WER	CER
Common Voice 6.1 TR test split	8.83	2.37
Speech Recognition Community dev data	32.81	11.22

Troubleshooting Tips

If you encounter issues while working with this ASR model, consider the following troubleshooting steps:

Ensure all required packages are installed, especially unicode_tr.
Double-check your command syntax for evaluations to ensure everything is typed correctly.
Be mindful of environment dependencies; the model works best with the specified framework versions.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox