Aina Projects: Catalan Speech-to-Text Model

Aug 23, 2024 | Educational


In this post, we explore a model for Automatic Speech Recognition (ASR) in Catalan. Fine-tuned from a Spanish checkpoint, it transcribes Catalan audio into plain text. Let’s dive into how you can use this model effectively!

Model Description

This model transcribes audio samples in Catalan to lowercase text without punctuation. It was fine-tuned from a pre-trained Spanish model (stt-es-citrinet-512) on the Common Voice 11.0 dataset and has approximately 36.5 million parameters. Whether you’re looking to transcribe voice memos, lectures, or any other audio file in Catalan, this model is up for the task!

Intended Uses and Limitations

Transcribes audio files in Catalan to plain text.
No punctuation is included in the transcriptions.
Best suited for ASR projects focusing on the Catalan language.

However, keep in mind that the model has a limited grasp of context, so it may not capture nuances present in spoken language.
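Because the output is lowercase and unpunctuated, any text you compare it against (subtitles, reference transcripts) needs the same treatment first. Here is a minimal sketch of such a normalizer. The function name and the choice to keep apostrophes, the Catalan middle dot (as in "col·lecció"), and hyphens are assumptions for illustration; the exact character set the model emits is not stated here.

```python
import unicodedata

# Assumption: these characters are treated as part of a word and kept.
KEEP = {"'", "·", "-"}


def normalize_transcript(text: str) -> str:
    """Lowercase text and replace punctuation with spaces."""
    # Compose accents, lowercase, and unify curly apostrophes.
    text = unicodedata.normalize("NFC", text).lower().replace("’", "'")
    cleaned = []
    for ch in text:
        # Unicode categories starting with "P" are punctuation.
        if ch in KEEP or not unicodedata.category(ch).startswith("P"):
            cleaned.append(ch)
        else:
            cleaned.append(" ")
    # Collapse whitespace and drop tokens with no letters or digits (e.g. a lone "-").
    tokens = [t for t in "".join(cleaned).split() if any(c.isalnum() for c in t)]
    return " ".join(tokens)


print(normalize_transcript("Bon dia! Com estàs, Col·lecció?"))
# bon dia com estàs col·lecció
```

If your reference text follows different conventions, adjust the KEEP set rather than the loop.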

How to Use the Model

Setup Requirements

To get started, install the NeMo toolkit (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "nemo_toolkit[all]"

Clone the Repository

To download the model, clone its Hugging Face repository. The checkpoint is a large file, so make sure Git LFS is installed before cloning:

git clone https://huggingface.co/projecte-aina/stt-ca-citrinet-512

Transcribing Audio Files

Once you have the model downloaded, you can transcribe audio files by following these steps:


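import nemo.collections.asr as nemo_asr

# Placeholder: path to the .nemo checkpoint inside the cloned repository
NEMO_PATH = "path/to/model.nemo"

# Load the model
model = nemo_asr.models.EncDecCTCModel.restore_from(NEMO_PATH)

# List the paths of the audio files to transcribe
paths2audio_files = ["audio_1.wav", "audio_2.wav"]  # ..., "audio_n.wav"

# Set the batch size to whatever suits your hardware
batch_size = 8

# Transcribe the audio files
# (recent NeMo releases may expect `audio=` instead of `paths2audio_files=`)
transcriptions = model.transcribe(paths2audio_files=paths2audio_files, batch_size=batch_size)

# Print the transcriptions
print(transcriptions)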
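Typing out every path by hand gets tedious with more than a few files. The sketch below builds the list from a folder and pairs each file with its transcription. Both helper names are hypothetical, and the pairing assumes the results come back in the same order as the input paths.

```python
from pathlib import Path


def collect_wav_files(audio_dir: str) -> list[str]:
    """Return every .wav file under audio_dir (recursively) as sorted string paths."""
    root = Path(audio_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {audio_dir}")
    # The pattern is case-sensitive on most systems; add "*.WAV" if you need it.
    return sorted(str(p) for p in root.rglob("*.wav"))


def pair_with_transcripts(paths: list[str], transcriptions: list[str]) -> dict[str, str]:
    """Map each audio path to its transcription, assuming both lists share the same order."""
    if len(paths) != len(transcriptions):
        raise ValueError("Expected exactly one transcription per audio file")
    return dict(zip(paths, transcriptions))


# Usage with the snippet above (left as comments so this block runs without NeMo):
# paths2audio_files = collect_wav_files("recordings")
# results = pair_with_transcripts(paths2audio_files, transcriptions)
```

Sorting the paths keeps runs reproducible, which helps when you compare outputs across batch sizes or model versions.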

Understanding the Code: An Analogy

Let’s break down the code above using an analogy. Imagine you’re a chef preparing a meal (transcribing audio). First, you open your recipe book (load the model). Next, you lay out your ingredients (list the audio file paths). You then decide how many dishes you can cook at once (the batch size). Finally, you follow the recipe (transcribe) and serve the meal (print the transcriptions). Each ingredient becomes a finished dish: one text output per audio file.

Training Data

The model was trained on the training split of the Common Voice 11.0 dataset.

Data Preparation and Training Procedure

Data preparation relied on the NeMo toolkit: the audio manifests were processed and the dataset was cleaned before training. Training started with a learning rate of 0.005, which was adjusted as training progressed.
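For readers who have not met a NeMo manifest before: it is a text file with one JSON object per line, each giving an audio_filepath, a duration in seconds, and the reference text. The sketch below writes one using only the standard library. The helper names are hypothetical, and this is not the preparation script used for the model, whose exact cleaning steps are not described here.

```python
import json
import wave


def wav_duration_seconds(path: str) -> float:
    """Compute the duration of a PCM .wav file from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path: str) -> int:
    """Write (wav_path, transcript) pairs as a JSON-lines manifest; return the entry count."""
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, text in samples:
            entry = {
                "audio_filepath": wav_path,
                "duration": round(wav_duration_seconds(wav_path), 3),
                "text": text,
            }
            # ensure_ascii=False keeps Catalan characters such as "à" readable.
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as write_manifest([("clip.wav", "bon dia")], "train_manifest.json") produces one line per clip, with both file names standing in for your own data.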

Evaluation

In testing, this model achieved a Word Error Rate (WER) of 6.684. WER measures how far the transcriptions stray from the reference text, so lower is better.
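To make the metric concrete: WER counts the word substitutions, deletions, and insertions needed to turn the model’s output into the reference, then divides by the number of reference words. The sketch below computes it with a standard edit-distance table. It is an illustration of the metric, not the evaluation script behind the figure above, and the function name is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("el gat dorm al sofà", "el gat dorm al sofa"))
# 0.2
```

One word out of five differs, hence 0.2. For a whole test set, the usual approach is to sum the errors and the reference word counts across all utterances before dividing, rather than averaging per-file scores.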

Troubleshooting

If you encounter issues while using this model, consider the following troubleshooting steps:

Ensure that the NeMo toolkit is both installed and updated to the latest version.
Verify the file paths for the audio files to ensure they are accurate.
Adjust the batch size if you encounter memory errors or inefficiencies during the transcription process.
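The last two checks can be automated. The sketch below verifies that every path exists before any work starts, then retries with half the batch size whenever an out-of-memory error is raised. The wrapper and its transcribe_fn argument are hypothetical; transcribe_fn stands for any callable that takes a list of paths and a batch size. The message check assumes the error text contains "out of memory", which is how PyTorch words its CUDA memory errors.

```python
import os


def transcribe_with_backoff(transcribe_fn, audio_paths, batch_size=8):
    """Run transcribe_fn(audio_paths, batch_size), halving batch_size on memory errors."""
    missing = [p for p in audio_paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Audio files not found: {missing}")
    while True:
        try:
            return transcribe_fn(audio_paths, batch_size)
        except (MemoryError, RuntimeError) as err:
            # Only retry on out-of-memory conditions; re-raise anything else.
            out_of_memory = isinstance(err, MemoryError) or "out of memory" in str(err).lower()
            if not out_of_memory or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
```

With a loaded NeMo model, transcribe_fn could be a small lambda that forwards both arguments to model.transcribe.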


Additional Information

The model was developed by the Text Mining Unit (TeMU) at the Barcelona Supercomputing Center. For inquiries, you can reach out at aina@bsc.es.

This project was funded by the Generalitat de Catalunya as part of the Projecte AINA.

