In the realm of machine translation, the seamless transformation from one language to another can make a world of difference, especially amidst the diverse linguistic landscape of Spain and Catalonia. This guide will walk you through the essentials of using the PlanTL Spanish-Catalan machine translation model, empowering you to harness its capabilities effectively.
Table of Contents
- Model Description
- Intended Uses and Limitations
- How to Use
- Training
- Evaluation
- Additional Information
Model Description
The PlanTL model is built from scratch with the Fairseq toolkit and trained on a robust dataset of over 92 million sentences spanning domains such as general, administrative, technology, biomedical, and news.
Intended Uses and Limitations
This machine translation model is designed for translating sentences from Spanish to Catalan. However, it’s essential to be aware of its limitations, particularly when dealing with highly specialized or nuanced content.
How to Use
To get started with the PlanTL model, you’ll need to install a couple of essential libraries:
```bash
pip install ctranslate2 pyonmttok
```
Next, use the following Python code to translate a sentence:
```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

# Download the model
model_dir = snapshot_download(repo_id="PlanTL-GOB-ES/mt-plantl-es-ca", revision="main")

# Initialize the tokenizer with the bundled SentencePiece model
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")

# Tokenize the input sentence; tokenize() returns a (tokens, features) pair
tokenized = tokenizer.tokenize("Bienvenido al Proyecto PlanTL!")

# Initialize the translator
translator = ctranslate2.Translator(model_dir)

# Translate the tokenized input and detokenize the best hypothesis
translated = translator.translate_batch([tokenized[0]])
print(tokenizer.detokenize(translated[0][0]["tokens"]))
```
Here, we can think of the model as a highly skilled interpreter at a bustling multilingual conference. Just as the interpreter listens to a speaker, processes the words, and delivers the translation to the audience, the model tokenizes each sentence, translates it, and detokenizes the output. Each function in the snippet plays a critical role, just like each step of the interpreter's work.
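If you need to translate many sentences, translate_batch accepts a list of token lists in a single call. Below is a minimal sketch building on the snippet above; the example sentences and the beam_size setting are illustrative choices, not documented defaults:

```python
import ctranslate2
import pyonmttok
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="PlanTL-GOB-ES/mt-plantl-es-ca", revision="main")
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model")
translator = ctranslate2.Translator(model_dir)

# Illustrative Spanish input sentences
sentences = [
    "El modelo traduce frases del español al catalán.",
    "La documentación está disponible en línea.",
]

# tokenize() returns a (tokens, features) pair; keep only the token lists
batch = [tokenizer.tokenize(s)[0] for s in sentences]

# beam_size=5 is an illustrative setting, not a documented default
results = translator.translate_batch(batch, beam_size=5)

for result in results:
    # each result holds one or more hypotheses; take the best one
    print(tokenizer.detokenize(result[0]["tokens"]))
```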
Training
Training Data
The model draws from an extensive range of datasets, totaling over 92 million sentences. Here’s a brief overview of the datasets used:
- DOGC v2: 8,472,786 sentences
- El Periodico: 6,483,106 sentences
- EuroParl: 1,876,669 sentences
- WikiMatrix: 1,421,077 sentences
- Wikimedia: 335,955 sentences
- QED: 71,867 sentences
- TED2020 v1: 52,177 sentences
- CCMatrix v1: 56,103,820 sentences
- MultiCCAligned v1: 2,433,418 sentences
- ParaCrawl: 15,327,808 sentences
Training Procedure
Data Preparation
The data underwent a thorough cleaning process: it was deduplicated, filtered, and formatted to ensure quality and consistency before training.
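The exact cleaning pipeline is not spelled out here, but a typical pass over a parallel corpus drops empty lines, duplicates, and sentence pairs with implausible length ratios. A minimal sketch, with thresholds chosen purely for illustration:

```python
def clean_parallel_corpus(pairs, max_ratio=2.5, max_len=250):
    """Filter (source, target) pairs: drop empty lines, overly long
    sentences, implausible length ratios, and exact duplicates.
    Thresholds are illustrative, not the values used for this model."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # sentence too long
        ratio = len(src) / len(tgt)
        if ratio > max_ratio or ratio < 1 / max_ratio:
            continue  # suspicious length mismatch between the two sides
        if (src, tgt) in seen:
            continue  # exact duplicate pair
        seen.add((src, tgt))
        yield src, tgt

pairs = [("Hola, mundo.", "Hola, món."), ("Hola, mundo.", "Hola, món.")]
print(list(clean_parallel_corpus(pairs)))  # the duplicate pair is removed
```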
Tokenization
All data was tokenized with SentencePiece, so the model operates on subword units and can handle rare or unseen words consistently.
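Since the checkpoint ships the spm.model file loaded in the code above, you can inspect that tokenization directly with the sentencepiece library (pip install sentencepiece). A minimal sketch:

```python
import sentencepiece as spm
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="PlanTL-GOB-ES/mt-plantl-es-ca", revision="main")

# Load the SentencePiece model shipped with the checkpoint
sp = spm.SentencePieceProcessor(model_file=model_dir + "/spm.model")

# Show the subword pieces the translation model actually sees
print(sp.encode("Bienvenido al Proyecto PlanTL!", out_type=str))
```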
Hyperparameters
The model leverages key hyperparameters tailored for optimal performance, including:
- Architecture: transformer_vaswani_wmt_en_de_big
- Embedding size: 1024
- Feedforward size: 4096
- Number of heads: 16
- Encoder layers: 24
- Learning rate: 1e-3
Evaluation
Variables and Metrics
The effectiveness of the model is gauged using the BLEU score against various test sets, ensuring its quality and reliability.
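BLEU measures n-gram overlap between a system's output and reference translations. If you want to score your own outputs the same way, the sacrebleu package provides a standard implementation; the hypothesis and reference strings below are placeholders:

```python
import sacrebleu

# Placeholder system outputs and reference translations
hypotheses = ["Benvingut al Projecte PlanTL!"]
references = ["Benvingut al Projecte PlanTL!"]

# corpus_bleu expects a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")
```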
Evaluation Results
Here’s how the PlanTL model compares with other popular translation tools:
| Test set | Softcatalà | Google Translate | mt-plantl-es-ca |
|---|---|---|---|
| Spanish Constitution | 63.6 | 61.7 | 63.0 |
| United Nations | 73.8 | 74.8 | 74.9 |
| Flores 101 dev | 22.0 | 23.1 | 22.5 |
| Cybersecurity | 61.4 | 69.5 | 67.3 |
| Average | 46.4 | 47.8 | 47.6 |
Additional Information
Author
Developed by the Text Mining Unit (TeMU) at the Barcelona Supercomputing Center.
Contact Information
For inquiries, reach out to plantl-gob-es@bsc.es.
Copyright and Licensing Information
Copyright is held by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA), and the model is licensed under the Apache License, Version 2.0.
Disclaimer
The models available here are intended for general use and may contain biases that need to be addressed. It’s essential for third parties using these models for deployment or service provision to take responsibility for their application while adhering to applicable AI regulations.
Troubleshooting
If you encounter issues while using the PlanTL model, here are some troubleshooting tips:
- Ensure all required libraries are installed correctly.
- Verify that the model directory path is accurately specified (a quick sanity check follows this list).
- Check if your input sentences are appropriately tokenized before translation.
- If you experience slow performance, consider inspecting the batch size and available hardware resources.
- For further assistance or insights, stay connected with fxis.ai.
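As a quick sanity check for the second tip, you can confirm that the downloaded directory contains the files the translation snippet relies on. The file list below is an assumption about this checkpoint's layout (model.bin is CTranslate2's standard weights file):

```python
import os
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="PlanTL-GOB-ES/mt-plantl-es-ca", revision="main")

# Files the translation snippet depends on; this layout is assumed,
# not taken from official documentation
for name in ("model.bin", "spm.model"):
    path = os.path.join(model_dir, name)
    print(f"{name}: {'found' if os.path.exists(path) else 'MISSING'} ({path})")
```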
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.