Welcome to the world of machine translation! Today, we will explore the Swahili-English translation model (HPLT MT v1.0), trained on OPUS and HPLT data. This step-by-step guide will walk you through getting started with the model and putting it to use. Whether you are a researcher, developer, or language enthusiast, you should find this guide user-friendly!
What is HPLT MT v1.0?
The HPLT MT v1.0 repository contains a translation model specifically trained to translate Swahili to English. It utilizes data from OPUS and HPLT, employing a Transformer-based architecture along with a SentencePiece tokenizer for optimal performance.
Model Information
- Source Language: Swahili
- Target Language: English
- Dataset: OPUS and HPLT data
- Model Architecture: Transformer-base
- Tokenizer: SentencePiece (Unigram)
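If you want to peek at what the SentencePiece tokenizer listed above actually does, the short sketch below loads the vocabulary file shipped with the Marian release and prints the subword pieces for a sample sentence. It assumes the sentencepiece Python package and the model.sw-en.spm file mentioned later in this guide; adjust the path to wherever you downloaded the file.
import sentencepiece as spm
# Load the Unigram SentencePiece model that ships with the Marian release.
# The file name comes from the model files listed further down; adjust the path as needed.
sp = spm.SentencePieceProcessor(model_file='model.sw-en.spm')
print(sp.get_piece_size())  # size of the subword vocabulary
print(sp.encode('Habari ya asubuhi.', out_type=str))  # subword pieces for a Swahili sentence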
Cleaning was performed using OpusCleaner based on a set of basic rules. For further cleaning details, visit the filter files here.
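To give a flavor of what rule-based cleaning means in practice, here is an illustrative sketch of the kind of basic checks a cleaning step might apply, such as length limits and a source/target length ratio. This is not the actual HPLT filter configuration; the real rules live in the OpusCleaner filter files referenced above.
# Illustrative only: simple length and length-ratio rules, loosely in the spirit of
# basic parallel-corpus cleaning. The thresholds here are made up for demonstration.
def keep_pair(src, tgt, max_len=200, max_ratio=3.0):
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False  # drop empty segments
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False  # drop overly long segments
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1 / max_ratio <= ratio <= max_ratio  # drop badly mismatched pairs

print(keep_pair('Habari ya asubuhi.', 'Good morning.'))  # True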
Getting Started: Model Usage
This model is compatible with both MarianNMT and the Hugging Face transformers library. Below, we will break down how to use it with both frameworks.
Using MarianNMT
To run inference with MarianNMT, see the detailed guide in the InferenceDecodingTranslation section of our GitHub repository; a sample decoding command also follows the file list below. Make sure you have the necessary files:
- Model file: model.npz.best-chrf.npz
- Vocabulary file: model.sw-en.spm
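As a rough illustration, a typical Marian decoding command with these files looks something like the line below. Treat this as a hedged sketch: the exact flags and any wrapper scripts are described in the repository's inference guide, and input.sw / output.en are placeholder file names.
marian-decoder -m model.npz.best-chrf.npz -v model.sw-en.spm model.sw-en.spm -b 6 < input.sw > output.en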
Using Transformers
To use the Hugging Face transformers library, run the following script, making sure you have a compatible version installed (see the note below):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('HPLT/translate-sw-en-v1.0-hplt_opus')
model = AutoModelForSeq2SeqLM.from_pretrained('HPLT/translate-sw-en-v1.0-hplt_opus')

# Replace these placeholders with the Swahili sentences you want to translate
inputs = ['Input goes here.', 'Make sure the language is right.']

# Tokenize the batch and generate translations with beam search
batch_tokenized = tokenizer(inputs, return_tensors='pt', padding=True)
model_output = model.generate(
    **batch_tokenized, num_beams=6, max_new_tokens=512)

# Convert the generated token IDs back into text
batch_detokenized = tokenizer.batch_decode(
    model_output,
    skip_special_tokens=True,
)
print(batch_detokenized)
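In this snippet, num_beams=6 turns on beam search with six hypotheses per sentence and max_new_tokens=512 caps the length of each translation; both can be lowered to trade a little quality for speed. The printed result is a Python list with one English translation per input sentence.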
Note: Due to a known issue, avoid transformers versions 4.26 and 4.30. We recommend installing version 4.28:
pip install transformers==4.28
Understanding the Code: An Analogy
Think of the translation model like a chef who specializes in two distinct cuisines: Swahili and English. The chef has a pantry (the model files) that holds various ingredients (tokenizer and data). When you provide the chef with the right ingredients (the input text), they will carefully mix them together (process the input) using their culinary skills (transformer architecture) to create a delicious dish (the translated output). The chef follows the recipe (model training) precisely to achieve the best flavors, resulting in a meal that speaks to both palates!
Benchmarks
The model has shown impressive results when evaluated using Marian, with the following test scores:
| Test Set | BLEU | chrF++ | COMET22 |
|---|---|---|---|
| FLORES200 | 38.2 | 60.0 | 0.8249 |
| NTREX | 37.1 | 58.1 | 0.8267 |
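If you want to score your own outputs with the same metrics, the sacrebleu package computes BLEU and chrF++ directly from plain-text files; COMET22 needs the separate unbabel-comet package and is not shown here. The sketch below uses placeholder file names and is a generic recipe, not the exact HPLT evaluation setup.
import sacrebleu

# Placeholder file names: one detokenized sentence per line in each file.
hypotheses = [line.strip() for line in open('output.en', encoding='utf-8')]
references = [line.strip() for line in open('reference.en', encoding='utf-8')]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 gives chrF++
print(round(bleu.score, 1), round(chrf.score, 1))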
Troubleshooting Tips
If you run into issues while using this model, consider the following troubleshooting ideas:
- Ensure you are using a compatible transformers version (4.28 is recommended; avoid 4.26 and 4.30); a quick check is shown after this list.
- Double-check if all the required files are downloaded correctly and located in the appropriate directory.
- Verify that your input data is formatted correctly for the tokenizer.
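A quick way to run the first two checks is the small snippet below; it assumes the same model ID used earlier in this guide and simply confirms that the installed transformers version and the downloaded files behave as expected.
import transformers
from transformers import AutoTokenizer

print(transformers.__version__)  # 4.28.x is recommended; avoid 4.26 and 4.30

# If this call succeeds, the tokenizer files were downloaded and cached correctly.
tokenizer = AutoTokenizer.from_pretrained('HPLT/translate-sw-en-v1.0-hplt_opus')
print(tokenizer('Habari ya asubuhi.'))  # should print input_ids and attention_mask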
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Acknowledgements
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 10052546). It is brought to you by researchers from the University of Edinburgh and Charles University in Prague with support from the whole HPLT consortium.
Conclusion
With the HPLT MT v1.0 model in your toolkit, you’re well-equipped to tackle Swahili-English translation tasks. Embrace this powerful tool and contribute to the rich tapestry of language translation!

