Welcome to the world of machine translation! Today, we will explore the Swahili-English translation model (HPLT MT v1.0), trained on OPUS and HPLT data. This step-by-step guide will walk you through getting started with the model and putting it to use. Whether you are a researcher, developer, or language enthusiast, you should find this guide user-friendly!
What is HPLT MT v1.0?
The HPLT MT v1.0 repository contains a translation model specifically trained to translate Swahili to English. It utilizes data from OPUS and HPLT, employing a Transformer-based architecture along with a SentencePiece tokenizer for optimal performance.
Model Information
- Source Language: Swahili
- Target Language: English
- Dataset: OPUS and HPLT data
- Model Architecture: Transformer-base
- Tokenizer: SentencePiece (Unigram)
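If you want to peek at what the SentencePiece tokenizer listed above actually does, the short sketch below loads the vocabulary file shipped with the Marian release and prints the subword pieces for a sample sentence. It assumes the sentencepiece Python package and the model.sw-en.spm file mentioned later in this guide; adjust the path to wherever you downloaded the file.
import sentencepiece as spm
# Load the Unigram SentencePiece model that ships with the Marian release.
# The file name comes from the model files listed further down; adjust the path as needed.
sp = spm.SentencePieceProcessor(model_file='model.sw-en.spm')
print(sp.get_piece_size())  # size of the subword vocabulary
print(sp.encode('Habari ya asubuhi.', out_type=str))  # subword pieces for a Swahili sentence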
Cleaning was performed using OpusCleaner based on a set of basic rules. For further cleaning details, visit the filter files here.
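To give a flavor of what rule-based cleaning means in practice, here is an illustrative sketch of the kind of basic checks a cleaning step might apply, such as length limits and a source/target length ratio. This is not the actual HPLT filter configuration; the real rules live in the OpusCleaner filter files referenced above.
# Illustrative only: simple length and length-ratio rules, loosely in the spirit of
# basic parallel-corpus cleaning. The thresholds here are made up for demonstration.
def keep_pair(src, tgt, max_len=200, max_ratio=3.0):
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False  # drop empty segments
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False  # drop overly long segments
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1 / max_ratio <= ratio <= max_ratio  # drop badly mismatched pairs

print(keep_pair('Habari ya asubuhi.', 'Good morning.'))  # True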
Getting Started: Model Usage
This model is compatible with both MarianNMT and the Hugging Face transformers library. Below, we will break down how to use it with both frameworks.
Using MarianNMT
To run inference with MarianNMT, see the detailed guide in the InferenceDecodingTranslation section of our GitHub repository; a sample decoding command also follows the file list below. Make sure you have the necessary files:
- Model file: model.npz.best-chrf.npz
- Vocabulary file: model.sw-en.spm
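As a rough illustration, a typical Marian decoding command with these files looks something like the line below. Treat this as a hedged sketch: the exact flags and any wrapper scripts are described in the repository's inference guide, and input.sw / output.en are placeholder file names.
marian-decoder -m model.npz.best-chrf.npz -v model.sw-en.spm model.sw-en.spm -b 6 < input.sw > output.en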
Using Transformers
To use the Hugging Face transformers library, run the following script, making sure you have a compatible version installed (see the note below):
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('HPLT/translate-sw-en-v1.0-hplt_opus')
model = AutoModelForSeq2SeqLM.from_pretrained('HPLT/translate-sw-en-v1.0-hplt_opus')

# Replace these placeholders with the Swahili sentences you want to translate
inputs = ['Input goes here.', 'Make sure the language is right.']

# Tokenize the batch and generate translations with beam search
batch_tokenized = tokenizer(inputs, return_tensors='pt', padding=True)
model_output = model.generate(
    **batch_tokenized, num_beams=6, max_new_tokens=512)

# Convert the generated token IDs back into text
batch_detokenized = tokenizer.batch_decode(
    model_output,
    skip_special_tokens=True,
)
print(batch_detokenized)
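In this snippet, num_beams=6 turns on beam search with six hypotheses per sentence and max_new_tokens=512 caps the length of each translation; both can be lowered to trade a little quality for speed. The printed result is a Python list with one English translation per input sentence.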
Note: Due to a known issue, avoid transformers versions 4.26 and 4.30. We recommend installing version 4.28:
pip install transformers==4.28
Understanding the Code: An Analogy
Think of the translation model like a chef who specializes in two distinct cuisines: Swahili and English. The chef has a pantry (the model files) that holds various ingredients (tokenizer and data). When you provide the chef with the right ingredients (the input text), they will carefully mix them together (process the input) using their culinary skills (transformer architecture) to create a delicious dish (the translated output). The chef follows the recipe (model training) precisely to achieve the best flavors, resulting in a meal that speaks to both palates!
Benchmarks
The model has shown impressive results when evaluated using Marian, with the following test scores:
| Test Set | BLEU | chrF++ | COMET22 |
|---|---|---|---|
| FLORES200 | 38.2 | 60.0 | 0.8249 |
| NTREX | 37.1 | 58.1 | 0.8267 |
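If you want to score your own outputs with the same metrics, the sacrebleu package computes BLEU and chrF++ directly from plain-text files; COMET22 needs the separate unbabel-comet package and is not shown here. The sketch below uses placeholder file names and is a generic recipe, not the exact HPLT evaluation setup.
import sacrebleu

# Placeholder file names: one detokenized sentence per line in each file.
hypotheses = [line.strip() for line in open('output.en', encoding='utf-8')]
references = [line.strip() for line in open('reference.en', encoding='utf-8')]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 gives chrF++
print(round(bleu.score, 1), round(chrf.score, 1))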
Troubleshooting Tips
If you run into issues while using this model, consider the following troubleshooting ideas:
- Ensure you are using a compatible transformers version (4.28 is recommended; avoid 4.26 and 4.30); a quick check is shown after this list.
- Double-check if all the required files are downloaded correctly and located in the appropriate directory.
- Verify that your input data is formatted correctly for the tokenizer.
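A quick way to run the first two checks is the small snippet below; it assumes the same model ID used earlier in this guide and simply confirms that the installed transformers version and the downloaded files behave as expected.
import transformers
from transformers import AutoTokenizer

print(transformers.__version__)  # 4.28.x is recommended; avoid 4.26 and 4.30

# If this call succeeds, the tokenizer files were downloaded and cached correctly.
tokenizer = AutoTokenizer.from_pretrained('HPLT/translate-sw-en-v1.0-hplt_opus')
print(tokenizer('Habari ya asubuhi.'))  # should print input_ids and attention_mask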
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Acknowledgements
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 10052546). It is brought to you by researchers from the University of Edinburgh and Charles University in Prague with support from the whole HPLT consortium.
Conclusion
With the HPLT MT v1.0 model in your toolkit, you’re well-equipped to tackle Swahili-English translation tasks. Embrace this powerful tool and contribute to the rich tapestry of language translation!

