How to Train and Utilize T5 for English-Vietnamese Translation

Sep 11, 2024 | Educational

In this blog post, we’ll walk through the steps to set up and use a pretrained Text-To-Text Transfer Transformer (T5) model for translating between English and Vietnamese. We will be using the IWSLT15 dataset, a well-known resource in the machine translation community.

Step 1: Download the Dataset

The IWSLT15 English-Vietnamese data, provided by the Stanford NLP group, can be obtained from their website. For our experiments, the corpus is divided into three sets: training, development, and test sets (a small download sketch follows the list below).

  • Training Set: 133,317 sentence pairs – Download from GitHub
  • Development Set: 1,553 sentence pairs – Download from GitHub
  • Test Set: 1,268 sentence pairs – Download from GitHub
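
The following is one possible way to fetch the files directly from Python; the Stanford NLP mirror URL below is an assumption based on where the group has hosted the corpus, so swap in the GitHub links above if it has moved:

import urllib.request

# Assumed mirror of the IWSLT15 English-Vietnamese corpus; replace with the
# GitHub links above if this location is unavailable.
BASE_URL = "https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/"
FILES = [
    "train.en", "train.vi",      # training set
    "tst2012.en", "tst2012.vi",  # development set
    "tst2013.en", "tst2013.vi",  # test set
]

for name in FILES:
    urllib.request.urlretrieve(BASE_URL + name, name)
    print(f"Downloaded {name}")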

Step 2: Set Up the Environment

To utilize T5 for translation tasks, you need to have the proper environment set up. This involves installing the necessary libraries, including PyTorch, Hugging Face’s Transformers, and SentencePiece (which the T5 tokenizer relies on). Make sure the versions you install are compatible with your system.
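
As a quick sanity check, here is a minimal sketch, assuming the packages are installed with pip (exact version pins are up to you):

# Install the dependencies first, for example:
#   pip install torch transformers sentencepiece
import torch
import transformers

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")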

Step 3: Loading the Pretrained Model and Translating

The following code snippet demonstrates how to load the pretrained NlpHUST/t5-en-vi-small model and use it to translate an English sentence into Vietnamese:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Use a GPU if one is available; otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"There are {torch.cuda.device_count()} GPU(s) available.")
    print(f"We will use the GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device('cpu')

# Load the pretrained English-Vietnamese T5 model and its tokenizer.
model = T5ForConditionalGeneration.from_pretrained('NlpHUST/t5-en-vi-small')
tokenizer = T5Tokenizer.from_pretrained('NlpHUST/t5-en-vi-small')
model.to(device)

# The English sentence we want to translate.
src = "In school, we spent a lot of time studying the history of Kim Il-Sung, but we never learned much about the outside world, except that America, South Korea, Japan are the enemies."
tokenized_text = tokenizer.encode(src, return_tensors='pt').to(device)

# Generate the Vietnamese translation with beam search.
model.eval()
translated_ids = model.generate(
    tokenized_text,
    max_length=128,
    num_beams=5,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True
)
output = tokenizer.decode(translated_ids[0], skip_special_tokens=True)
print(output)

Step 4: Understanding the Code

Think of this code as setting up an automated translation assistant. You first check if your assistant (GPU) is available to help with fast translations. If it’s not, you’ll have to settle for the slower, but reliable, manual effort (CPU).

  • You bring in your assistant (the T5 model) and equip it with the necessary knowledge (the tokenizer), which converts text into the token IDs the model can work with (see the short sketch after this list).
  • You present a text (the English source sentence), and the assistant encodes and translates it.
  • Finally, the model’s output IDs are decoded back into readable text, and you receive the translated message, ready for use!
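
To make the encode/decode round trip concrete, here is a small sketch, assuming the same tokenizer loaded in Step 3; it shows what the tokenizer hands to the model and how the model’s output becomes readable text again:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('NlpHUST/t5-en-vi-small')

# Encoding turns a sentence into the integer token IDs the model operates on.
ids = tokenizer.encode("How are you?", return_tensors='pt')
print(ids)

# Decoding reverses the mapping, turning token IDs back into a readable string.
print(tokenizer.decode(ids[0], skip_special_tokens=True))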

Step 5: Interpreting the Output

Upon running the script, you should see an output similar to this:

Ở trường, chúng tôi dành nhiều thời gian để nghiên cứu về lịch sử Kim Il-Sung, nhưng chúng tôi chưa bao giờ học được nhiều về thế giới bên ngoài, ngoại trừ Mỹ, Hàn Quốc, Nhật Bản là kẻ thù.

Troubleshooting

As with any programming endeavor, you may run into issues. Below are some common troubleshooting tips:

  • If your environment does not recognize PyTorch or Transformers, ensure that they’re properly installed and that your Python version is compatible.
  • If CUDA is not available even though you have a compatible GPU, check your NVIDIA drivers and verify your installation of PyTorch.
  • If you’re getting poor translation results, consider fine-tuning the pretrained model on additional parallel data (such as the IWSLT15 training set from Step 1) to improve performance; a minimal fine-tuning sketch follows this list.
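
For that last point, here is a minimal fine-tuning sketch in plain PyTorch, not a full training recipe. It assumes the train.en and train.vi files from Step 1 are in the working directory, and the batch size, learning rate, and epoch count are placeholder values you will want to tune for your hardware:

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import T5ForConditionalGeneration, T5Tokenizer

class TranslationDataset(Dataset):
    """Pairs each English source line with its Vietnamese target line."""
    def __init__(self, src_path, tgt_path, tokenizer, max_length=128):
        with open(src_path, encoding="utf-8") as f:
            self.src = [line.strip() for line in f]
        with open(tgt_path, encoding="utf-8") as f:
            self.tgt = [line.strip() for line in f]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.src[idx], max_length=self.max_length,
                             truncation=True, padding="max_length", return_tensors="pt")
        dec = self.tokenizer(self.tgt[idx], max_length=self.max_length,
                             truncation=True, padding="max_length", return_tensors="pt")
        labels = dec["input_ids"].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5Tokenizer.from_pretrained("NlpHUST/t5-en-vi-small")
model = T5ForConditionalGeneration.from_pretrained("NlpHUST/t5-en-vi-small").to(device)

# Assumes train.en / train.vi from Step 1 are in the working directory.
train_loader = DataLoader(TranslationDataset("train.en", "train.vi", tokenizer),
                          batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for epoch in range(1):  # increase for a real run
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()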

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following these steps, you can successfully utilize the T5 model for English-Vietnamese translation tasks. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
