Are you looking to harness the power of AI for seamless English-to-Vietnamese translation? Look no further! This article will guide you through using the Text-To-Text Transfer Transformer (T5) model for your translation needs.
Dataset and Preparation
The translation journey begins with the right dataset. For this task, we will use the IWSLT15 English-Vietnamese dataset provided by the Stanford NLP group. The dataset is split into three sets (a download sketch follows the list):
- Training: 133,317 sentences (Download via GitHub)
- Development: 1,553 sentences (Download via GitHub)
- Test: 1,268 sentences (Download via GitHub)
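The list items above originally linked out to the data. As a convenience, here is a minimal fetch sketch; the base URL and filenames are assumptions based on the layout of the Stanford NMT project page, so verify them before relying on this.

```python
# Minimal sketch for fetching the IWSLT15 English-Vietnamese files.
# The base URL and filenames are assumptions -- check the Stanford NMT
# project page before use.
import urllib.request

BASE = "https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi/"
FILES = [
    "train.en", "train.vi",      # training set
    "tst2012.en", "tst2012.vi",  # development set
    "tst2013.en", "tst2013.vi",  # test set
]

for name in FILES:
    urllib.request.urlretrieve(BASE + name, name)
    print("Downloaded", name)
```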
Model Performance
The effectiveness of our translation model can be measured using the BLEU score. Here are the results obtained on the test set (a scoring sketch follows the table):
| Model | BLEU (Beam Search) |
|---|---|
| [Luong & Manning (2015)](https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf) | 23.30 |
| Sequence-to-sequence model with attention | 26.10 |
| [Neural Phrase-based Machine Translation (Huang et al. 2017)](https://arxiv.org/abs/1706.05565) | 27.69 |
| t5-en-vi-small (pretraining, without training data) | 28.46 (cased), 29.23 (uncased) |
| t5-en-vi-small (fine-tuning with training data) | 32.38 (cased), 33.19 (uncased) |
| t5-en-vi-base (pretraining, without training data) | 29.66 (cased), 30.37 (uncased) |
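To compute a corpus-level BLEU score like the ones above, the sacrebleu library is a common choice. This is a minimal sketch, not the exact evaluation pipeline behind the table; the sentence pairs below are illustrative placeholders, and the "uncased" figures correspond to scoring with `lowercase=True`.

```python
# Minimal BLEU scoring sketch with sacrebleu (pip install sacrebleu).
# The sentences are placeholders, not the IWSLT15 test set.
import sacrebleu

hypotheses = [
    "Chúng tôi học lịch sử ở trường.",
    "Thời tiết hôm nay rất đẹp.",
]
references = [[  # one reference stream, aligned with the hypotheses
    "Chúng tôi học lịch sử ở trường.",
    "Hôm nay thời tiết rất đẹp.",
]]

cased = sacrebleu.corpus_bleu(hypotheses, references)
uncased = sacrebleu.corpus_bleu(hypotheses, references, lowercase=True)
print(f"BLEU (cased): {cased.score:.2f}, BLEU (uncased): {uncased.score:.2f}")
```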
Using the T5 Model
Now, let’s delve into the code to see how we can implement the translation model. Here’s a simplified analogy: think of the T5 model as a well-trained translator who understands both English and Vietnamese. The input sentences are like notes handed to this translator, who returns a polished translation.
The provided code performs the following actions:
- Checks whether a GPU is available for processing (a speed booster!)
- Imports the pre-trained T5 model and tokenizer for English-Vietnamese translation
- Processes an example sentence and generates a translation
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Use the GPU when one is available; otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("There are %d GPU(s) available." % torch.cuda.device_count())
    print("We will use the GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device('cpu')

# Load the pre-trained English-Vietnamese T5 model and its tokenizer.
model = T5ForConditionalGeneration.from_pretrained('NlpHUST/t5-en-vi-small')
tokenizer = T5Tokenizer.from_pretrained('NlpHUST/t5-en-vi-small')
model.to(device)

src = "In school, we spent a lot of time studying the history of Kim Il-Sung, but we never learned much about the outside world, except that America, South Korea, Japan are the enemies."
tokenized_text = tokenizer.encode(src, return_tensors='pt').to(device)

# Switch to evaluation mode and generate the translation with beam search.
model.eval()
summary_ids = model.generate(
    tokenized_text,
    max_length=128,          # cap on the length of the generated sequence
    num_beams=5,             # beam search width
    repetition_penalty=2.5,  # discourage repeated phrases
    length_penalty=1.0,      # neutral preference for output length
    early_stopping=True      # stop once all beams have finished
)
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(output)
```
Output
The output of the code should be a translated version of your input sentence. In our case, it may look something like:
```
Ở trường, chúng tôi dành nhiều thời gian để nghiên cứu về lịch sử Kim Il-Sung, nhưng chúng tôi chưa bao giờ học được nhiều về thế giới bên ngoài, ngoại trừ Mỹ, Hàn Quốc, Nhật Bản là kẻ thù.
```
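If you need to translate several sentences at once, the same model and tokenizer can be reused in a small batched helper. The sketch below is not part of the original snippet: it assumes the `model`, `tokenizer`, and `device` objects defined above, and the helper name `translate` is hypothetical.

```python
# Hypothetical batched helper; reuses model, tokenizer, and device from above.
def translate(sentences, max_length=128):
    # Tokenize all sentences together, padding them to a common length.
    batch = tokenizer(sentences, return_tensors='pt', padding=True).to(device)
    ids = model.generate(
        batch.input_ids,
        attention_mask=batch.attention_mask,
        max_length=max_length,
        num_beams=5,
        early_stopping=True,
    )
    return tokenizer.batch_decode(ids, skip_special_tokens=True)

print(translate(["How are you?", "The weather is nice today."]))
```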
Troubleshooting
If you encounter issues while executing the code or during the translation process, here are some ideas on how to troubleshoot:
- Ensure you have all necessary libraries installed, especially `transformers` and `torch` (see the install command after this list).
- Check your internet connection if you have trouble downloading the pre-trained models.
- If you receive an error regarding CUDA, make sure your GPU drivers are up to date.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
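For the first bullet, the whole stack can typically be installed in one command (note that `T5Tokenizer` also needs `sentencepiece`):

```bash
pip install transformers torch sentencepiece
```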
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

