Are you ready to dive into the fascinating world of natural language processing and summarize Vietnamese texts with ease? In this guide, we will journey through the steps needed to set up and utilize the T5 (Text-To-Text Transfer Transformer) model for text summarization. Don’t worry; we will make it clear and user-friendly!
Getting Started
Before you start the implementation, ensure you have the necessary libraries installed. You will need the `torch` and `transformers` packages. If you haven’t installed them yet, you can do so using pip:
pip install torch transformers
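To confirm that everything installed correctly, you can run a quick sanity check (a minimal sketch; the version numbers printed on your machine will differ):

import torch
import transformers

# Print the installed versions and whether a CUDA-capable GPU is visible.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())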
Code Walkthrough
Now let’s explore the code together. Imagine a chef preparing a dish, carefully mixing their ingredients to achieve a perfect flavor. Similarly, we will carefully set up our model for summarization.
- Identify Resources: Just like checking if you have all the ingredients, we need to check for GPU availability to speed up our computations. If no GPU is available, the model will run on the CPU.
- Load the Model: Think of this step as choosing the right recipe. We load the T5ForConditionalGeneration model and its accompanying tokenizer, which are specific to Vietnamese.
- Data Preparation: Just like chopping vegetables, we need to prepare our text. We tokenize the input, converting it into a format the model understands (a short tokenizer sketch follows this list).
- Summarization Process: The actual summarization is akin to baking. After all the preparation, you let your dish bake, and in our case, we generate the summary through the model.
- Output the Result: Finally, just as you plate the dish, we decode the summary from the model back into a human-readable format and display it.
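If you want to see the "chopping" step in isolation before running the full script, here is a minimal sketch of tokenization alone. It downloads only the tokenizer, and the short sample sentence is purely illustrative:

from transformers import T5Tokenizer

# Load only the tokenizer to inspect how text becomes token IDs.
tokenizer = T5Tokenizer.from_pretrained('NlpHUST/t5-small-vi-summarization')

sample = "Nhiều doanh nghiệp vẫn chỉ đóng BHXH theo mức lương."  # illustrative sentence
ids = tokenizer.encode(sample, return_tensors='pt')
print(ids.shape)  # a tensor of shape [1, N], where N is the number of tokens
print(tokenizer.decode(ids[0], skip_special_tokens=True))  # roughly round-trips to the text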
Implementing the Code
Here’s the complete code you’ll need:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Check for GPU availability; fall back to the CPU if none is found.
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("There are %d GPU(s) available." % torch.cuda.device_count())
    print("We will use the GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device('cpu')

# Load the pretrained Vietnamese summarization model and its tokenizer.
model = T5ForConditionalGeneration.from_pretrained('NlpHUST/t5-small-vi-summarization')
tokenizer = T5Tokenizer.from_pretrained('NlpHUST/t5-small-vi-summarization')
model.to(device)

# Sample Vietnamese news text about social insurance (BHXH) contributions.
src = """Theo BHXH Việt Nam, nhiều doanh nghiệp vẫn chỉ đóng BHXH cho người lao động theo mức lương.
Dù quy định từ 1/1/2018, tiền lương tháng đóng BHXH gồm mức lương và thêm khoản bổ sung khác..."""

# Tokenize the input text and move the tensor to the selected device.
tokenized_text = tokenizer.encode(src, return_tensors='pt').to(device)

# Switch to evaluation mode and generate the summary with beam search.
model.eval()
summary_ids = model.generate(
    tokenized_text,
    max_length=256,
    num_beams=5,
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True
)

# Decode the generated token IDs back into human-readable text.
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(output)
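As a side note, inference does not need gradients, so you can wrap generation in `torch.no_grad()` to save memory. Since t5-small has a limited input length, truncating long articles at encode time is also a sensible guard. Here is a hedged variant of the generation step; the `max_length=512` input cap is an assumption for illustration, not a value from the original script:

# Truncate overly long inputs at encode time (the 512-token cap is an assumed value).
tokenized_text = tokenizer.encode(
    src, return_tensors='pt', max_length=512, truncation=True
).to(device)

# Disable gradient tracking during inference to reduce memory use.
with torch.no_grad():
    summary_ids = model.generate(
        tokenized_text,
        max_length=256,
        num_beams=5,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True
    )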
Expected Output
When executed successfully, the code will print a Vietnamese summary along these lines:
"Nhiều doanh nghiệp vẫn chủ yếu xây dựng thang, bảng lương để đóng BHXH bằng mức thấp nhất.
Dù quy định từ 1/1/2018, tiền lương tháng đóng BHXH gồm mức lương và thêm khoản bổ sung khác.
Thống kê của BHXH Việt Nam cho thấy, nhiều doanh nghiệp vẫn chỉ đóng BHXH cho người lao động theo mức lương mà không có khoản bổ sung khác."
Troubleshooting Tips
If you encounter any issues during the setup or execution of the code, here are some troubleshooting ideas:
- Ensure that both `torch` and `transformers` are correctly installed and updated to recent versions.
- If you receive a GPU-related error, check your CUDA installation or run the model on the CPU by adjusting the device setting (see the first snippet below).
- If the summary output is not as expected, consider tweaking the `max_length` or `num_beams` parameters to refine the output (see the second snippet below).
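To force CPU execution when CUDA is misbehaving, a minimal sketch:

# Force CPU execution regardless of CUDA availability.
device = torch.device('cpu')
model.to(device)
tokenized_text = tokenized_text.to(device)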
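And to experiment with the summary length and beam width, you can vary the generation parameters. The values below are illustrative starting points, not tuned recommendations:

# Shorter summary with a wider beam search.
summary_ids = model.generate(
    tokenized_text,
    max_length=128,   # cap the summary at fewer tokens
    num_beams=8,      # explore more candidate sequences
    repetition_penalty=2.5,
    length_penalty=1.0,
    early_stopping=True
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))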
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
