How to Use GPorTuguese-2 for Portuguese Text Generation

May 24, 2021 | Educational

Are you ready to unlock the potential of text generation in Portuguese using GPorTuguese-2? In this article, we’ll take you through the steps to get started with this state-of-the-art language model. Buckle up, and let’s dive in!

What is GPorTuguese-2?

GPorTuguese-2 is a language model based on the GPT-2 small architecture and tailored for Portuguese. It was trained on Portuguese Wikipedia using transfer learning and fine-tuning, and it generates coherent Portuguese text from a prompt. Let’s walk through using GPorTuguese-2 in both PyTorch and TensorFlow environments.

Using GPorTuguese-2 with Hugging Face (PyTorch)

Follow these steps to start generating Portuguese text using PyTorch:

Load the Model and Tokenizer

from transformers import AutoTokenizer, AutoModelForCausalLM  # AutoModelWithLMHead is deprecated
import torch

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = AutoModelForCausalLM.from_pretrained("pierreguillou/gpt2-small-portuguese")
# Cap the tokenizer at the model's maximum sequence length (1024 tokens)
tokenizer.model_max_length = 1024
model.eval()  # disable dropout for inference (call model.train() before fine-tuning)

Generate One Word

# Input sequence
text = "Quem era Jim Henson? Jim Henson era um"
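inputs = tokenizer(text, return_tensors='pt')

# Model output (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(**inputs, labels=inputs['input_ids'])
loss, logits = outputs[:2]
# Greedy choice: the highest-scoring token at the last position
predicted_index = torch.argmax(logits[0, -1, :]).item()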
predicted_text = tokenizer.decode([predicted_index])

# Results
print("Input text:", text)
print("Predicted text:", predicted_text)
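
To see what the argmax step above is doing without downloading the model, here is a minimal, self-contained sketch. The five-entry vocabulary and the logit values are invented for illustration; they are not GPorTuguese-2's real vocabulary or outputs.

```python
import math

# Hypothetical toy vocabulary and scores, for illustration only.
toy_vocab = ["ator", "músico", "cartunista", "livro", "azul"]
toy_logits = [2.1, 1.3, 3.4, -0.5, -2.0]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits, vocab):
    """Return the highest-scoring token and its probability."""
    probs = softmax(logits)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best], probs[best]

token, prob = greedy_next_token(toy_logits, toy_vocab)
print(f"Predicted token: {token} (p={prob:.2f})")
```

Note that GPT-2 models predict subword tokens, so the real "Predicted text" may be a piece of a word rather than a whole word.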

Generate One Full Sequence

# Input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer(text, return_tensors='pt')

# Model output using the top-k sampling method
sample_outputs = model.generate(inputs['input_ids'],
                                pad_token_id=50256,
                                do_sample=True,
                                max_length=50,  # total length in tokens, prompt included
                                top_k=40,
                                num_return_sequences=1)

# Generated sequence
for i, sample_output in enumerate(sample_outputs):
    print("Generated text", i+1, ":", tokenizer.decode(sample_output.tolist()))
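
For intuition about what top_k=40 does at each generation step, the sketch below applies the same idea to a toy list of scores in plain Python. It is a simplified illustration, not the library's implementation: the helper name top_k_sample and the logit values are hypothetical.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (unnormalized is fine for choices()).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to those weights.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.1, 1.3, 3.4, -0.5, -2.0]  # hypothetical scores
rng = random.Random(0)  # seeded so repeated runs give the same draws
draws = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only the two best-scoring indices can appear
```

Because sampling is random, the generated text changes from run to run unless you fix a seed. A smaller k makes the output more conservative, and a larger k makes it more varied.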

Using GPorTuguese-2 with Hugging Face (TensorFlow)

If you prefer TensorFlow, here’s how to implement GPorTuguese-2:

Load the Model and Tokenizer

from transformers import AutoTokenizer, TFAutoModelForCausalLM  # TFAutoModelWithLMHead is deprecated
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = TFAutoModelForCausalLM.from_pretrained("pierreguillou/gpt2-small-portuguese")
# Cap the tokenizer at the model's maximum sequence length (1024 tokens)
tokenizer.model_max_length = 1024
# No model.eval() here: Keras models have no eval() method, and dropout only runs when training=True

Generate One Full Sequence

# Input sequence
text = "Quem era Jim Henson? Jim Henson era um"
inputs = tokenizer.encode(text, return_tensors='tf')

# Model output using the top-k sampling method
outputs = model.generate(inputs,
                         eos_token_id=50256,
                         pad_token_id=50256,
                         do_sample=True,
                         max_length=40,  # total length in tokens, prompt included
                         top_k=40)

print(tokenizer.decode(outputs[0]))

Understanding the Code: An Analogy

Imagine you are building a chef’s kitchen, but instead of cooking delicious meals, this kitchen is set up to generate meaningful text. Each ingredient represents a part of your code, where:

Load the Model and Tokenizer: Like arranging your ingredients (spices, vegetables, and so on) in cabinets within reach, you initialize the model and tokenizer so they’re ready to use.
Generate One Word: This is like tasting the food as you go. You provide a base input, and the model predicts the single most likely next token (a word or a piece of one), which is one small step toward the finished dish.
Generate One Full Sequence: Finally, you cook the whole dish. By layering on one token after another, the model builds a complete passage, just as a chef combines ingredients into a finished meal.

Troubleshooting

If you encounter any issues while using GPorTuguese-2, here are some troubleshooting tips:

Ensure you have the necessary libraries installed, such as ‘transformers’ and ‘torch’ or ‘tensorflow’ depending on your chosen framework.
Check your input text for unsupported characters or formats that might lead to errors.
Make sure your GPU is set up correctly if you are using one for training or inference, and that it has sufficient memory to accommodate the model.
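
As a quick way to act on the first tip, the sketch below reports which of the libraries are importable in the current environment. It uses only the Python standard library; the helper name check_packages is made up for this example.

```python
import importlib.util

def check_packages(names):
    """Map each top-level package name to True if it can be found here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(["transformers", "torch", "tensorflow"])
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

You only need ‘torch’ or ‘tensorflow’, not both, so one "missing" line for the framework you are not using is expected.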


Conclusion

Now you’re equipped to harness the power of GPorTuguese-2 for generating Portuguese text. The potential applications are vast, from content creation to NLP tasks. Remember to use this tool wisely, keeping in mind ethical considerations and biases associated with language models.

