How to Use the GPT2-Large Model Trained on Data from the National Library of Spain (BNE)

Nov 27, 2022 | Educational

In this blog, we will explore how to effectively use the GPT2-large-bne, a transformer-based language model tailored for Spanish text generation, developed using extensive data from the National Library of Spain. This guide is designed to be user-friendly and will walk you through the essential steps to harness this powerful tool.

Overview

  • Architecture: gpt2-large
  • Language: Spanish
  • Task: Text generation
  • Data: BNE (Spanish corpus)

Model Description

The GPT2-large-bne model is a specialized version of the popular GPT-2, trained exclusively on a large-scale corpus of 570GB of clean Spanish text collected from the web. This corpus was compiled through extensive crawling conducted by the National Library of Spain (Biblioteca Nacional de España) between 2009 and 2019.

Intended Uses and Limitations

This model can be utilized for generating text or fine-tuning for specific tasks. However, it is essential to understand its limitations, particularly concerning potential biases embedded in the training data due to varied web sources.

How to Use the GPT2-Large-BNE Model

Here’s a step-by-step guide on how to use this model for text generation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
generator = pipeline("text-generation", tokenizer=tokenizer, model=model)

# Fix the random seed so repeated runs produce the same samples
set_seed(42)
generator("La Biblioteca Nacional de España es una entidad pública y sus fines son", num_return_sequences=5)
```

Once executed, this returns five different continuations of the prompt. Because set_seed(42) fixes the random state, re-running the script reproduces the same outputs; change the seed or the prompt to sample new continuations.
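Beyond num_return_sequences, the pipeline forwards additional keyword arguments to the model's generate method, so you can control how adventurous the sampling is. The values below are illustrative, not tuned for this model:

```python
# Illustrative sampling settings, forwarded by the pipeline to generate(), e.g.:
#   generator("La Biblioteca Nacional de España es ...", **gen_kwargs)
gen_kwargs = {
    "max_length": 50,          # cap on prompt + continuation, in tokens
    "do_sample": True,         # sample from the distribution instead of greedy decoding
    "top_k": 50,               # keep only the 50 most likely next tokens
    "temperature": 0.9,        # below 1.0 is more conservative, above 1.0 more diverse
    "num_return_sequences": 3, # how many continuations to return
}
print(gen_kwargs["temperature"])
```

Lower temperatures tend to give safer, more repetitive text; higher values give more variety at the cost of coherence.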

Extracting Features from Text in PyTorch

You can also extract features from a text as follows:

```python
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
# GPT2Model returns hidden states rather than language-modeling logits
model = GPT2Model.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")

text = "La Biblioteca Nacional de España es una entidad pública y sus fines son"
encoded_input = tokenizer(text, return_tensors='pt')  # PyTorch tensors
output = model(**encoded_input)

print(output.last_hidden_state.shape)
```

This prints the shape of the last hidden state, (batch_size, sequence_length, hidden_size): one contextual embedding vector per input token.
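If you need a single fixed-size vector for the whole sentence (for similarity search, clustering, and the like), a common approach is mean pooling over the token dimension. Here is a minimal sketch using a dummy tensor standing in for output.last_hidden_state (gpt2-large's hidden size is 1280):

```python
import torch

# Dummy stand-in for model(**encoded_input).last_hidden_state:
# (batch_size=1, sequence_length=16, hidden_size=1280 for gpt2-large)
last_hidden_state = torch.randn(1, 16, 1280)

# Average over the token dimension to get one vector per input sequence
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1280])
```

With real model output, replace the dummy tensor with output.last_hidden_state from the snippet above.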

Limitations and Bias

No measures have yet been taken to estimate the biases within the model. It is crucial to recognize that language models reflect the societal biases inherent in the data they were trained on. For instance, prompts involving professions or family roles may yield completions that reproduce traditional gender stereotypes.

Troubleshooting

If you encounter issues using the model or generating text, consider the following suggestions:

  • Ensure that you have the correct version of the transformers library installed.
  • Check your internet connection, as the model weights are downloaded from the Hugging Face Hub on first use.
  • If generating text is not yielding results, try changing the seed or adjusting the prompt to see different outputs.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
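As a quick sanity check for the first bullet, you can confirm that the transformers package is installed and see which version you have:

```python
import importlib.metadata

# Look up the installed transformers version without importing the library itself
try:
    version = importlib.metadata.version("transformers")
    print("transformers", version)
except importlib.metadata.PackageNotFoundError:
    version = None
    print("transformers is not installed -- run: pip install transformers")
```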

Additional Information

For more details on model performance, you can reach out to the authors at the Text Mining Unit (TeMU) at the Barcelona Supercomputing Center via email at bsc-temu@bsc.es.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox