If you’re venturing into the world of Natural Language Processing (NLP) in Portuguese, you’ve landed in the right place. Today, we’ll walk you through the BERTimbau models, pre-trained BERT models designed specifically for the Portuguese language. Think of it as equipping yourself with a state-of-the-art language processing tool that understands the nuances of Portuguese, and is ready to tackle a variety of tasks such as Named Entity Recognition (NER).
What is BERTimbau?
BERTimbau comes in two sizes, BERT-Base and BERT-Large, both cased and pre-trained on BrWaC (Brazilian Web as Corpus), a large collection of Brazilian Portuguese web text. The models were trained for 1,000,000 steps using whole-word masking, which masks all the subword pieces of a word together and helps them better capture how words behave in context.
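You can probe the model's masked-language-model head directly with the Transformers `fill-mask` pipeline. Here's a minimal sketch; the example sentence is just an illustration:

```python
from transformers import pipeline

# Load BERTimbau Base together with its masked-language-model head.
fill_mask = pipeline("fill-mask", model="neuralmind/bert-base-portuguese-cased")

# Ask the model to fill in the masked word in a Portuguese sentence.
for prediction in fill_mask("Tinha uma [MASK] no meio do caminho."):
    print(prediction["token_str"], round(prediction["score"], 4))
```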
Downloading BERTimbau Models
Ready to download the models? Here’s how you can get started:
- Head over to the Hugging Face Hub, where both models are published as neuralmind/bert-base-portuguese-cased and neuralmind/bert-large-portuguese-cased, or fetch the weights programmatically, as sketched below.
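If you prefer to download the weights ahead of time, for example for offline use, the `huggingface_hub` library can pull a full model snapshot. A minimal sketch, assuming `huggingface_hub` is installed:

```python
from huggingface_hub import snapshot_download

# Download all files for the Base model into the local cache
# and print the directory where they were stored.
local_dir = snapshot_download("neuralmind/bert-base-portuguese-cased")
print(local_dir)
```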
Evaluation Benchmarks
These models were evaluated on three downstream tasks, where they outperform both Multilingual BERT (mBERT) and the previous state of the art (SOTA). Here's a brief overview:
| Task | Test Dataset | BERTimbau-Large | BERTimbau-Base | mBERT | Previous SOTA |
|---|---|---|---|---|---|
| STS (Pearson) | ASSIN2 | 0.852 | 0.836 | 0.809 | 0.83 |
| RTE (F1) | ASSIN2 | 90.0 | 89.2 | 86.8 | 88.3 |
| NER (F1) | MiniHAREM (5 classes) | 83.7 | 83.1 | 79.2 | 82.3 |
| NER (F1) | MiniHAREM (10 classes) | 78.5 | 77.6 | 73.1 | 74.6 |
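If you want to reproduce an STS-style score for your own predictions, the Pearson correlation between predicted and gold similarity values is straightforward to compute. A minimal sketch with placeholder values, not actual ASSIN2 data:

```python
from scipy.stats import pearsonr

# Placeholder similarity scores; a real evaluation uses ASSIN2 test pairs.
gold = [4.5, 2.0, 3.5, 1.0, 5.0]
predicted = [4.2, 2.3, 3.8, 1.5, 4.7]

correlation, _ = pearsonr(gold, predicted)
print(f"Pearson correlation: {correlation:.3f}")
```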
How to Use BERTimbau in PyTorch
Since the BERTimbau models are compatible with the Hugging Face Transformers library, integrating them into your workflow is straightforward. Here's how to load either model in your Python environment:
```python
from transformers import AutoModel, AutoTokenizer

# Using the BERT-Base model
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# Or, using the BERT-Large model
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')
```
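Once loaded, the model can turn Portuguese text into contextual embeddings. The sketch below follows the standard Transformers workflow with the tokenizer and model loaded above; the example sentence is just an illustration:

```python
import torch

# Tokenize an example Portuguese sentence.
inputs = tokenizer("Tinha uma pedra no meio do caminho.", return_tensors="pt")

# Forward pass without gradient tracking, since we only need embeddings.
with torch.no_grad():
    outputs = model(**inputs)

# Last hidden states: one vector per token, shape (1, seq_len, hidden_size).
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```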
Think of loading BERTimbau into your NLP system as planting a top-tier seed in a garden: with the right care and resources, it grows into a robust tree of knowledge that strengthens your language processing capabilities.
Troubleshooting Tips
If you encounter any issues while using the models or setting them up, consider the following troubleshooting steps:
- Ensure that you have the correct version of the Hugging Face Transformers library installed.
- Double-check the paths specified for the models and tokenizer – make sure they are correctly referenced.
- If you are running out of memory, consider model pruning, switching to the smaller BERT-Base model, or loading the weights in half precision (see the sketch after this list).
- For parsing errors or issues with downloading model artifacts, check your internet connection and try again.
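On the memory point in particular, a common workaround is loading the weights in half precision. A minimal sketch, assuming a recent Transformers version and that reduced precision is acceptable for your task:

```python
import torch
from transformers import AutoModel

# Load the Large model in float16 to roughly halve its memory footprint.
model = AutoModel.from_pretrained(
    'neuralmind/bert-large-portuguese-cased',
    torch_dtype=torch.float16,
)
```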
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Wrapping Up
Utilizing BERTimbau can significantly enhance your Portuguese NLP projects by providing a robust pre-trained model that can easily adapt to various tasks. We’d like to acknowledge Google for their cloud credits which facilitated the training of these remarkable models.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.