BERTimbau Base: Harnessing the Power of BERT for Brazilian Portuguese

Jun 17, 2022 | Educational

[Image: Bert holding a berimbau]

BERTimbau is a pretrained BERT model for Brazilian Portuguese that achieves strong performance on three downstream NLP tasks: Named Entity Recognition (NER), Sentence Textual Similarity (STS), and Recognizing Textual Entailment (RTE). It comes in two sizes, Base and Large, giving you flexibility based on your project’s requirements.

Available Models

BERTimbau is available in two versions you can leverage for your applications:

  • neuralmind/bert-base-portuguese-cased: BERT-Base, 12 layers, 110M parameters
  • neuralmind/bert-large-portuguese-cased: BERT-Large, 24 layers, 335M parameters

How to Use BERTimbau Base

Getting started with BERTimbau is straightforward. Follow these steps:

1. Loading the Model

To load the model and tokenizer, use the following code:

from transformers import AutoTokenizer
from transformers import AutoModelForPreTraining

model = AutoModelForPreTraining.from_pretrained("neuralmind/bert-base-portuguese-cased")
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased", do_lower_case=False)
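As a quick sanity check that the tokenizer loaded correctly, you can inspect how it splits a Portuguese sentence into WordPiece subwords. This is a minimal sketch; the example sentence is arbitrary:

```python
from transformers import AutoTokenizer

# Same tokenizer as above, shown self-contained here.
tokenizer = AutoTokenizer.from_pretrained(
    "neuralmind/bert-base-portuguese-cased", do_lower_case=False
)

# WordPiece splits words not in the vocabulary into subword pieces prefixed with "##".
tokens = tokenizer.tokenize("Tinha uma pedra no meio do caminho.")
print(tokens)
```

If a word comes back in several "##"-prefixed pieces, that is expected behavior for rarer vocabulary, not an error.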

2. Performing Masked Language Modeling

Next, you can utilize BERTimbau to predict masked words in a sentence:

from transformers import pipeline

pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
results = pipe("Tinha uma [MASK] no meio do caminho.")

The pipeline returns the most likely tokens for the [MASK] position, each with a confidence score.
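Each prediction in the results is a dict containing a score, the predicted token, and the completed sentence. A minimal sketch of inspecting the top candidates (field names follow the Transformers fill-mask pipeline output; loading the pipeline by model name, as here, lets it pick the matching masked-LM class automatically):

```python
from transformers import pipeline

# Build the fill-mask pipeline directly from the model name.
pipe = pipeline("fill-mask", model="neuralmind/bert-base-portuguese-cased")
results = pipe("Tinha uma [MASK] no meio do caminho.", top_k=5)

for r in results:
    # Each result carries the probability, the token text, and the filled-in sentence.
    print(f"{r['score']:.3f}  {r['token_str']}  ->  {r['sequence']}")
```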

3. Extracting BERT Embeddings

To extract contextual embeddings, load the encoder with AutoModel rather than AutoModelForPreTraining (whose first output is prediction logits, not hidden states):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased")
input_ids = tokenizer.encode("Tinha uma pedra no meio do caminho.", return_tensors="pt")
with torch.no_grad():
    outs = model(input_ids)
    encoded = outs[0][0, 1:-1]  # last hidden state, ignoring [CLS] and [SEP]

This yields a tensor of shape (sequence_length − 2, 768) for the Base model: one 768-dimensional contextual embedding per token, with the [CLS] and [SEP] special tokens removed.
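A common use of these token embeddings is to collapse them into a fixed-size sentence vector by mean pooling. This is a minimal sketch under that assumption, not a recipe from the original model card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(name, do_lower_case=False)
model = AutoModel.from_pretrained(name)

input_ids = tokenizer.encode("Tinha uma pedra no meio do caminho.", return_tensors="pt")
with torch.no_grad():
    outs = model(input_ids)

# Average the token embeddings (excluding [CLS]/[SEP]) into one 768-d vector.
sentence_vec = outs[0][0, 1:-1].mean(dim=0)
print(sentence_vec.shape)  # torch.Size([768])
```

Mean pooling is a simple baseline; for similarity tasks, models fine-tuned specifically for sentence embeddings usually perform better.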

Troubleshooting Tips

As you work with BERTimbau Base, you may encounter some common issues. Here are a few troubleshooting tips:

  • Model Not Found: Ensure you are using the correct model name when loading.
  • Tokenization Issues: Double-check that you are using the appropriate tokenizer that corresponds to the model.
  • Insufficient Memory: If the model is too large to load, consider using the Base model instead of the Large.
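For the memory issue in particular, one defensive pattern is to try the Large model and fall back to Base if loading fails. A hedged sketch (the helper name is ours; the model names are the two listed above):

```python
from transformers import AutoModel

def load_bertimbau(prefer_large=True):
    """Try BERTimbau Large first, falling back to Base if loading fails."""
    names = ["neuralmind/bert-large-portuguese-cased",
             "neuralmind/bert-base-portuguese-cased"]
    if not prefer_large:
        names = names[1:]
    last_err = None
    for name in names:
        try:
            return AutoModel.from_pretrained(name)
        except (MemoryError, RuntimeError, OSError) as err:
            last_err = err  # e.g. out of memory or a download failure; try next size
    raise last_err

model = load_bertimbau(prefer_large=False)  # load Base directly to save memory
print(model.config.hidden_size)  # 768 for the Base model
```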

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

BERTimbau is a powerful tool for anyone looking to leverage advanced NLP for Brazilian Portuguese. With strong results on NER, STS, and RTE, it is a reliable foundation for language processing applications.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
