How to Use UmBERTo: A RoBERTa-Based Language Model for Italian

In the realm of Natural Language Processing, UmBERTo stands out as a language model tailored specifically for Italian. Built on the RoBERTa architecture, UmBERTo uses SentencePiece tokenization and Whole Word Masking to improve how it learns. In this article, we will walk through how to load and use UmBERTo in your projects, troubleshoot common issues, and understand its training through simple analogies.

Understanding UmBERTo

UmBERTo is like a refined chef, expertly trained in the kitchen of Italian-language corpora. It carefully combines ingredients (data) from Wikipedia's vast repository, mixing them into a rich understanding of the language. Training was done on a corpus of about 7GB from Wikipedia-ITA, giving the model a substantial background to draw insights from.

Essential Features of UmBERTo

  • Base Model: RoBERTa
  • Innovative Techniques (see the tokenization sketch after this list):
    • SentencePiece
    • Whole Word Masking
  • Training Data: Approximately 7GB from Wikipedia-ITA.
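
To see SentencePiece in action, the short sketch below tokenizes an Italian sentence with the UmBERTo tokenizer. The exact pieces printed depend on the learned vocabulary, so the output shown in the comment is illustrative only:

from transformers import AutoTokenizer

# SentencePiece marks the start of each word with "▁", which is what
# lets Whole Word Masking identify and mask complete words
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
tokens = tokenizer.tokenize("Umberto Eco è stato un grande scrittore")
print(tokens)  # illustrative: pieces like '▁umberto', '▁eco', ... for this uncased model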

Using UmBERTo with the Transformers Library

Getting started with UmBERTo is as easy as pie! Follow these steps to load and use the model:

1. Load UmBERTo Wikipedia Uncased

First, we need to load the model and tokenizer. Here’s how you do it:

import torch
from transformers import AutoTokenizer, AutoModel

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# Convert the sentence to token IDs and add a batch dimension
encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output
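
If you need one vector per sentence rather than one per token, a common approach (not specific to UmBERTo) is mean pooling over the last hidden states while ignoring padding. Here is a minimal sketch reusing the tokenizer and umberto objects from above; the hidden size of 768 assumes a base-size model:

# Tokenize with an attention mask so padded positions can be excluded
batch = tokenizer(["Umberto Eco è stato un grande scrittore"],
                  return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = umberto(**batch)

hidden = outputs[0]                                   # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embedding = (hidden * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for a base-size model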

2. Predicting Masked Tokens

With a few simple commands, you can predict masked tokens in a sentence. This ability is akin to a grammar-savvy friend filling in the blanks when you don’t remember a word!

from transformers import pipeline

# Build a fill-mask pipeline backed by UmBERTo
fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-wikipedia-uncased-v1",
    tokenizer="Musixmatch/umberto-wikipedia-uncased-v1",
)
# The blank must use the model's mask token, written exactly as <mask>
result = fill_mask("Umberto Eco è <mask> un grande scrittore")
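
The pipeline returns a list of candidate fills, each a dictionary with the completed sentence, a confidence score, and the predicted token. A quick way to inspect them (the score in the comment is illustrative):

# Print each candidate token with its confidence score
for prediction in result:
    print(f"{prediction['token_str']!r}  score={prediction['score']:.4f}")
# e.g. 'stato'  score=0.9...  (actual candidates and scores will vary)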

Performance on Downstream Tasks

UmBERTo shines in Named Entity Recognition (NER) and part-of-speech (POS) tagging tasks. Here are the reported performance metrics:

Named Entity Recognition (NER)

Dataset          F1 Score   Precision   Recall   Accuracy
ICAB-EvalITA07   86.240     85.939      86.544   98.534
WikiNER-ITA      90.483     90.328      90.638   98.661

Part of Speech (POS)

Dataset             F1 Score   Precision   Recall   Accuracy
UD_Italian-ISDT     98.563     98.508      98.618   98.717
UD_Italian-ParTUT   97.810     97.835      97.784   98.060
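
The scores above come from fine-tuning UmBERTo on task-specific labeled data. If you want to train your own tagger, a minimal starting sketch is shown below; the num_labels value is a placeholder for the size of your own tag set, not something fixed by UmBERTo:

from transformers import AutoModelForTokenClassification, AutoTokenizer

# num_labels is a placeholder: set it to the number of tags in your dataset
model = AutoModelForTokenClassification.from_pretrained(
    "Musixmatch/umberto-wikipedia-uncased-v1",
    num_labels=9,
)
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

# From here, fine-tune with the Trainer API or a custom training loop
# on a labeled Italian corpus such as WikiNER-ITA or UD_Italian-ISDT.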

Troubleshooting Tips

If you encounter any issues while using UmBERTo, consider the following troubleshooting options:

  • Ensure that the correct model and tokenizer names are used when loading UmBERTo ("Musixmatch/umberto-wikipedia-uncased-v1").
  • Check your internet connection if you’re facing problems while downloading the model.
  • Make sure you are using compatible versions of PyTorch and the Transformers library (see the environment check after this list).
  • If predictions look wrong, confirm that the mask token is written exactly as <mask> and try rephrasing the sentence.
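
As a quick environment check, the sketch below prints your library versions and re-downloads the model files in case a cached copy is corrupted; force_download is a standard from_pretrained option:

import torch
import transformers
from transformers import AutoModel

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)

# Force a fresh download, bypassing any possibly corrupted cache
umberto = AutoModel.from_pretrained(
    "Musixmatch/umberto-wikipedia-uncased-v1",
    force_download=True,
)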

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

UmBERTo offers an effective approach to NLP for the Italian language, processing text efficiently and predicting language structure well. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
