The sentence similarity model built on BERTIN-RoBERTa is a useful tool for language understanding in Spanish. Fine-tuned on NLI tasks, it maps sentences into a high-dimensional vector space, enabling tasks like semantic search and clustering. Here’s how to set it up and use it.
Setting Up the Framework
To get started with the Bertin-Roberta model, you’ll need to install the sentence-transformers library. Here are the steps you need to follow:
- Open your terminal or command prompt.
- Run the command:
pip install -U sentence-transformers
Usage with Sentence-Transformers
Once you have the library installed, using the model is straightforward. You can convert your sentences into embeddings with the provided code:
from sentence_transformers import SentenceTransformer
sentences = ["Este es un ejemplo", "Cada oración es transformada"]
model = SentenceTransformer('hackathon-pln-es/bertin-roberta-base-finetuning-esnli')
embeddings = model.encode(sentences)
print(embeddings)
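These embeddings can drive a simple semantic search: encode a query, then rank corpus sentences by cosine similarity to it. Here is a minimal sketch with NumPy, using hypothetical toy vectors in place of real `model.encode(...)` outputs:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for model.encode(...) outputs
corpus = {
    "Este es un ejemplo": np.array([0.9, 0.1, 0.2]),
    "Cada oración es transformada": np.array([0.1, 0.8, 0.3]),
}
query_embedding = np.array([0.85, 0.15, 0.25])

# Rank corpus sentences from most to least similar to the query
ranked = sorted(corpus.items(),
                key=lambda kv: cosine_sim(query_embedding, kv[1]),
                reverse=True)
for sentence, emb in ranked:
    print(sentence, round(cosine_sim(query_embedding, emb), 3))
```

With real embeddings, the same ranking logic applies; only the vectors change.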
Using HuggingFace Transformers
If you prefer not to use sentence-transformers, you can also call the model using HuggingFace’s Transformers library:
from transformers import AutoTokenizer, AutoModel
import torch
# Mean pooling: average token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentences = ["Este es un ejemplo", "Cada oración es transformada"]
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/bertin-roberta-base-finetuning-esnli')
model = AutoModel.from_pretrained('hackathon-pln-es/bertin-roberta-base-finetuning-esnli')
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
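Mean-pooled embeddings are often L2-normalized afterwards, so that a plain dot product between two sentence vectors equals their cosine similarity. A minimal NumPy sketch, with toy 2-dimensional vectors standing in for the pooled outputs above:

```python
import numpy as np

# Toy pooled sentence embeddings (one row per sentence)
embeddings = np.array([[3.0, 4.0],
                       [1.0, 0.0]])

# L2-normalize each row; dot products between rows are now cosine similarities
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms

similarity_matrix = normalized @ normalized.T
print(similarity_matrix)  # diagonal is 1.0 (each sentence vs. itself)
```

The same normalization works on real `sentence_embeddings` tensors after converting them to arrays.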
Understanding the Code: The Analogy of Building Blocks
Imagine constructing a structure with blocks, where each block represents a word in a sentence. The model first gathers the individual blocks (word embeddings) and then stacks them together according to their relationships. Here’s how it works:
- The encoder represents each word as a vector (block), capturing its meaning in context.
- Pooling involves combining these blocks (vectors) into a single larger block (embedding) that maintains the meaning of the entire sentence.
- The result is your well-structured sentence representation that can easily be compared to another structure for similarity.
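The pooling step above can be sketched in isolation. A toy NumPy example (hypothetical numbers) showing how the attention mask keeps padding tokens out of the average:

```python
import numpy as np

# Toy token embeddings for one sentence: 4 tokens, 3 dimensions.
# The last token is padding (mask = 0) and must not affect the average.
token_embeddings = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding token, excluded by the mask
])
attention_mask = np.array([1, 1, 1, 0])

mask = attention_mask[:, None]                 # expand to (tokens, 1)
summed = (token_embeddings * mask).sum(axis=0) # sum of real tokens only
count = np.clip(mask.sum(axis=0), 1e-9, None)  # avoid division by zero
sentence_embedding = summed / count
print(sentence_embedding)  # average of the three real tokens
```

This mirrors the torch `mean_pooling` function above: multiply by the mask, sum, and divide by the number of real tokens.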
Testing and Evaluating Your Model
Once you have your embeddings, you can evaluate their quality with standard correlation metrics. The model was evaluated on a SemEval-2015 task, showing clear improvements in cosine and Euclidean similarity correlations after fine-tuning:
- Cosine Pearson: 0.609803 increased to 0.683188 (+12.03%)
- Euclidean Spearman: 0.526529 increased to 0.611539 (+16.15%)
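The metrics above correlate the model’s similarity scores with human gold labels: Pearson on the raw scores, Spearman on their ranks. A minimal NumPy sketch on hypothetical data (a real evaluation uses the full set of STS sentence pairs):

```python
import numpy as np

# Hypothetical model similarity scores and human gold scores for 5 pairs
model_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
gold_scores = np.array([0.0, 0.5, 0.30, 0.7, 1.0])

# Pearson: linear correlation of the raw scores
pearson = np.corrcoef(model_scores, gold_scores)[0, 1]

# Spearman: Pearson correlation of the ranks
def ranks(x):
    order = np.argsort(x)
    r = np.empty_like(order, dtype=float)
    r[order] = np.arange(len(x))
    return r

spearman = np.corrcoef(ranks(model_scores), ranks(gold_scores))[0, 1]
print(round(pearson, 3), round(spearman, 3))
```

Here the two score lists agree on the ordering of all pairs, so Spearman is 1.0 even though the raw scores differ.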
Troubleshooting and Tips
If you encounter issues when using the model, consider the following troubleshooting tips:
- Ensure you have the correct library versions installed.
- Double-check your tokenization, padding, and truncation settings in the HuggingFace implementation.
- Consult the documentation on sentence-transformers for additional insights.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Leveraging the Bertin-Roberta model opens up exciting possibilities for understanding sentence similarity in Spanish. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

