How to Use the Sentence-BERT Base Italian Model for Sentence Similarity

Mar 24, 2023 | Educational

The Sentence-BERT Base Italian Model maps sentences and paragraphs to a 768-dimensional dense vector space, which makes it well suited to tasks such as clustering similar sentences and semantic search.

Getting Started with Sentence-Transformers

Using this model is a breeze, particularly when you have the sentence-transformers library installed. If you haven’t done that yet, here’s a quick guide:

  • Open your command line interface.
  • Run the following command:

pip install -U sentence-transformers

Once you have the library set up, you can start utilizing the model right away.

Using the Model

Let’s explore how to use the model in two different contexts: with and without the Sentence-Transformers library.

Using Sentence-Transformers

Here’s how you can use the model directly with the Sentence-Transformers library:

from sentence_transformers import SentenceTransformer

# Two semantically similar Italian sentences to embed
sentences = ["Una ragazza si acconcia i capelli.", "Una ragazza si sta spazzolando i capelli."]

# Load the model from the Hugging Face Hub
model = SentenceTransformer('nickprock/sentence-bert-base-italian-uncased')

# One 768-dimensional embedding per sentence
embeddings = model.encode(sentences)
print(embeddings)
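
With the embeddings in hand, you can check how close the two sentences sit in the vector space. Here is a short sketch using the library's util.cos_sim helper; the exact score depends on the model weights, so treat the printed value as illustrative:

from sentence_transformers import util

# Cosine similarity between the two sentence embeddings (closer to 1 = more similar)
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)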

Using HuggingFace Transformers

If you prefer the HuggingFace Transformers library, you can follow the approach below; note that without sentence-transformers you have to apply the pooling step to the token embeddings yourself:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - averages token embeddings, ignoring padding via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for
sentences = ["Una ragazza si acconcia i capelli.", "Una ragazza si sta spazzolando i capelli."]

# Load model from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('nickprock/sentence-bert-base-italian-uncased')
model = AutoModel.from_pretrained('nickprock/sentence-bert-base-italian-uncased')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling - here, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
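
If you are staying in plain PyTorch, the same similarity check is a normalization plus a dot product. This minimal sketch assumes the sentence_embeddings tensor computed above:

import torch.nn.functional as F

# L2-normalize so that the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Pairwise similarity matrix between all input sentences
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)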

Understanding the Pooling Method

To understand what the mean pooling function is doing, consider a simple analogy:

Imagine you are in a classroom with students (tokens) who each have different grades (embeddings). The teacher (the mean pooling function) gathers all the students and averages their grades, using the attendance sheet (the attention mask) to make sure that absent students (padding tokens) don’t skew the result. Only the students who actually participated contribute to the final grade that represents the class’s overall performance (the sentence embedding).
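
To make the analogy concrete, here is a toy run of the mean_pooling function defined above, using made-up numbers rather than real model output:

import torch

# One sentence, three tokens with 1-dimensional "embeddings": 2.0, 4.0, and a padding token
token_embeddings = torch.tensor([[[2.0], [4.0], [0.0]]])
attention_mask = torch.tensor([[1, 1, 0]])  # third position is padding ("absent")

# mean_pooling expects model_output[0] to hold the token embeddings
pooled = mean_pooling((token_embeddings,), attention_mask)
print(pooled)  # tensor([[3.]]) -- (2.0 + 4.0) / 2; the padding value is ignored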

Troubleshooting Tips

If you encounter any issues while using the Sentence-BERT model, here are some troubleshooting ideas:

  • Ensure that the sentence-transformers library is correctly installed.
  • Double-check the formatting of your sentences and make sure they are in the correct language (Italian, in this case).
  • If the model doesn’t seem to be returning the expected embeddings, verify that you’re using the correct model name and that the input sentences are appropriately tokenized.


Model Evaluation

The performance of this model can be gauged with the Sentence Embeddings Benchmark, which provides automated evaluation metrics across standard sentence-similarity tasks.
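
As a hedged illustration of what such an evaluation looks like in code, the sketch below uses the library's EmbeddingSimilarityEvaluator on a few hypothetical Italian sentence pairs with made-up gold scores; the benchmark itself uses its own datasets:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('nickprock/sentence-bert-base-italian-uncased')

# Hypothetical sentence pairs with invented gold similarity scores in [0, 1]
sentences1 = ["Una ragazza si acconcia i capelli.",
              "Un uomo suona la chitarra.",
              "Il gatto dorme sul divano."]
sentences2 = ["Una ragazza si sta spazzolando i capelli.",
              "Un uomo sta mangiando una mela.",
              "Un gatto sta riposando sul divano."]
gold_scores = [0.9, 0.1, 0.85]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # correlation between model similarities and gold scores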

Model Training Overview

The training parameters for the model include the following (a minimal sketch reproducing this setup appears after the list):

  • DataLoader: a DataLoader of length 360 with a batch_size of 16, using random sampling.
  • Loss Function: CosineSimilarityLoss.
  • Optimization: the AdamW optimizer with a learning rate of 2e-05.
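
For orientation, here is a minimal sketch of how a comparable training run can be wired up with sentence-transformers. The batch size, loss, and learning rate match the parameters above; the training pair, epochs, and warmup steps are placeholder assumptions:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('nickprock/sentence-bert-base-italian-uncased')

# Placeholder training pair labeled with a similarity score in [0, 1]
train_examples = [
    InputExample(texts=["Una ragazza si acconcia i capelli.",
                        "Una ragazza si sta spazzolando i capelli."], label=0.9),
]

# batch_size=16 with random sampling, as in the overview above
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,                        # assumption; not stated in the overview
    warmup_steps=100,                # assumption
    optimizer_params={'lr': 2e-05},  # AdamW with lr=2e-05, per the overview
)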

Conclusion

The Sentence-BERT Base Italian Model is an effective way to extract semantic embeddings from textual data. With the guides provided here, you can easily implement it in your own projects and uncover valuable insights from language processing tasks.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
