Creating and Using the DISTIL-ITA-LEGAL-BERT Model for Sentence Similarity

Feb 18, 2023 | Educational

Welcome to our step-by-step guide on leveraging the DISTIL-ITA-LEGAL-BERT model! This blog will walk you through the process of understanding and implementing a powerful tool that uses knowledge distillation to generate sentence embeddings efficiently.

What is DISTIL-ITA-LEGAL-BERT?

DISTIL-ITA-LEGAL-BERT is a lightweight student model distilled from the larger ITALIAN-LEGAL-BERT. Picture it as an apprentice learning the craft from a master: while the master (the teacher model) has depth and complexity, the apprentice (the student model) is quick and nimble, designed to deliver similar results more efficiently. The model maps sentences into a dense 768-dimensional vector space, which makes it particularly useful for tasks such as clustering and semantic search.
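To make the apprenticeship analogy concrete, here is a minimal sketch of one common distillation objective: the student is trained to minimize the mean squared error between its sentence embeddings and the teacher's. The names below are hypothetical, and the actual training recipe for DISTIL-ITA-LEGAL-BERT may differ.

python
import torch
import torch.nn.functional as F

def distillation_loss(student_embeddings: torch.Tensor,
                      teacher_embeddings: torch.Tensor) -> torch.Tensor:
    """MSE between student and teacher sentence embeddings.

    Both tensors have shape (batch_size, 768); the teacher's
    embeddings act as fixed targets, so we detach them from
    the computation graph.
    """
    return F.mse_loss(student_embeddings, teacher_embeddings.detach())

# Hypothetical training step: embed the same batch with both models
# and push the student's vectors toward the teacher's.
# student_emb = student_model(batch)   # (batch_size, 768)
# teacher_emb = teacher_model(batch)   # (batch_size, 768)
# loss = distillation_loss(student_emb, teacher_emb)
# loss.backward()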

Getting Started with the Model

Before diving into implementation, ensure you have the necessary library installed. Here’s how:

pip install -U sentence-transformers

Using the Model with Sentence-Transformers

Once you have installed the sentence-transformers library, using the DISTIL-ITA-LEGAL-BERT model is straightforward. Below is a simple way to encode your sentences.

python
from sentence_transformers import SentenceTransformer

# Example sentences to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the distilled model from the Hugging Face Hub
model = SentenceTransformer('dlicari/distil-ita-legal-bert')

# Each sentence becomes a 768-dimensional vector
embeddings = model.encode(sentences)

print(embeddings)
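Since the model targets clustering and semantic search, you will usually compare these embeddings rather than print them. Here is a short follow-up using the library's util.cos_sim helper (the English example sentences are just illustrative; the model is tuned for Italian legal text):

python
from sentence_transformers import util

# Pairwise cosine similarities between all encoded sentences
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # values close to 1.0 indicate near-identical meaning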

Using the Model with HuggingFace Transformers

If you prefer not to use the sentence-transformers library, you can still use the model with HuggingFace Transformers. In that case you pass your input through the transformer yourself and then apply a pooling operation (here, mean pooling) on top of the contextualized token embeddings:

python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('dlicari/distil-ita-legal-bert')
model = AutoModel.from_pretrained('dlicari/distil-ita-legal-bert')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
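If the checkpoint's sentence-transformers configuration uses plain mean pooling with no normalization (which the code above assumes), both approaches should produce the same vectors up to floating-point noise. A quick sanity check you could run, continuing from the snippet above:

python
# Compare the manual pipeline against sentence-transformers
# (assumes the checkpoint's pooling config is plain mean pooling)
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('dlicari/distil-ita-legal-bert')
st_embeddings = torch.tensor(st_model.encode(sentences))
print(torch.allclose(st_embeddings, sentence_embeddings, atol=1e-5))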

Evaluation and Training of the Model

The model was trained via knowledge distillation: the student learns to reproduce the sentence embeddings of its ITALIAN-LEGAL-BERT teacher, so its quality tracks the teacher's while remaining much faster to run. For an automated evaluation, you can check the Sentence Embeddings Benchmark (https://seb.sbert.net).
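If you want to benchmark the model on your own data, sentence-transformers ships an EmbeddingSimilarityEvaluator that correlates the model's cosine similarities with gold scores. The tiny dataset below is made up purely for illustration, and the exact return format of the evaluator varies by library version:

python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('dlicari/distil-ita-legal-bert')

# Hypothetical STS-style pairs with gold similarity scores in [0, 1]
sentences1 = ["Il contratto è nullo", "La sentenza è definitiva"]
sentences2 = ["Il contratto non è valido", "Il ricorso è stato accolto"]
scores = [0.9, 0.2]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)
print(evaluator(model))  # correlation between model and gold similarities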

Troubleshooting Common Issues

If you encounter issues while using the DISTIL-ITA-LEGAL-BERT model, here are a few troubleshooting tips:

  • Ensure that you have all the library dependencies installed correctly using the provided pip command.
  • Double-check that you are using the correct model name while loading the SentenceTransformer.
  • Verify that the sentences you are passing are properly formatted as a list.
  • If you run into performance issues, batch-process larger sets of sentences instead of encoding them one at a time (see the sketch after this list).
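For the batching tip above, model.encode exposes the relevant knobs directly. A short sketch; the batch size is something to tune for your hardware, and the input list here is a placeholder:

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dlicari/distil-ita-legal-bert')

# Placeholder corpus; in practice, your own list of sentences
many_sentences = ["Questa è una frase di esempio."] * 1_000

# Encode many sentences at once; tune batch_size for your GPU/CPU memory
embeddings = model.encode(
    many_sentences,
    batch_size=64,
    show_progress_bar=True,
    convert_to_tensor=True,
)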

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
