How to Use the Tomaarsen MPNet Base Model for Sentence Similarity

Feb 24, 2024 | Educational

Welcome to this guide on using the tomaarsen/mpnet-base-nli-matryoshka model for sentence similarity tasks! This model, built for the sentence-transformers library, maps sentences to a 768-dimensional dense vector space, which makes it well suited to clustering and semantic search. Because it was trained with Matryoshka Representation Learning, its embeddings can also be truncated to fewer dimensions with only a small loss in quality.

What Do You Need?

  • Basic knowledge of Python programming.
  • Python environment with the required libraries installed.
  • Access to the internet for downloading model files.

Installation

Before diving in, you need to install the sentence-transformers package. Open your terminal and run:

pip install -U sentence-transformers

Using Sentence-Transformers

Now that the package is installed, you can encode sentences with the model. Here’s how:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence.", "Each sentence is converted."]
model = SentenceTransformer('tomaarsen/mpnet-base-nli-matryoshka')
embeddings = model.encode(sentences)
print(embeddings)
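
Since the end goal is sentence similarity, you will typically want to compare these embeddings. Here is a minimal sketch using the cos_sim helper that ships with sentence-transformers; the 256-dimension truncation at the end illustrates the Matryoshka property, and the cutoff of 256 is an arbitrary choice for this example:

from sentence_transformers import util

# Cosine similarity between the two embeddings computed above
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity (768 dims): {similarity.item():.4f}")

# Matryoshka-trained models tolerate truncation: keep only the first
# 256 dimensions (an arbitrary illustrative cutoff) and compare again.
truncated = util.cos_sim(embeddings[0][:256], embeddings[1][:256])
print(f"Cosine similarity (256 dims): {truncated.item():.4f}")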

Using HuggingFace Transformers

If you prefer to use HuggingFace transformers directly, follow these steps:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["This is an example sentence.", "Each sentence is converted."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('tomaarsen/mpnet-base-nli-matryoshka')
model = AutoModel.from_pretrained('tomaarsen/mpnet-base-nli-matryoshka')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
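
Just as before, you can compare the pooled embeddings directly. A short follow-up sketch in plain PyTorch, continuing from the sentence_embeddings tensor above:

import torch.nn.functional as F

# L2-normalize the embeddings so that the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized[0] @ normalized[1]
print(f"Cosine similarity: {similarity.item():.4f}")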

Understanding the Code: An Analogy

Think of embeddings as a treasure map that directs you to various treasures (meanings) hidden in a vast ocean (language). The Tomaarsen MPNet model acts as a skilled navigator, guiding you through the intricacies of language. Each time you input a sentence, it maps it to a unique point on the treasure map, so you can spot similar sentences the way two maps that mark the same island lead to the same treasure!

Troubleshooting

If you encounter errors during installation or execution, consider the following tips:

  • Ensure you have a recent version of Python (3.8 or later, which current releases of sentence-transformers require).
  • Double-check that you have correctly installed the required packages using pip (a quick version check is shown below).
  • Review your code for any syntax errors, especially during variable declarations.
  • Check your internet connection, as models need to be downloaded from the HuggingFace Hub.
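
A quick sanity check for the first two points is to import the key packages and print their versions; if any import fails, the corresponding install step went wrong:

import sentence_transformers
import torch
import transformers

# All three should import cleanly and report a version string
print("sentence-transformers:", sentence_transformers.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)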

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Model Evaluation

This model has been evaluated on the *Sentence Embeddings Benchmark*. More detailed evaluation results are available on the model’s Hugging Face Hub page.
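
If you want to run a similar evaluation yourself, sentence-transformers provides an EmbeddingSimilarityEvaluator. The sketch below uses two made-up sentence pairs with invented gold scores purely for illustration; a real evaluation would use a benchmark dataset such as STS:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Hypothetical toy data: sentence pairs with gold similarity scores in [0, 1]
sentences1 = ["A man is eating food.", "A plane is taking off."]
sentences2 = ["A man is eating a meal.", "A bird is flying."]
gold_scores = [0.9, 0.1]

model = SentenceTransformer('tomaarsen/mpnet-base-nli-matryoshka')
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)

# In sentence-transformers 2.x this returns the Spearman correlation
# between the model's similarities and the gold scores
print(evaluator(model))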

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
