Unlocking the Power of Sentence Similarity with Sentence-Transformers

Nov 25, 2022 | Educational

In this guide, we will walk through how to use a sentence-transformers model to analyze sentences and paragraphs. By mapping them to a 768-dimensional dense vector space, the model enables tasks such as clustering and semantic search. Let’s break down the steps needed to harness this tool.

Getting Started: Installing Sentence-Transformers

The first step to using the Sentence-Transformers model is to install the required library. Open your terminal and run the following command:

pip install -U sentence-transformers

Using Sentence-Transformers for Semantic Similarity

Once you have the library installed, working with the model becomes incredibly straightforward. Here’s how you can encode sentences for similarity comparison:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer(MODEL_NAME)
embeddings = model.encode(sentences)
print(embeddings)

In the code above:

  • You import the SentenceTransformer class.
  • You create a list of sentences to compare.
  • You instantiate the model with MODEL_NAME (replace this with the name of your chosen model).
  • Finally, you encode the sentences, which outputs their embeddings.
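To turn these embeddings into a similarity score, the standard choice is cosine similarity (sentence-transformers also ships a helper, util.cos_sim). Below is a minimal NumPy sketch; cosine_similarity here is our own illustrative helper, not part of the library:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two vectors divided by the
    # product of their L2 norms; the result ranges from -1 to 1.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With real model output, you would compare the two example sentences:
# score = cosine_similarity(embeddings[0], embeddings[1])
```

A score near 1 means the sentences point in almost the same direction in embedding space; a score near 0 means they are unrelated.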

Using HuggingFace Transformers for More Flexibility

If you prefer to use the standard HuggingFace Transformers, here is a different approach you can take:

from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # The first element of model_output holds the token embeddings.
    token_embeddings = model_output[0]
    # Expand the attention mask so it can weight every embedding dimension.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the masked embeddings and divide by the number of real (non-padding) tokens.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["This is an example sentence", "Each sentence is converted"]
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:", sentence_embeddings)

In this variation:

  • You apply mean pooling over the token embeddings, weighted by the attention mask.
  • Padding tokens are masked out, so only the real tokens of each sentence contribute to its embedding.
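A common follow-up step, used by many sentence-transformers model cards, is to L2-normalize the pooled embeddings so that a plain dot product between two embeddings equals their cosine similarity. A small sketch with toy 2-D vectors standing in for the 768-dimensional output:

```python
import torch
import torch.nn.functional as F

# Toy embeddings standing in for mean-pooled sentence embeddings.
sentence_embeddings = torch.tensor([[3.0, 4.0], [0.0, 2.0]])

# L2-normalize each row; afterwards every embedding has unit length,
# so dot products between embeddings equal cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized.norm(dim=1))  # each norm is 1.0
```

With unit-length embeddings, `normalized @ normalized.T` gives you the full pairwise similarity matrix in one matrix multiply.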

Evaluation of the Model

To evaluate the effectiveness of your model, you can refer to the Sentence Embeddings Benchmark. This resource provides fine-grained insights into how well your model performs in various setups.

Training Your Model

For those interested in training their model, here are the specifications:

  • DataLoader: A torch.utils.data.dataloader.DataLoader of length 1800.
  • Batch Size: Set to 4.
  • Loss Function: sentence_transformers.losses.CosineSimilarityLoss.

Parameters for the fit() method can include:

  • Epochs: 1
  • Learning Rate: 2e-05
  • Weight Decay: 0.01

Full Model Architecture

The architecture of your Sentence-Transformer can be broken down into two main components:

  • Transformer: Utilizing the MPNetModel.
  • Pooling: Configured to average the embeddings, ensuring you receive meaningful sentence representations.

Troubleshooting

Here are some common issues and their solutions:

  • Model not found: Ensure MODEL_NAME is correctly set to a valid model available in the HuggingFace Model Hub or sentence-transformers repository.
  • Installation issues: Double-check your Python environment. A virtual environment may help to avoid conflicts with other libraries.
  • Out of memory errors: If you are using a GPU, try reducing the batch size or leveraging gradient accumulation.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
