How to Implement Sentence Similarity with SentenceTransformer and Transformers

Apr 6, 2024 | Educational

In this blog post, we'll walk you through computing sentence embeddings for similarity in two ways: with the SentenceTransformer class from the sentence-transformers library, and directly with Hugging Face's Transformers. Both approaches turn a sentence into a vector that captures its contextual meaning, so you can judge how similar two sentences are by how close their vectors are.

Preparing Your Environment

Before diving into the code, ensure you have the necessary packages installed. You can do this by running the following command:

pip install transformers sentence-transformers torch scikit-learn
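
To confirm the install worked, a small check like the one below can list the version of each package. This is a minimal sketch that uses only the standard library; the helper name installed_versions is ours for illustration and is not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip in the command above.
REQUIRED = ["transformers", "sentence-transformers", "torch", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED should be installed before running the rest of the code.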

Understanding the Code Structure

Let’s break down the code into two main segments, similar to how a chef organizes a recipe into preparation and cooking phases.

1. Using SentenceTransformer

In this phase, we encode sentences into vectors using the SentenceTransformer. Imagine you are at a library and each book (sentence) is transformed into its essence (vector), making it easier for you to compare them.

import os
import torch
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

texts = ["hello world"]
model_dir = "MODEL_PATH"

# SentenceTransformer part
model = SentenceTransformer(model_dir)
vectors = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
print(vectors.shape)
print(vectors[:, :4])

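The model encodes texts into numerical vectors: vectors.shape is (number of texts, embedding dimension), and because normalize_embeddings=True is set, each vector has unit length. In our analogy, you now have a summary of every book at hand, enabling you to see which ones talk about the same themes.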
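
The code above stops at the vectors, so here is a hedged sketch of the comparison step itself: cosine similarity between every pair of rows. It uses toy 3-dimensional vectors in place of real model output, and the function name cosine_similarity_matrix is our own.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors (rows are L2-normalized first)."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T


# Toy "embeddings" standing in for the output of model.encode(...).
toy = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0
    [0.0, 3.0, 0.0],  # orthogonal to row 0
])
sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

Entry [i, j] is the similarity between sentence i and sentence j; values near 1 mean the vectors point the same way. When the vectors are already unit length, as with normalize_embeddings=True, vectors @ vectors.T gives the same matrix directly.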

2. Leveraging Transformers

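Next, we build the embeddings by hand with Transformers: tokenize the texts, run the base model, average the token vectors, then apply a linear layer and normalize. Here, you are deepening your understanding of those books by seeing how each summary is actually put together.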

vector_dim = 4096  # output size of the dense projection layer
model = AutoModel.from_pretrained(model_dir).eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Load the dense projection weights stored with the model.
# Assumption: they live at 2_Dense_{vector_dim}/pytorch_model.bin inside model_dir;
# adjust the path if your model stores them elsewhere.
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
dense_path = os.path.join(model_dir, f"2_Dense_{vector_dim}", "pytorch_model.bin")
vector_linear_dict = {k.replace("linear.", ""): v for k, v in torch.load(dense_path, map_location="cpu").items()}
vector_linear.load_state_dict(vector_linear_dict)

with torch.no_grad():
    input_data = tokenizer(texts, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    # Zero out padded positions, then average over the real tokens only.
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    # Project to vector_dim dimensions and L2-normalize each row.
    vectors = normalize(vector_linear(vectors).cpu().numpy())

print(vectors.shape)
print(vectors[:, :4])

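Here, you're writing the summary of the book yourself: padded tokens are masked out, the remaining token vectors are averaged, and the linear layer projects the result to vector_dim dimensions before normalization. If the model's SentenceTransformer pipeline uses the same pooling and dense layer, these vectors should closely match the ones from the first part, and either set gives you a clear representation to compare and assess sentence similarity.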
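
The masked averaging step is the least obvious part, so here is a small sketch of it on a toy batch. It uses NumPy rather than torch to stay lightweight and mirrors the masked_fill, sum and divide operations; the function name masked_mean_pool and the toy numbers are ours.

```python
import numpy as np


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden_state: shape (batch, seq_len, hidden)
    attention_mask:    shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    hidden = np.asarray(last_hidden_state, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)  # padded positions contribute 0
    counts = mask.sum(axis=1)             # number of real tokens per sequence
    return summed / counts


# Toy batch: 2 sequences, 3 token slots, hidden size 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded slot holding 100.0 has no effect on the first row's average, which is why the mask is applied before summing rather than simply averaging over the whole sequence length.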

Troubleshooting Tips

If you encounter any issues while implementing this, consider the following troubleshooting ideas:

Ensure that MODEL_PATH is correctly set to the directory where your model is stored. This should contain the relevant model files.
Check that you have the necessary libraries installed. Use the pip command given above and ensure all packages are compatible with your Python version.
If the input texts are not being encoded correctly, verify the format of your input data. Make sure your sentences are in the form of a list of strings.
If the output shape does not match your expectations, revisit vector_dim in the code: it sets the second dimension of the output and must match the size of the dense layer weights being loaded.
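
The last two checks can be automated. The sketch below is one possible way to do it; the helper names check_texts and check_vectors are hypothetical, and the zero array only stands in for real model output.

```python
import numpy as np


def check_texts(texts):
    """Raise TypeError unless texts is a non-empty list of strings."""
    if not isinstance(texts, list) or not texts:
        raise TypeError("texts must be a non-empty list of strings")
    bad = [i for i, t in enumerate(texts) if not isinstance(t, str)]
    if bad:
        raise TypeError(f"non-string items at positions {bad}")


def check_vectors(vectors, n_texts, expected_dim):
    """Raise ValueError unless vectors has shape (n_texts, expected_dim)."""
    shape = np.asarray(vectors).shape
    if shape != (n_texts, expected_dim):
        raise ValueError(f"expected shape {(n_texts, expected_dim)}, got {shape}")


texts = ["hello world"]
check_texts(texts)
# Placeholder array standing in for the vectors produced by either approach.
check_vectors(np.zeros((1, 4096)), n_texts=len(texts), expected_dim=4096)
```

Passing a bare string such as "hello world" instead of a list is a common slip, and check_texts rejects it before it reaches the model.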

Conclusion

We have walked through how to compute sentence embeddings for similarity using both the SentenceTransformer and Transformers libraries. With these steps, you can turn sentences into vectors and compare what they mean.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
