The world of natural language processing (NLP) is rich with exciting possibilities, and one of the most intriguing areas is sentence similarity. By using sentence embeddings, we can transform sentences into vectors, enabling us to measure how similar they are. In this guide, we will explore how to use a sentence-transformers model for semantic search and clustering.
Understanding the Sentence-Transformers Model
Imagine you’re at a party with many guests, each person representing a sentence. The sentence-transformers model acts like a host who helps you pair your guests (sentences) who share similar interests (content). This model maps sentences and paragraphs into a dense vector space of 768 dimensions, enabling effective clustering and semantic search operations.
Getting Started with Sentence-Transformers
To begin your journey, you will need to install the sentence-transformers library. You can easily do this by running:
pip install -U sentence-transformers
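To confirm the installation succeeded, you can print the library version from Python (a quick sanity check; the exact version printed depends on your environment):

import sentence_transformers
print(sentence_transformers.__version__)  # any recent release works for the examples below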
Using the Model
Let’s see how to utilize the model in Python with an example:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace MODEL_NAME with the model identifier from the model card
model = SentenceTransformer(MODEL_NAME)
embeddings = model.encode(sentences)
print(embeddings)
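Once you have embeddings, you can compare them directly. The sketch below uses sentence_transformers.util.cos_sim to score a query against candidate sentences; the sentences shown are illustrative examples, not taken from the model card:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(MODEL_NAME)  # same placeholder as above

query = "How do I bake bread?"
candidates = ["Steps for baking a loaf of bread", "The weather is sunny today"]

# Encode the query and candidates into dense vectors
query_embedding = model.encode(query, convert_to_tensor=True)
candidate_embeddings = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity: higher scores indicate closer meaning
scores = util.cos_sim(query_embedding, candidate_embeddings)
print(scores)  # the bread-related candidate should score higher

The same pattern scales to semantic search over larger corpora: encode the corpus once, then score each incoming query against the stored embeddings.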
Alternatives with HuggingFace Transformers
If you want to use the model without the sentence-transformers library, you can also use HuggingFace Transformers directly. In that case you need to apply mean pooling over the token embeddings yourself, as shown below:
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Input sentences
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub (replace MODEL_NAME with the model identifier)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling to obtain sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
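Depending on your use case, you may also want to L2-normalize the embeddings so that dot products equal cosine similarities. A minimal sketch that builds on the snippet above (this step is an optional addition, not part of the original example):

import torch.nn.functional as F

# Normalize embeddings so that dot product == cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences
similarity = sentence_embeddings[0] @ sentence_embeddings[1]
print(similarity)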
Evaluating the Model
For an automated evaluation of this model, visit the Sentence Embeddings Benchmark.
Training the Model
Understanding the training parameters can be crucial for optimizing your model’s performance. The key parameters reported for this model are listed below; a hedged training sketch follows the list.
- DataLoader: A torch DataLoader with 230 elements.
- Batch Size: 16.
- Loss Function: CosineSimilarityLoss.
- Learning Rate: 2e-05.
- Epochs: 1.
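The sketch below shows how these parameters map onto the classic sentence-transformers fit API. The training pairs and their similarity labels are hypothetical placeholders; only the batch size, loss, learning rate, and epoch count come from the parameters above:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer(MODEL_NAME)  # placeholder, as above

# Hypothetical labeled pairs: (sentence_a, sentence_b) with a similarity score in [0, 1]
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "A plane is taking off."], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Mirrors the reported setup: CosineSimilarityLoss, 1 epoch, learning rate 2e-5
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 2e-5},
)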
Full Model Architecture
The underlying architecture of the model is structured as follows:
SentenceTransformer(
(0): Transformer(max_seq_length: 512, do_lower_case: False) with Transformer model: MPNetModel
(1): Pooling(word_embedding_dimension: 768, pooling_mode_cls_token: False, pooling_mode_mean_tokens: True, pooling_mode_max_tokens: False, pooling_mode_mean_sqrt_len_tokens: False)
)
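If you want to assemble an equivalent architecture yourself, the sentence-transformers modules API can reproduce it. This is a sketch assuming an MPNet base checkpoint such as microsoft/mpnet-base; the actual base checkpoint used for this model is not stated above:

from sentence_transformers import SentenceTransformer, models

# Transformer module: MPNet backbone, max sequence length 512
word_embedding_model = models.Transformer("microsoft/mpnet-base", max_seq_length=512)

# Pooling module: mean pooling over the 768-dimensional token embeddings
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])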
Troubleshooting
If you encounter any issues during installation or execution, consider the following troubleshooting tips:
- Ensure you have the required libraries installed properly.
- Check for typos in the model name or input sentences.
- If using a GPU, ensure that CUDA is configured correctly; a quick check is shown below.
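A short snippet for verifying that PyTorch can see your GPU:

import torch

# Quick sanity check for GPU availability
print(torch.cuda.is_available())           # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the first GPU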
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

