A Guide to Leveraging Sentence Similarity with Sentence Transformers

Mar 3, 2023 | Educational

In our rapidly evolving world of artificial intelligence, knowing how to evaluate and cluster textual data effectively has become essential. This is where Sentence Transformers come into play. In this blog, we’ll delve into how to use these powerful models to extract meaningful sentence embeddings and gain insights through semantic search and clustering.

Understanding Sentence Transformers

A Sentence Transformers model maps sentences and paragraphs into a 768-dimensional dense vector space. Think of it as a highly specialized translator, converting natural language into a numerical format that computers can compare and manipulate. This enables us to perform tasks such as clustering similar phrases or conducting semantic searches with remarkable accuracy.
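For instance, clustering similar phrases reduces to clustering their vectors. Here is a minimal sketch using scikit-learn’s KMeans, with toy 2-dimensional vectors standing in for real 768-dimensional embeddings (the vectors and cluster count are illustrative assumptions, not output from an actual model):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D vectors standing in for 768-dimensional sentence embeddings
embeddings = np.array([
    [0.90, 0.10], [0.85, 0.15],   # two "sentences" about one topic
    [0.10, 0.90], [0.05, 0.95],   # two "sentences" about another topic
])

# Group the vectors into two clusters; semantically similar sentences
# end up with the same cluster label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```

With real embeddings you would simply pass the output of `model.encode(...)` in place of the toy array.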

How to Use Sentence Transformers

Getting started with Sentence Transformers requires you to have the library installed:

pip install -U sentence-transformers

Once installed, here’s how to use the model for sentence embedding:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# MODEL_NAME is the Hugging Face Hub identifier of the model you want to load
model = SentenceTransformer(MODEL_NAME)
embeddings = model.encode(sentences)  # shape: (len(sentences), 768)
print(embeddings)
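Once you have embeddings, semantic similarity between two sentences is typically measured as the cosine similarity of their vectors. The following self-contained sketch uses NumPy with toy 4-dimensional vectors in place of the model’s 768-dimensional output (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-D vectors standing in for real sentence embeddings
emb_a = np.array([0.1, 0.3, 0.2, 0.4])
emb_b = np.array([0.1, 0.3, 0.2, 0.4])
emb_c = np.array([0.4, -0.3, 0.2, -0.1])

print(cosine_similarity(emb_a, emb_b))  # identical vectors -> 1.0
print(cosine_similarity(emb_a, emb_c))  # dissimilar vectors -> negative score
```

In practice you would compute the same quantity between rows of `model.encode(sentences)`.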

Using HuggingFace Transformers without Sentence-Transformers

If you prefer using the HuggingFace Transformers library, here’s how you can achieve similar results:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - average token embeddings, ignoring padded positions via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from the Hugging Face Hub (MODEL_NAME is the model's Hub identifier)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:", sentence_embeddings)
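The key property of mean pooling is that padded positions contribute nothing to the average. You can verify this with a toy tensor, no model or tokenizer required (the numbers below are made up purely to make the arithmetic obvious):

```python
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# One "sentence" of three tokens with 2-D embeddings; the last token is padding
token_embeddings = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling((token_embeddings,), attention_mask)
print(pooled)  # tensor([[2., 3.]]) - the padded [100, 100] token is ignored
```

Averaging naively over all three tokens would give a wildly different result, which is exactly why the attention mask matters.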

Evaluating Model Performance

To evaluate the effectiveness of your model, you can use the Sentence Embeddings Benchmark, which provides an automated way to assess the performance of sentence-embedding models.

Model Training Overview

The model was trained with a custom DataLoader and a ranking-based loss function. Key parameters include:

  • DataLoader: __main__.PubmedLowMemoryLoader with length 26041, batch_size: 128
  • Loss: MultipleNegativesRankingLoss with scale 20.0
  • Training Parameters:
    • epochs: 1
    • evaluation_steps: 2000
    • optimizer: AdamW, learning rate: 2e-05
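Assuming the standard sentence-transformers training API, the parameters above correspond roughly to the setup sketched below. Note this is a reconstruction, not the authors’ script: `PubmedLowMemoryLoader` is a custom loader not shown in this post, so a generic `DataLoader` over `InputExample` pairs stands in for it, and the example sentence pairs are invented for illustration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# MODEL_NAME is a placeholder for the model's Hub identifier, as above
model = SentenceTransformer(MODEL_NAME)

# The original run streamed (anchor, positive) pairs via a custom
# PubmedLowMemoryLoader; generic InputExample pairs play the same role here
train_examples = [
    InputExample(texts=["aspirin reduces fever", "antipyretic effect of aspirin"]),
    InputExample(texts=["insulin lowers blood glucose", "glycemic control with insulin"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)

# MultipleNegativesRankingLoss treats other in-batch positives as negatives
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=2000,
    optimizer_params={"lr": 2e-5},  # AdamW is the library's default optimizer
)
```

A design note: MultipleNegativesRankingLoss benefits from large batches, since every other example in the batch serves as a negative — consistent with the batch size of 128 listed above.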

Full Model Architecture

The architecture of the SentenceTransformer incorporates a Transformer model and a pooling mechanism, specifically designed to process sequences of textual data efficiently.

SentenceTransformer(
  (0): Transformer(max_seq_length: 128, do_lower_case: False) with Transformer model: BertModel
  (1): Pooling(word_embedding_dimension: 768, pooling_mode_cls_token: False, pooling_mode_mean_tokens: True)
)
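This two-module stack can also be assembled explicitly from the library’s building blocks. The BERT checkpoint name below is a stand-in, since the post does not name the exact base model; any BERT-style encoder with 768-dimensional hidden states matches the printed architecture.

```python
from sentence_transformers import SentenceTransformer, models

# Stand-in base checkpoint (assumption; the actual base model is not named here)
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=128)

pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768 for BERT-base
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Mean-token pooling (rather than the CLS token) matches the `pooling_mode_mean_tokens: True` setting shown in the architecture printout.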

Troubleshooting Tips

If you encounter issues while setting up or using the model, consider the following troubleshooting ideas:

  • Ensure all required libraries are installed and updated.
  • Check your Python version; compatibility can be crucial.
  • If you face memory issues, try reducing the batch size.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
