How to Use the Sentence-Transformers Library for Sentence Embeddings

Mar 31, 2024 | Educational

In recent years, the need for effective natural language processing tools has surged, paving the way for powerful libraries like Sentence Transformers. This post is a guide to using the sentence-transformers library, with some troubleshooting tips along the way.

What is Sentence Transformers?

The sentence-transformers library maps sentences and paragraphs to dense vector embeddings, which are useful for tasks such as clustering and semantic search. This guide works with nli-distilbert-base-max-pooling, a deprecated model that produces low-quality embeddings; we use it purely for illustration. For real projects, pick one of the recommended models listed on SBERT.net.

Setting Up Your Environment

Before diving into coding, ensure that you have the sentence-transformers library installed:

pip install -U sentence-transformers

Using Sentence Transformers

Once installed, you can use the library as shown below. Think of the model as a magical book that translates sentences into numerical representations, which you can then use for downstream operations such as comparison and search.

Example Code Using Sentence Transformers

from sentence_transformers import SentenceTransformer

# Sentences we want to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model; it is downloaded from the HuggingFace Hub on first use
model = SentenceTransformer('sentence-transformers/nli-distilbert-base-max-pooling')

# Compute one embedding vector per sentence
embeddings = model.encode(sentences)
print(embeddings)

Just as each page in the book holds a unique numerical interpretation of a sentence, each output vector represents the semantic meaning of the corresponding input sentence.
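Since each embedding is just a vector, you can compare two sentences by measuring how closely their vectors align. Below is a minimal sketch using the library's util.cos_sim helper to score the similarity of the two example sentences (expect a mediocre score here, given the deprecated model):

from sentence_transformers import SentenceTransformer, util

sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/nli-distilbert-base-max-pooling')

# Encode to tensors so we can pass them to util.cos_sim directly
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence vectors (range -1 to 1)
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.4f}")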

Using HuggingFace Transformers

If you prefer to work with HuggingFace's framework directly, here's how to compute the same sentence embeddings without the sentence-transformers library:

from transformers import AutoTokenizer, AutoModel
import torch

# Function to perform max pooling
def max_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    token_embeddings[input_mask_expanded == 0] = -1e9  # Mask padding tokens with a large negative value so they never win the max
    return torch.max(token_embeddings, 1)[0]

# Sentences for which we want embeddings
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/nli-distilbert-base-max-pooling')
model = AutoModel.from_pretrained('sentence-transformers/nli-distilbert-base-max-pooling')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling using max pooling
sentence_embeddings = max_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
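As a sanity check, you can verify that this manual pipeline matches what sentence-transformers produces directly. The sketch below continues from the code above and assumes the model's hosted configuration applies only max pooling with no extra normalization (we pin both runs to the CPU so the tensors are comparable):

from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('sentence-transformers/nli-distilbert-base-max-pooling', device='cpu')
st_embeddings = st_model.encode(sentences, convert_to_tensor=True)

# The two pipelines should agree up to floating-point noise
print(torch.allclose(sentence_embeddings, st_embeddings, atol=1e-4))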

Troubleshooting Tips

As you navigate through this process, you might encounter challenges. Here are some tips to help you out:

  • Ensure you have the latest version of the sentence-transformers library by re-running the install command above (pip install -U sentence-transformers).
  • If you run into errors when loading the model, check your internet connection, since the model is downloaded from the HuggingFace Hub on first use.
  • For issues related to TensorFlow or PyTorch, make sure the installed version matches the library's requirements (see the version-check snippet below).
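If you are unsure which versions you have installed, a quick diagnostic snippet like the following can help (a minimal sketch; it only prints version information):

import torch
import sentence_transformers

# Installed versions, to compare against the library's requirements
print("sentence-transformers:", sentence_transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())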

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

In summary, the sentence-transformers library is a powerful tool for generating sentence embeddings, but it’s essential to choose the right model for quality output. Happy coding!
