Welcome! In this guide, we’ll explore the Sentence-Transformers library, focusing on how to use the deprecated nli-distilbert-base model for sentence embeddings. While this specific model is no longer recommended because it produces low-quality embeddings, it remains a good entry point for understanding how sentence-transformers can streamline your text processing tasks.
What is Sentence-Transformers?
The Sentence-Transformers library allows you to map sentences and paragraphs into a 768-dimensional dense vector space. This is invaluable for tasks including clustering, semantic search, and sentence similarity.
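To make that concrete, here is a minimal sketch (using the model covered in this guide; any Sentence-Transformers model works the same way) showing that each sentence comes back as a fixed-length vector:
from sentence_transformers import SentenceTransformer
# Load the model discussed in this guide (shown purely for illustration)
model = SentenceTransformer('sentence-transformers/nli-distilbert-base')
embedding = model.encode("Sentence embeddings turn text into vectors.")
print(embedding.shape)  # (768,) for this model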
Installation
Let’s get started by ensuring you have the Sentence-Transformers package installed. You can do this using pip:
pip install -U sentence-transformers
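If you want a quick, optional sanity check that the install succeeded, you can print the installed version:
python -c "import sentence_transformers; print(sentence_transformers.__version__)"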
Usage
Here, you’ll learn how to generate sentence embeddings using both the Sentence-Transformers library and HuggingFace Transformers.
Using Sentence-Transformers
To use the Sentence-Transformers model, you can run the following Python code:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence.", "Each sentence is converted."]
model = SentenceTransformer('sentence-transformers/nli-distilbert-base')
embeddings = model.encode(sentences)
print(embeddings)
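As a follow-up, here is a minimal sketch of how these embeddings can be compared for semantic similarity using the library's util.cos_sim helper (the exact scores depend on the model you load):
from sentence_transformers import util
# Continuing from the snippet above, where `embeddings` holds both sentence vectors
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 2x2 matrix; values near 1.0 indicate similar meaning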
Using HuggingFace Transformers
If you prefer not to use Sentence-Transformers, here’s how you can accomplish the same with HuggingFace:
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want embeddings for
sentences = ["This is an example sentence.", "Each sentence is converted."]
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/nli-distilbert-base')
model = AutoModel.from_pretrained('sentence-transformers/nli-distilbert-base')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
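Optionally, and depending on your downstream use (this is an assumption on my part, not a step required by this model), you can L2-normalize the pooled embeddings so that dot products become cosine similarities:
import torch.nn.functional as F
# Continuing from the snippet above: normalize each embedding to unit length
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)  # (2, 768): one 768-dimensional vector per input sentence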
Understanding the Code Through an Analogy
Think of the process of generating embeddings like preparing a perfect dish using several ingredients.
- Ingredients (Sentences): These are the sentences you wish to transform. Just as you gather the best ingredients for your dish, you input high-quality sentences.
- Preparation (Tokenization): This involves breaking sentences down into tokens (words and subwords), much like chopping vegetables before cooking. Tokenization ensures that each part of your input is ready for processing (see the short sketch after this list for what these tokens actually look like).
- Cooking (Encoding): After preparation, you load the model (like putting a pot on the stove) and apply the cooking technique (the encoder), producing the raw token-level outputs, one embedding per token.
- Final Touch (Pooling): Last but not least, you choose the best way to present the dish (pooling technique), ensuring every element is neatly arranged for maximum flavor (meaning and context).
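To see the “chopping” step from the analogy concretely, here is a small sketch of how the tokenizer splits a sentence into subword pieces (the exact tokens printed depend on the tokenizer’s vocabulary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/nli-distilbert-base')
# Inspect the subword tokens the model will actually see
print(tokenizer.tokenize("Each sentence is converted."))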
Evaluation Results
For further insights and automated evaluation of this model’s performance, check out the Sentence Embeddings Benchmark.
Troubleshooting
Having issues with the model? Here are some common troubleshooting ideas:
- Model Deprecated: Remember that this model is deprecated. Seek alternatives from SBERT.net – Pretrained Models (a minimal example of swapping in an alternative follows this list).
- Installation Problems: Ensure that you’re using an updated version of Python and the sentence-transformers library. You may need to reinstall with the pip command given above.
- Performance Issues: Check that your input sentences are properly formatted and look for any unexpected characters that might affect encoding.
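Swapping in a newer model from the SBERT.net list is usually a one-line change. The sketch below uses all-MiniLM-L6-v2 as an illustrative choice; check SBERT.net for the currently recommended models, since the list evolves:
from sentence_transformers import SentenceTransformer
# Any current model from SBERT.net can be substituted here
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(["This is an example sentence.", "Each sentence is converted."])
print(embeddings.shape)  # (2, 384) for this particular model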
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
