How to Utilize Sentence-Transformers for Semantic Search

Mar 30, 2024 | Educational

In Natural Language Processing (NLP), accurately capturing the meaning of sentences is essential. The Sentence-Transformers library provides models that map sentences and paragraphs to dense vectors; the model used here, sentence-transformers/paraphrase-MiniLM-L12-v2, produces 384-dimensional vectors, which makes tasks like clustering and semantic search much simpler. This blog post shows how to use that model for these tasks and provides troubleshooting tips along the way.

Getting Started with Sentence-Transformers

To dive into using the sentence-transformers library, the first step is installation. Open your command line interface and run the following command:

pip install -U sentence-transformers

Once you have the library installed, you can start working with your sentences. The steps below will show you how to extract meaningful sentence embeddings.

Using Sentence-Transformers for Sentence Embeddings

Here’s how to utilize the sentence-transformers/paraphrase-MiniLM-L12-v2 model:


from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L12-v2')
embeddings = model.encode(sentences)
print(embeddings)

The snippet above loads the model, encodes the two sentences, and prints the resulting embeddings: one 384-dimensional vector per sentence. Think of this process as translating text into a numerical language in which sentences with similar meanings end up close to each other, so that machines can compare them.
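
Because the embeddings are plain numeric vectors, semantic search comes down to ranking documents by how close their vectors are to the query vector. The following is a minimal sketch of that ranking step using cosine similarity. It uses made-up 3-dimensional vectors in place of real 384-dimensional embeddings so that it runs without downloading the model; the function names cosine_similarity and rank_by_similarity and all of the values are illustrative assumptions, not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (hypothetical values, NOT real model output).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs)
for index, score in ranking:
    print(f"document {index}: similarity {score:.3f}")
```

With real data, the query and document vectors would come from model.encode instead of being typed in by hand. For large collections, the similarity utilities that ship with the library are likely a better fit than this pure-Python loop.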

Using HuggingFace Transformers without Sentence-Transformers

If you prefer to use HuggingFace Transformers directly without the sentence-transformers library, follow these steps. Here’s how you can compute sentence embeddings using the same model:


from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-MiniLM-L12-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-MiniLM-L12-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

This method tokenizes the sentences, runs them through the model, and then applies mean pooling: the token embeddings of each sentence are averaged into a single fixed-size vector, and the attention mask ensures that padding tokens do not skew the average. It is akin to sorting through a library where each book is carefully examined and its essence is captured in a short summary.
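
To see why the attention mask matters, here is a small worked example of masked mean pooling written in plain Python. It is only a sketch of the arithmetic performed by mean_pooling above, applied to a single sentence; the function name masked_mean_pool and the toy numbers are assumptions made for illustration.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j in range(dim):
                summed[j] += vec[j]
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    # when every position is masked out.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

# One sentence with two real tokens and one padding token (toy values).
token_embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [100.0, 100.0],  # padding position; must not affect the result
]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_embeddings, attention_mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding row would be included in the average and shift the result to roughly [34.7, 35.3], which is why the tokenizer's attention mask is passed into the pooling step.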

Evaluation Results

If you’re interested in evaluating the performance of this model, consider checking the Sentence Embeddings Benchmark for automated evaluations.

Troubleshooting

If you run into any issues while using the sentence-transformers library, here are a few ideas to troubleshoot:

  • Ensure you have the latest version of the sentence-transformers library installed.
  • Check that your Python environment has the required packages installed, such as torch or transformers.
  • Refer to the official documentation for any alterations in the API or usage.
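
The first two checks can be scripted. The sketch below uses only the Python standard library to report which versions of the relevant packages are installed in the current environment; the helper name installed_versions is an assumption made for this example.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(["sentence-transformers", "torch", "transformers"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed is a likely cause of import errors, and the printed version numbers are useful to include when searching the documentation or reporting a problem.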


Conclusion

In this blog post, we explored how to use the sentence-transformers/paraphrase-MiniLM-L12-v2 model to create sentence embeddings effectively. These embeddings can significantly enhance the capabilities of your NLP tasks, whether it’s for semantic search or clustering sentences based on their meanings. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
