Unlocking Semantic Search with Sentence-Transformers

Mar 29, 2024 | Educational

In the rapidly advancing world of artificial intelligence, the ability to understand and compare sentences is essential. The sentence-transformers library, particularly the msmarco-distilbert-base-tas-b model, offers a powerful tool for semantic search applications. This blog will guide you through the process of using this model, as well as provide troubleshooting tips to ensure your experience is smooth and error-free.

What is Sentence-Transformers?

The sentence-transformers library allows you to convert sentences into dense vector representations in a high-dimensional space. Think of this process like distilling a colorful painting into a grid of numbers: each sentence is represented by a set of numbers that captures its meaning, facilitating comparison in semantic search tasks.
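
To make this concrete, here is a minimal sketch (assuming the library is already installed, as shown in the next section) demonstrating that each sentence is mapped to a single fixed-length vector:

from sentence_transformers import SentenceTransformer

# Minimal sketch: each sentence becomes one fixed-length vector
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')
embedding = model.encode("London is a large city.")
print(embedding.shape)  # (768,) - a 768-dimensional vector for this sentence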

Getting Started: Installation

To begin using this model, you need to have the sentence-transformers library installed. You can do this with a simple pip command:

pip install -U sentence-transformers
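
If you want to confirm the installation worked, a quick version check from a Python shell does the trick:

import sentence_transformers

# Confirm the library imports and report the installed version
print(sentence_transformers.__version__)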

Using the Model

After installing the necessary library, you can easily load and use the msmarco-distilbert-base-tas-b model. Below, we walk through both the sentence-transformers API and the HuggingFace Transformers approach, so you can choose whichever fits your workflow.

1. Using Sentence-Transformers

Here’s how to implement semantic search using the sentence-transformers method:

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
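
A note on scoring: this model was trained for dot-product similarity, which is why util.dot_score is used above. If you prefer cosine similarity (for example, to compare scores across models), a variant using the library's util.cos_sim looks like this; keep in mind the model was not tuned for cosine scoring, so absolute values will differ:

from sentence_transformers import util

# Alternative: cosine similarity (length-normalized dot product).
# This model was tuned for dot-product scoring, so treat these
# values as a rough comparison rather than calibrated scores.
cos_scores = util.cos_sim(query_emb, doc_emb)[0].cpu().tolist()
print(cos_scores)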

2. Using HuggingFace Transformers

Alternatively, if you prefer to use the HuggingFace library, here’s how you can do that:

from transformers import AutoTokenizer, AutoModel
import torch

# CLS Pooling - Take output from first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode text (relies on the tokenizer and model loaded below)
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Perform pooling
    embeddings = cls_pooling(model_output)
    return embeddings

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/msmarco-distilbert-base-tas-b')
model = AutoModel.from_pretrained('sentence-transformers/msmarco-distilbert-base-tas-b')

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
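
Both snippets above run on CPU by default. If you have a GPU available, a minimal sketch of moving the HuggingFace variant onto it looks like the following (encode_on_device is a hypothetical helper name, not part of the library):

import torch

# Pick a GPU if one is available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

def encode_on_device(texts):
    # Tokenize and move the inputs to the same device as the model
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    return cls_pooling(model_output)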

Understanding the Code: An Analogy

Imagine you have a set of recipes (documents) and a specific dish in mind (the query). Just as you would skim through the recipes to find the one that best matches your dish, the model does the same thing numerically: each recipe is converted into a vector that represents its content, and the model scores how similar each recipe (document) is to your dish (query), allowing you to retrieve the best match efficiently.

Troubleshooting

As you embark on your semantic search journey, you may encounter some common issues. Here are a few troubleshooting tips:

  • Issue: Installation Errors – Ensure that you are using a compatible version of Python and the required libraries. If you run into issues, try upgrading pip and reinstalling the libraries.
  • Issue: Model Not Found – Verify the model identifier you are passing. It must match the published name exactly: sentence-transformers/msmarco-distilbert-base-tas-b (see the loading check after this list).
  • Issue: Poor Performance – Make sure you’re providing well-formed sentences. Models perform best with clear, context-rich queries, and note that inputs longer than the model’s maximum sequence length are truncated.
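
If you hit the Model Not Found error, a quick sanity check is to load the model inside a try/except so a typo in the identifier surfaces immediately:

from sentence_transformers import SentenceTransformer

# Sanity check: a typo in the identifier raises an error here
model_name = 'sentence-transformers/msmarco-distilbert-base-tas-b'
try:
    model = SentenceTransformer(model_name)
    print(f"Loaded {model_name} successfully.")
except Exception as err:
    print(f"Could not load {model_name}: {err}")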

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By mastering the usage of the msmarco-distilbert-base-tas-b model, you unlock the doors to advanced semantic search capabilities. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
