In the world of artificial intelligence, understanding the meaning behind sentences and phrases is vital for tasks such as semantic search. Today, we’re delving into the MSMARCO DistilBERT model, an impressive tool that maps sentences and paragraphs into a 768-dimensional dense vector space.
What is Semantic Search?
Semantic search is a process that improves search accuracy by understanding the intent of the searcher. Instead of merely matching keywords, it analyzes the meaning behind the words, yielding results that are contextually relevant.
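To see the difference in practice, consider a query that shares no keywords with the passage that answers it: a keyword search would miss it, while an embedding model still ranks the right passage highest. This is only an illustration using the model introduced below, and the exact scores will vary:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
query = "What is the population of the UK capital?"
passages = ["Around 9 Million people live in London.", "London is known for its financial district."]
# Cosine similarity compares meaning rather than keyword overlap
print(util.cos_sim(model.encode(query), model.encode(passages)))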
How to Use the MSMARCO DistilBERT Model
This model is part of the sentence-transformers library and was trained on 500k (query, answer) pairs from the MS MARCO Passages dataset. Here’s how you can use it.
Step 1: Installation
Make sure you have the sentence-transformers library installed. You can do this via pip:
pip install -U sentence-transformers
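To confirm that the installation succeeded, you can print the installed version from Python:
import sentence_transformers
print(sentence_transformers.__version__)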
Step 2: Loading and Using the Model
Here’s an analogy to help you grasp how this model works: Imagine a librarian (the model) who knows where all the relevant books (documents) are stored in a library based on numerous inquiries (queries). This library is vast, and every question can have multiple related books.
Now, when you ask, “How many people live in London?”, the librarian draws on all those past inquiries to point you to the books most relevant to your question.
from sentence_transformers import SentenceTransformer, util
query = "How many people live in London?"
docs = ["Around 9 Million people live in London.", "London is known for its financial district."]
# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)
# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
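For larger document collections, the library also provides util.semantic_search, which scores and sorts passages for you. A minimal sketch that reuses the embeddings computed above:
from sentence_transformers import util
# Returns one list of hits per query; each hit is a dict with 'corpus_id' and 'score'
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]
for hit in hits:
    print(hit['score'], docs[hit['corpus_id']])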
Step 3: Alternative Usage with HuggingFace Transformers
If you prefer not to use the sentence-transformers library, you can work with the model directly through HuggingFace Transformers. This requires a slightly different setup (you handle tokenization, pooling, and normalization yourself) but provides the same functionality.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Mean pooling - average all token embeddings, ignoring padding tokens via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Encode text
def encode(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings
query = "How many people live in London?"
docs = ["Around 9 Million people live in London.", "London is known for its financial district."]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/msmarco-distilbert-cos-v5')
model = AutoModel.from_pretrained('sentence-transformers/msmarco-distilbert-cos-v5')
# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)
# Compute dot score
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
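If a CUDA-capable GPU is available, inference can be sped up by moving the model and the tokenized inputs onto the same device. Here is a sketch of how the encode helper above could be adapted, assuming the tokenizer, model, and mean_pooling function defined earlier:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
def encode(texts):
    # Tokenize, move tensors to the chosen device, then pool and normalize as before
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return F.normalize(embeddings, p=2, dim=1)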
Troubleshooting
Here are some common troubleshooting tips if you encounter issues:
- Ensure that you have the correct version of the sentence-transformers library installed.
- Make sure your Python environment has PyTorch installed, as it’s necessary for running models.
- If you receive errors about model loading, check your internet connection or verify that the model’s name is spelled correctly (a quick check covering these points is sketched after this list).
- For normalization or scoring issues, ensure correct pooling methods are applied as per the library documentation.
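The quick check below covers most of these points at once; it assumes the packages are importable and the machine can reach the HuggingFace Hub:
import torch
import sentence_transformers
from sentence_transformers import SentenceTransformer
print("torch:", torch.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
# If this succeeds, the model name is spelled correctly and the download works
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
print("embedding dimension:", model.get_sentence_embedding_dimension())  # expected: 768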
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Technical Details
Here’s a quick overview of some important parameters:
- Dimensions: 768
- Produces normalized embeddings: Yes
- Pooling-Method: Mean pooling
- Suitable score functions: dot-product, cosine-similarity, or euclidean distance
Note: When loaded with sentence-transformers, the model produces embeddings normalized to unit length (an L2 norm of 1). In this context, dot-product and cosine-similarity are equivalent, although dot-product is preferred because it is faster to compute.
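A small check of this behavior: each embedding should have an L2 norm of roughly 1, and the dot-product and cosine-similarity scores should coincide.
import numpy as np
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
emb = model.encode(["How many people live in London?", "Around 9 Million people live in London."])
print(np.linalg.norm(emb, axis=1))     # each norm should be ~1.0
print(util.dot_score(emb[0], emb[1]))  # dot-product score
print(util.cos_sim(emb[0], emb[1]))    # cosine score, should match the dot-product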
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

