In the world of artificial intelligence, understanding the meaning behind sentences and phrases is vital for tasks such as semantic search. Today, we’re delving into the MSMARCO DistilBERT model, an impressive tool that maps sentences and paragraphs into a 768-dimensional dense vector space.
What is Semantic Search?
Semantic search is a process that improves search accuracy by understanding the intent of the searcher. Instead of merely matching keywords, it analyzes the meaning behind the words, yielding results that are contextually relevant.
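To see the difference in practice, consider a query that shares no keywords with the passage that answers it: a keyword search would miss it, while an embedding model still ranks the right passage highest. This is only an illustration using the model introduced below, and the exact scores will vary:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
query = "What is the population of the UK capital?"
passages = ["Around 9 Million people live in London.", "London is known for its financial district."]
# Cosine similarity compares meaning rather than keyword overlap
print(util.cos_sim(model.encode(query), model.encode(passages)))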
How to Use the MSMARCO DistilBERT Model
This model is part of the sentence-transformers library and was trained on 500k (query, answer) pairs from the MS MARCO Passages dataset. Here’s how you can use it.
Step 1: Installation
Make sure you have the sentence-transformers library installed. You can do this via pip:
pip install -U sentence-transformers
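To confirm that the installation succeeded, you can print the installed version from Python:
import sentence_transformers
print(sentence_transformers.__version__)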
Step 2: Loading and Using the Model
Here’s an analogy to help you grasp how this model works: Imagine a librarian (the model) who knows where all the relevant books (documents) are stored in a library based on numerous inquiries (queries). This library is vast, and every question can have multiple related books.
Now, when you ask, “How many people live in London?”, the librarian draws on all those past inquiries to point you to the books most relevant to your question.
from sentence_transformers import SentenceTransformer, util
query = "How many people live in London?"
docs = ["Around 9 Million people live in London.", "London is known for its financial district."]
# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)
# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
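For larger document collections, the library also provides util.semantic_search, which scores and sorts passages for you. A minimal sketch that reuses the embeddings computed above:
from sentence_transformers import util
# Returns one list of hits per query; each hit is a dict with 'corpus_id' and 'score'
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]
for hit in hits:
    print(hit['score'], docs[hit['corpus_id']])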
Step 3: Alternative Usage with HuggingFace Transformers
If you prefer not to use the sentence-transformers library, you can work with the model directly through HuggingFace Transformers. This requires a slightly different setup (you handle tokenization, pooling, and normalization yourself) but provides the same functionality.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Mean pooling - average all token embeddings, ignoring padding tokens via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Encode text
def encode(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings
query = "How many people live in London?"
docs = ["Around 9 Million people live in London.", "London is known for its financial district."]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/msmarco-distilbert-cos-v5')
model = AutoModel.from_pretrained('sentence-transformers/msmarco-distilbert-cos-v5')
# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)
# Compute dot score
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
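If a CUDA-capable GPU is available, inference can be sped up by moving the model and the tokenized inputs onto the same device. Here is a sketch of how the encode helper above could be adapted, assuming the tokenizer, model, and mean_pooling function defined earlier:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
def encode(texts):
    # Tokenize, move tensors to the chosen device, then pool and normalize as before
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return F.normalize(embeddings, p=2, dim=1)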
Troubleshooting
Here are some common troubleshooting tips if you encounter issues:
- Ensure that you have the correct version of the sentence-transformers library installed.
- Make sure your Python environment has PyTorch installed, as it’s necessary for running models.
- If you receive errors about model loading, check your internet connection or verify that the model’s name is spelled correctly (a quick check covering these points is sketched after this list).
- For normalization or scoring issues, ensure correct pooling methods are applied as per the library documentation.
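The quick check below covers most of these points at once; it assumes the packages are importable and the machine can reach the HuggingFace Hub:
import torch
import sentence_transformers
from sentence_transformers import SentenceTransformer
print("torch:", torch.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
# If this succeeds, the model name is spelled correctly and the download works
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
print("embedding dimension:", model.get_sentence_embedding_dimension())  # expected: 768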
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Technical Details
Here’s a quick overview of some important parameters:
- Dimensions: 768
- Produces normalized embeddings: Yes
- Pooling-Method: Mean pooling
- Suitable score functions: dot-product, cosine-similarity, or euclidean distance
Note: When loaded with sentence-transformers, the model produces embeddings normalized to unit length (an L2 norm of 1). In this context, dot-product and cosine-similarity are equivalent, although dot-product is preferred because it is faster to compute.
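A small check of this behavior: each embedding should have an L2 norm of roughly 1, and the dot-product and cosine-similarity scores should coincide.
import numpy as np
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-cos-v5')
emb = model.encode(["How many people live in London?", "Around 9 Million people live in London."])
print(np.linalg.norm(emb, axis=1))     # each norm should be ~1.0
print(util.dot_score(emb[0], emb[1]))  # dot-product score
print(util.cos_sim(emb[0], emb[1]))    # cosine score, should match the dot-product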
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

