Welcome to your user-friendly guide to using the msmarco-MiniLM-L6-cos-v5 model for semantic search with sentence-transformers. This model maps sentences and paragraphs to a 384-dimensional dense vector space, enabling efficient and meaningful semantic search backed by a robust training dataset.
Why Use msmarco-MiniLM-L6-cos-v5?
This model has been trained on approximately 500,000 (query, answer) pairs from the MS MARCO Passages dataset, making it ideal for semantic search applications. Imagine the model as a highly intelligent librarian that not only understands your question but also knows exactly where to find the books or articles that contain the most relevant information.
How to Implement the Model
Using the msmarco-MiniLM-L6-cos-v5 model is straightforward. Here’s a step-by-step guide:
1. Install Required Library
First, ensure that you have the sentence-transformers library installed. Use the following command:
pip install -U sentence-transformers
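To confirm the installation worked, you can load the model and check that a single sentence is mapped to a 384-dimensional vector. A minimal sketch (the first run downloads the model weights from the Hugging Face Hub):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
emb = model.encode("A quick installation check")
print(emb.shape)  # (384,) for this model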
2. Sample Code for Usage
Here’s a sample implementation to get you started:
from sentence_transformers import SentenceTransformer, util
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
# Encode the query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)
# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
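If you are searching over more than a handful of documents, sentence-transformers also provides util.semantic_search, which handles scoring and top-k ranking for you. A minimal sketch, reusing query_emb, doc_emb, and docs from the example above:
# Rank all documents and keep the best matches for the query
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]
for hit in hits:
    print(hit['score'], docs[hit['corpus_id']])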
Understanding the Code with an Analogy
Think of your query as a question you’d like to ask a wise friend (the model). You present them with a stack of books (the documents), and your friend reads through them, checking which ones contain information related to your question. Your friend then assigns a score to each book based on how well it answers your question and shares the results with you, starting with the most relevant.
Using HuggingFace Transformers as an Alternative
If you prefer utilizing the HuggingFace Transformers library, follow these steps:
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Mean Pooling - Take average of all tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Encode text
def encode(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings
# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
model = AutoModel.from_pretrained('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)
# Compute dot score between query and document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
# Combine docs and scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
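Both code paths should produce the same embeddings, since mean pooling followed by L2 normalization is exactly what the sentence-transformers pipeline applies for this model. A minimal sanity check, assuming the tokenizer, model, and encode function above are defined in the same session:
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5')
st_emb = torch.tensor(st_model.encode(docs))  # sentence-transformers pipeline
hf_emb = encode(docs)                         # manual tokenize -> mean pool -> normalize
print(torch.allclose(st_emb, hf_emb, atol=1e-4))  # expected: True, up to small numerical differences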
Technical Details to Note
- Dimensions: 384
- Produces Normalized Embeddings: Yes
- Pooling-Method: Mean pooling
- Suitable Score Functions: dot-product (util.dot_score), cosine-similarity (util.cos_sim), or euclidean distance
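Because the embeddings are normalized to unit length, dot-product and cosine-similarity produce essentially the same scores and identical rankings. A quick sketch, reusing query_emb and doc_emb from the sentence-transformers example:
# For unit-length embeddings, the dot product equals the cosine similarity
print(util.dot_score(query_emb, doc_emb))
print(util.cos_sim(query_emb, doc_emb))  # expected to match the dot scores (up to floating-point noise)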
Troubleshooting
If you encounter issues while implementing this model, consider the following troubleshooting tips:
- Ensure that you have installed all necessary libraries using the correct versions.
- Check for compatibility issues between the library versions you are using.
- Make sure you are using a compatible Python version.
- If you’re using a GPU, make sure PyTorch was installed with CUDA support and that the model is loaded on the correct device (see the sketch below).
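For the GPU point, here is a minimal sketch that checks whether CUDA is visible to PyTorch and loads the model on the selected device (sentence-transformers accepts a device argument):
import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using device:', device)
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5', device=device)
emb = model.encode("GPU smoke test")  # encoding runs on the selected device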
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
