How to Use the Dense Encoder MSMARCO DistilBERT for Sentence Similarity

Feb 27, 2022 | Educational

If you’re venturing into the realm of natural language processing (NLP) and are looking to enhance your text understanding capabilities, you’ve landed on the right blog! We’re diving into the dense encoder model, specifically the msmarco-distilbert-word2vec256k, designed for sentence similarity. In this guide, we will walk you through the setup, usage, and evaluation of this model.

Understanding the Model

The msmarco-distilbert-word2vec256k model utilizes a vocabulary of 256,000 words initialized with word2vec, and it has been trained using the MS MARCO dataset. If you’ve ever tried to find similarities between two sentences, imagine this model as a discerning librarian who can quickly compare how much two books (or sentences) relate to each other, helping you find the best references in no time.
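
If you want to confirm the enlarged vocabulary yourself, a minimal sketch (assuming the transformers library is installed and that MODEL_NAME holds the full Hugging Face Hub ID of the checkpoint you are using) is to load the tokenizer and count its entries:

from transformers import AutoTokenizer

MODEL_NAME = "msmarco-distilbert-word2vec256k"  # assumption: replace with the full Hub ID of the checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(len(tokenizer))  # roughly 256,000 entries for this model family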

Performance Metrics

When it comes to measuring performance, the model reports the following results:

  • MRR@10 on the MS MARCO dev set (see the model card for the exact score)
  • nDCG@10 on TREC-DL 2019: 65.53
  • nDCG@10 on TREC-DL 2020: 67.42
  • Average score over 4 BEIR datasets: 38.97

Setting Up the Model

Before you dive in, ensure you’ve installed the sentence-transformers library. You can do this via pip:

pip install -U sentence-transformers

Using the Model with Sentence-Transformers

Once you’ve installed the necessary library, you can utilize the model as follows:

from sentence_transformers import SentenceTransformer

# Replace MODEL_NAME with the full Hugging Face Hub ID of the msmarco-distilbert-word2vec256k checkpoint you want to use
MODEL_NAME = "msmarco-distilbert-word2vec256k"

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer(MODEL_NAME)
embeddings = model.encode(sentences)

print(embeddings)
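
To actually compare the sentences, you can score the embeddings with cosine similarity. A minimal follow-up sketch using the cos_sim helper that ships with sentence-transformers, reusing the embeddings from above:

from sentence_transformers import util

# Cosine similarity between every pair of sentences; values closer to 1 mean more similar
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)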

Using the Model with HuggingFace Transformers

If you prefer working without sentence-transformers, you can accomplish similar tasks through HuggingFace Transformers:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds the token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from the Hugging Face Hub
# (as above, replace MODEL_NAME with the full Hub ID of the checkpoint you want to use)
MODEL_NAME = "msmarco-distilbert-word2vec256k"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
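
As with the sentence-transformers route, the raw embeddings become useful once you compare them. A small sketch that scores the example sentences with cosine similarity in plain PyTorch, reusing sentence_embeddings from above:

import torch.nn.functional as F

# Normalize the embeddings so that the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)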

Evaluation Results

To validate the efficacy of your model, refer to the Sentence Embeddings Benchmark, which provides an automated evaluation framework.

Troubleshooting Tips

Here are a few issues you might run into, and how to work around them:

  • Installation Issues: If you’re having trouble with the installation, ensure you have the latest version of pip or try reinstalling the libraries.
  • Model Not Found: Make sure you have the correct model name defined in your code.
  • Memory Errors: This often occurs with large datasets. Consider reducing your batch size (see the sketch after this list) or upgrading your hardware.
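
For the memory case in particular, sentence-transformers lets you lower the batch size directly when encoding. A minimal sketch, reusing the model and sentences from the earlier snippet:

# Smaller batches trade throughput for a lower peak memory footprint
embeddings = model.encode(sentences, batch_size=16, show_progress_bar=True)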

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Training Details

The model training was conducted with specific parameters designed to optimize performance (a rough sketch of how these pieces fit together follows the list):

  • DataLoader: torch.utils.data.dataloader.DataLoader of length 7858
  • Batch size: 64
  • Loss Function: MarginMSELoss
  • Epochs: 30
  • Learning Rate: 2e-05
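
The model card does not publish the full training script, so the following is only a rough sketch of how these settings could be wired together with sentence-transformers; the train_examples list is a placeholder for MS MARCO (query, positive, negative) triplets with teacher margin scores, which you would need to prepare yourself:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer(MODEL_NAME)  # same MODEL_NAME assumption as in the snippets above

# Placeholder data: MarginMSELoss expects (query, positive, negative) triplets
# with a teacher margin score as the label
train_examples = [
    InputExample(texts=["example query", "relevant passage", "irrelevant passage"], label=0.5),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MarginMSELoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=30,
    optimizer_params={"lr": 2e-5},
)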

Full Model Architecture

This is how the model architecture looks:

SentenceTransformer(
    (0): Transformer(max_seq_length: 250, do_lower_case: False) with Transformer model: DistilBertModel
    (1): Pooling(word_embedding_dimension: 768, pooling_mode_cls_token: False, pooling_mode_mean_tokens: True, pooling_mode_max_tokens: False, pooling_mode_mean_sqrt_len_tokens: False)
)
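
You can verify these settings on a loaded model, since SentenceTransformer exposes them directly; a small sketch, reusing MODEL_NAME from the earlier snippets:

model = SentenceTransformer(MODEL_NAME)

print(model)                                     # prints the module list shown above
print(model.max_seq_length)                      # 250 for this model
print(model.get_sentence_embedding_dimension())  # 768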

Conclusion

Understanding and utilizing the dense encoder model, particularly in the context of sentence similarity, is vital for many applications ranging from search engines to recommendation systems. Embrace the power of NLP today, and enhance your projects with this robust model!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
