Have you ever wondered how search engines understand the meaning of your queries? Or how computers can determine if two sentences are similar? The magic often lies in something called sentence embeddings. In this article, we’ll explore the sentence-transformers library, particularly the quora-distilbert-multilingual model, and guide you through its practical applications.
What are Sentence-Transformers?
The sentence-transformers library maps sentences and paragraphs to dense vectors; the quora-distilbert-multilingual model in particular produces 768-dimensional embeddings. Think of this vector space as a giant library where each book (sentence) is shelved not by its title but by its essence, which makes clustering and semantic search efficient.
Getting Started with Sentence-Transformers
Step 1: Install the Library
First, ensure you have the sentence-transformers library installed. You can do this using pip:
pip install -U sentence-transformers
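To confirm the install worked, you can print the library version; a quick sanity check, assuming the usual __version__ convention:
python -c "import sentence_transformers; print(sentence_transformers.__version__)"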
Step 2: Using Sentence-Transformers
Once installed, using the model is straightforward. Here’s how:
from sentence_transformers import SentenceTransformer

# Sentences to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the pretrained model from the Hugging Face Hub
model = SentenceTransformer('sentence-transformers/quora-distilbert-multilingual')

# Encode the sentences into 768-dimensional vectors (one row per sentence)
embeddings = model.encode(sentences)
print(embeddings)
In this code, we are essentially telling the model: “Hey, convert these sentences into their essence!” The model then produces vectors representing the sentences, similar to how an author summarizes a book’s storyline into a captivating blurb.
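Once you have embeddings, comparing them is a one-liner. Here is a minimal sketch using the library's util.cos_sim helper (available in recent versions of sentence-transformers); the two question strings are purely illustrative inputs:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/quora-distilbert-multilingual')
embeddings = model.encode(["How do I learn Python?", "What is the best way to study Python?"])

# Cosine similarity between the two sentence vectors (1.0 = identical meaning)
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)

Because this model was tuned on Quora duplicate questions, scores close to 1.0 suggest the two questions are asking the same thing.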
Using HuggingFace Transformers
If you prefer not to use sentence-transformers, you can leverage HuggingFace Transformers to achieve similar results. Here’s a quick rundown:
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/quora-distilbert-multilingual')
model = AutoModel.from_pretrained('sentence-transformers/quora-distilbert-multilingual')

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool the token embeddings into one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
Here, we’re rebuilding the sentence-transformers pipeline by hand with HuggingFace. The mean pooling function averages all the token embeddings in a sentence, using the attention mask to skip padding tokens: like a judging panel that averages every score while ignoring the empty chairs.
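To make the pooling step concrete, here is a minimal sketch on a toy tensor; the shapes and values are invented purely for illustration:

import torch

# Toy batch: 1 sentence, 4 tokens, 3 embedding dimensions
token_embeddings = torch.tensor([[[1.0, 2.0, 3.0],
                                  [3.0, 2.0, 1.0],
                                  [5.0, 5.0, 5.0],
                                  [0.0, 0.0, 0.0]]])
attention_mask = torch.tensor([[1, 1, 1, 0]])  # the last token is padding

mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
print(pooled)  # tensor([[3., 3., 3.]]): the mean of the three real tokens

The padding token contributes nothing to the sum and is excluded from the count, so the result is the mean of only the real tokens.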
Evaluation Results
For an automated evaluation of this model, check out the Sentence Embeddings Benchmark.
Troubleshooting Common Issues
Here are a few troubleshooting tips to help you along the way:
- Import Errors: Ensure that the required libraries are correctly installed. Running
pip install -U sentence-transformers transformers torch
should resolve most issues.
- Memory Issues: If your system is running out of memory, try decreasing the batch size (see the sketch after this list) or using smaller datasets.
- Model Not Found: Double-check the model name for typos when loading, and use the full identifier sentence-transformers/quora-distilbert-multilingual.
- Invalid Inputs: Pass your sentences as a list of strings when encoding more than one at a time; a bare string is treated as a single input.
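For the memory issue above, a minimal sketch: encode accepts a batch_size argument (32 by default), so lowering it trades speed for a smaller memory footprint. This assumes the model and sentences variables from the earlier example:

# Encode in smaller batches to reduce peak memory usage
embeddings = model.encode(sentences, batch_size=8, show_progress_bar=True)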
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Full Model Architecture
The model includes the following architecture:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
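You can reproduce this printout yourself: a SentenceTransformer is a sequence of modules, and printing a loaded model lists them.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/quora-distilbert-multilingual')
print(model)  # lists the Transformer and Pooling modules shown above

Note the max_seq_length of 128: inputs longer than 128 tokens are truncated before encoding.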
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
We hope this guide helps you navigate the exciting realm of sentence-transformers and transform your text processing workflows!