Using the Salesforce/SFR-Embedding Model for Text Embedding

Jul 1, 2024 | Educational

In Natural Language Processing (NLP), the quality of your text embeddings can significantly affect your models’ performance. Today, we will explore how to use Salesforce’s SFR-Embedding-2_R, a powerful embedding model designed for various text-related tasks.

Getting Started with SFR-Embedding-2_R

To set up and use the SFR-Embedding model, you need Python with the required libraries installed, chiefly PyTorch (the torch package) and Hugging Face Transformers (the transformers package).
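Before going further, it can help to confirm that both packages are importable. The following is a minimal sketch that uses only the standard library; the helper name missing_packages is hypothetical, and the package list assumes only the two libraries named above.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["torch", "transformers"])
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("torch and transformers are available.")
```

The check only looks the packages up; it does not import them, so it stays fast even when they are large.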

Steps to Use the Model

Import Necessary Libraries

First, import the required modules:

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

Define Embedding Functions

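Define a pooling function that turns the model’s final hidden states into one embedding per input. It takes the hidden state of the last real (non-padding) token in each sequence and handles both left-padded and right-padded batches: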

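def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # If every sequence has a real token in the final position, the batch is
    # left-padded (or unpadded), so the final position holds the last token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right-padded: the last real token sits at index (number of real tokens - 1).
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]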
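To see what the two branches do, here is a plain-Python mirror of the same logic, with nested lists standing in for tensors. The name last_token_pool_lists and the toy values are illustrative assumptions, not part of the model or of any library.

```python
def last_token_pool_lists(hidden_states, attention_mask):
    """List-based mirror of last-token pooling, for illustration only."""
    # Left-padded (or unpadded) batch: every mask row ends with a real token.
    left_padding = all(mask[-1] == 1 for mask in attention_mask)
    if left_padding:
        return [seq[-1] for seq in hidden_states]
    # Right-padded batch: the last real token is at index (real token count - 1).
    return [seq[sum(mask) - 1] for seq, mask in zip(hidden_states, attention_mask)]


# Toy batch: 2 sequences, 4 positions, hidden size 1 (values label the position).
hidden = [[[10.0], [11.0], [12.0], [13.0]],
          [[20.0], [21.0], [22.0], [23.0]]]

right_padded = [[1, 1, 1, 1], [1, 1, 0, 0]]  # second sequence has 2 real tokens
left_padded = [[1, 1, 1, 1], [0, 0, 1, 1]]

print(last_token_pool_lists(hidden, right_padded))  # [[13.0], [21.0]]
print(last_token_pool_lists(hidden, left_padded))   # [[13.0], [23.0]]
```

In both cases the function skips the padding positions and returns the vector of the last real token, which is what the tensor version does for a whole batch at once.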

Load the Model and Tokenizer

Next, load the pre-trained model and tokenizer for SFR-Embedding:

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SFR-Embedding-2_R')
model = AutoModel.from_pretrained('Salesforce/SFR-Embedding-2_R')

Prepare Input Data

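Define the task instruction, wrap each query with it, and list the passages you want to compare against. Only the queries carry the instruction; the passages are embedded as plain text:

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Instruction format taken from the model card's example for this model.
    return f'Instruct: {task_description}\nQuery: {query}'

task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'How to bake a chocolate cake'),
    get_detailed_instruct(task, 'Symptoms of the flu')
]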

passages = [
    "To bake a delicious chocolate cake, you'll need the following ingredients: all-purpose flour, sugar...",
    "The flu, or influenza, is an illness caused by influenza viruses..."
]

Get Embeddings

Tokenize the texts, run them through the model, and pool the final hidden states into one embedding per text:

input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=4096, padding=True, truncation=True, return_tensors="pt")
# Inference only, so skip gradient tracking to save memory.
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

Normalize Embeddings and Calculate Similarity Scores

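Finally, L2-normalize the embeddings so that the dot product equals cosine similarity, then score each query against each passage. The first two rows of embeddings are the queries and the remaining rows are the passages, so scores is a 2×2 matrix (queries by passages) scaled by 100: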

embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
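The arithmetic behind this step can be followed by hand on small vectors. The sketch below redoes it in plain Python; the helper names and the 2-D vectors are made up for illustration and are not real model outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score_matrix(query_vecs, passage_vecs):
    """Cosine similarity of every query against every passage, scaled by 100."""
    queries = [l2_normalize(q) for q in query_vecs]
    passages = [l2_normalize(p) for p in passage_vecs]
    return [[100 * sum(a * b for a, b in zip(q, p)) for p in passages]
            for q in queries]


def best_passage(scores):
    """For each query row, return the index of the highest-scoring passage."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Made-up 2-D vectors, chosen only so the arithmetic is easy to check.
toy_queries = [[3.0, 4.0], [0.0, 2.0]]
toy_passages = [[6.0, 8.0], [0.0, 5.0]]

scores = score_matrix(toy_queries, toy_passages)
print(best_passage(scores))  # [0, 1]
```

Each row of the score matrix belongs to one query, so the largest value in a row identifies the passage that best matches that query. Vector length does not matter after normalization: [3, 4] and [6, 8] point in the same direction and therefore score about 100.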

Code Explanation through Analogy

Think of the SFR-Embedding model as a well-trained chef in a bustling kitchen. The queries and passages are the raw ingredients, and importing the libraries is like gathering your pots, pans, and utensils. The model cooks the ingredients using its trained recipes (the pre-trained weights), and the last_token_pool function acts like a sous-chef who reads the attention mask to plate only the part that matters: the hidden state of each text’s last real token, which becomes that text’s embedding. The final dish is a similarity score for each query and passage pair. Just as a good meal comes together in stages, the pipeline moves from prepared queries to embeddings to calculated scores.

Troubleshooting Common Issues

Model Loading Errors: Check that the model name is spelled exactly as 'Salesforce/SFR-Embedding-2_R' and that your environment is set up properly. If you see errors about missing libraries, confirm that torch and transformers are installed in the active Python environment.
Invalid Input Format: The tokenizer expects a string or a list of strings. If it rejects your input, check the list for None values or other non-string types.
Device Compatibility: If running on a GPU, ensure that your PyTorch build has CUDA support, and keep the model and the tokenized inputs on the same device. A CPU-only build will still run, only more slowly.
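For the device check in particular, a small helper can report what PyTorch actually sees. This is a sketch under assumptions: pick_device is a hypothetical name, and the fallback to "cpu" when PyTorch is missing is a choice made here so the check itself never crashes.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed at all; report CPU instead of raising.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")

# With the variables from the steps above, the model and inputs would then be
# moved with model.to(device) and
# {k: v.to(device) for k, v in batch_dict.items()}.
```

If this prints "cpu" on a machine that has a GPU, the installed PyTorch build is most likely a CPU-only one.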
