How to Use E5-Mistral-7B for Text Embeddings

Gone are the days when text embeddings were small, task-specific vectors. With the E5-Mistral-7B model, we can generate rich embeddings that capture complex relationships in text. But how do we use this model effectively? In this article, we’ll walk through the steps to encode queries and passages with E5-Mistral-7B.

Getting Started

Before we dive into the code, let’s get familiar with the E5-Mistral-7B model. It has 32 transformer layers and produces 4096-dimensional embeddings, and it is fine-tuned from Mistral-7B specifically for text embedding tasks. Let’s see how we can use it.
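
If you want to confirm those numbers yourself, here’s a minimal sketch that reads only the model configuration from Hugging Face (assuming the transformers library is installed; no model weights are downloaded):

from transformers import AutoConfig

# Fetch just the configuration file for the model
config = AutoConfig.from_pretrained("intfloat/e5-mistral-7b-instruct")
print(config.num_hidden_layers)  # 32 layers
print(config.hidden_size)        # 4096-dimensional hidden states / embeddings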

Setup

To begin using the E5-Mistral-7B model, ensure you have the sentence-transformers library installed. If you haven’t done that yet, you can do so using pip:

pip install sentence-transformers
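
You can quickly confirm the installation (and the version you have) with a one-liner; note that the prompt_name argument used below requires a reasonably recent sentence-transformers release:

python -c "import sentence_transformers; print(sentence_transformers.__version__)"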

Encoding Queries and Passages

Here’s an illustrative analogy: think of the embedding process like cooking a special dish. Each ingredient (your queries and passages) needs to be chopped (encoded), mixed together (processed), and finally plated (the embeddings). Here’s how you can accomplish that:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# Optional: set maximum sequence length
model.max_seq_length = 4096

queries = [
    "how much protein should a female eat",
    "summit define",
]

documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

query_embeddings = model.encode(queries, prompt_name="web_search_query")
document_embeddings = model.encode(documents)

scores = (query_embeddings @ document_embeddings.T) * 100
print(scores.tolist())

In this code, we load the model, prepare our queries and documents, and encode each set. The line scores = (query_embeddings @ document_embeddings.T) * 100 multiplies the query embeddings by the transposed document embeddings, producing a matrix of similarity scores (one row per query, one column per document, scaled by 100); higher scores indicate a better match.
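
If you want to go one step further and turn those scores into an actual ranking, a small sketch building on the scores array above (which model.encode returns as NumPy arrays by default) could look like this:

import numpy as np

# Sort document indices for each query, from most to least similar
ranking = np.argsort(-scores, axis=1)
for query, order in zip(queries, ranking):
    best = documents[order[0]]
    print(f"{query!r} -> best match: {best[:60]}...")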

Transformers Implementation

If you’re looking for more control, you can also implement the same pipeline directly with the transformers library. To stretch the analogy: this is like setting up a custom kitchen with your own tools so you can adjust every step of the recipe:

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Pool the hidden state of the last non-padding token of each sequence
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

# Each query should come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define')
]

documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."
]

input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
max_length = 4096

# Tokenize the inputs
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
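
One practical note: the snippet above loads the full 7B model in float32 on the CPU, which is slow and memory-hungry. A hedged sketch of running it in half precision on a GPU (assuming a CUDA device is available and the accelerate package is installed for device_map) might look like this:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-mistral-7b-instruct')
# Half precision roughly halves memory use; device_map="auto" places the model on the GPU
model = AutoModel.from_pretrained(
    'intfloat/e5-mistral-7b-instruct',
    torch_dtype=torch.float16,
    device_map="auto",
)

# Move the tokenized batch to the model's device before the forward pass
batch_dict = tokenizer(input_texts, max_length=4096, padding=True,
                       truncation=True, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)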

Troubleshooting

  • Why do I see performance degradation? Make sure each query is paired with a short, one-sentence task instruction (as get_detailed_instruct does above); the documents themselves do not need one. Skipping the instruction on the query side noticeably hurts retrieval quality. A sketch of supplying a custom instruction follows this list.
  • Why are my results slightly different from what I expected? Check which versions of transformers and torch you are running; version differences can cause small numerical variations in the scores.
  • Where can I find the LoRA-only weights? They are linked from the intfloat/e5-mistral-7b-instruct model page on Hugging Face.
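
For the first point, if you are on the sentence-transformers path and want a task instruction other than the built-in web_search_query prompt, you can pass one directly via encode’s prompt argument (available in recent sentence-transformers releases); the sketch below also prints the library versions, which helps with the second point:

import torch
import transformers
import sentence_transformers
from sentence_transformers import SentenceTransformer

# Versions matter for reproducing scores exactly
print(transformers.__version__, torch.__version__, sentence_transformers.__version__)

model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

# A custom one-sentence task description, formatted like get_detailed_instruct above
task = "Given a question, retrieve passages that answer the question"
query_embeddings = model.encode(
    ["how much protein should a female eat"],
    prompt=f"Instruct: {task}\nQuery: ",
)
# Documents (defined as earlier in the article) are still encoded without any instruction
document_embeddings = model.encode(documents)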

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Using the E5-Mistral-7B model for text embedding can tremendously enhance your natural language processing tasks. Just remember, every great dish begins with excellent ingredients, meticulous preparation, and the right tools at your disposal.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
