How to Use the E5-large Model for Sentence Transformers

Aug 8, 2023 | Educational

The E5-large model from Hugging Face provides powerful text embeddings useful for various natural language processing tasks, including classification, retrieval, and clustering. In this guide, we’ll walk through setup, usage, and troubleshooting so you can get productive quickly. Let’s dive in!

Getting Started with E5-large Model

To use the E5-large model for sentence transformers, you must ensure you have all the prerequisites, which include:

  • Python installed
  • The Hugging Face Transformers library
  • The Sentence Transformers package

Installation

Use the following command to install the required packages. Installing sentence-transformers also pulls in the transformers and torch packages, which the example below uses directly:

pip install sentence-transformers~=2.2.2

Using the E5-large Model

The example below shows how to use the E5-large model to encode queries and passages:


import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Define average pooling function
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Define input texts
input_texts = [
    "query: how much protein should a female eat",
    "query: summit define",
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
    "passage: Definition of summit for English Language Learners: 1. the highest point of a mountain; 2. the highest level; 3. a meeting or series of meetings between the leaders of two or more governments."
]

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large")
model = AutoModel.from_pretrained("intfloat/e5-large")

# Tokenize inputs
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch_dict)

# Calculate embeddings
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity scores
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
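Because the embeddings are L2-normalized in the last step, the matrix product above is exactly cosine similarity, scaled by 100 for readability. A minimal sketch with toy vectors (made up for illustration, independent of the model) makes this concrete:

```python
import torch
import torch.nn.functional as F

# Two toy "query" vectors and two toy "passage" vectors
queries = F.normalize(torch.tensor([[1.0, 0.0], [0.0, 1.0]]), p=2, dim=1)
passages = F.normalize(torch.tensor([[1.0, 1.0], [0.0, 2.0]]), p=2, dim=1)

# For unit vectors, the dot product equals the cosine of the angle between them
scores = (queries @ passages.T) * 100
print(scores)  # identical vectors score 100, orthogonal vectors score 0
```

The second query and the second passage point in the same direction, so their score is 100; the first query is orthogonal to the second passage, so their score is 0.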

Understanding the Code

Imagine you’re at a party where you’re trying to decipher conversations and identify individuals based on their dialogues, much like how this code processes input text.

  • Input Texts: Each person’s statement (query or passage) is prefixed clearly to ensure everyone knows whom they’re responding to, akin to using labels in our conversation.
  • Tokenizer: This function decodes our spoken language (the text) into tokens that the model can understand, similar to translating spoken words into written notes.
  • Model Outputs: The responses from the conversations are encoded into vector representations, enabling us to compute similarities—much like comparing notes from different party-goers.
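The pooling step can be checked in isolation. Here is a toy example (the tensors are invented for illustration, not real model outputs) showing that padded positions are excluded from the mean:

```python
import torch
from torch import Tensor

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out hidden states at padded positions, then average over real tokens
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# One "sentence" of 3 tokens where the last token is padding
hidden = torch.tensor([[[2.0, 4.0], [4.0, 8.0], [99.0, 99.0]]])  # (batch=1, seq=3, dim=2)
mask = torch.tensor([[1, 1, 0]])  # the padding token is masked out

pooled = average_pool(hidden, mask)
print(pooled)  # mean of the first two tokens only: [[3., 6.]]
```

The `99.0` values in the padded slot never reach the result, which is exactly why the mask is passed alongside the hidden states in the main example.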

Training and Benchmark Evaluation

To understand the benchmarks of the E5-large model, you can refer to the E5 paper, Text Embeddings by Weakly-Supervised Contrastive Pre-training, or check out the e5 directory of the microsoft/unilm repository for reproducing evaluation results.

Troubleshooting

Even the best evenings can face a few hiccups. Here’s how to address common challenges:

  • Issue: Performance degradation when not using the prefixes.
    Solution: Always prepend inputs with “query: ” or “passage: ”; the model was trained with these prefixes, and omitting them hurts results.
  • Issue: Variability in reproduced results.
    Solution: Pin consistent versions of Transformers and PyTorch across runs.
  • Issue: Cosine similarity scores look unexpectedly high or compressed.
    Solution: Remember that relative scores are what matter for ranking, not the absolute values. The compressed range is an expected consequence of how the model was trained.
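To avoid the prefix pitfall from the first item above, it can help to centralize the prefixing in small helpers. A minimal sketch (the function names here are my own, not part of any library):

```python
def as_query(text: str) -> str:
    """Prefix a search query the way E5 expects."""
    return f"query: {text}"

def as_passage(text: str) -> str:
    """Prefix a document passage the way E5 expects."""
    return f"passage: {text}"

input_texts = [
    as_query("how much protein should a female eat"),
    as_passage("As a general guideline, the CDC's average requirement of protein ..."),
]
print(input_texts[0])  # query: how much protein should a female eat
```

Routing every string through one of these helpers makes it much harder to accidentally feed the model unprefixed text.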

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Embarking on the journey with the E5-large model is an exciting endeavor. Generating sentence embeddings efficiently will open doors to numerous applications in natural language processing, whether you’re retrieving, ranking, or clustering text.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
