How to Use E5-Base-4k for Long Context Retrieval

May 18, 2024 | Educational

In Natural Language Processing (NLP), understanding long texts is becoming increasingly important. E5-Base-4k extends the capabilities of embedding models to long context retrieval tasks, encoding inputs of up to 4,096 tokens in the example below. This blog post is a step-by-step guide to using the E5-Base-4k model effectively.

Understanding the Structure of E5-Base-4k

The E5-Base-4k model comprises 12 layers with an embedding size of 768. Think of it as a layered cake: each layer refines the representation produced by the layer below it, so the final product (the output embeddings) captures more nuanced meanings and relationships than any single layer could.

How to Encode Queries and Passages

To get started with E5-Base-4k, you’ll need to encode queries and passages specifically formatted for the model. Below are the steps broken down in a way that’s easy to follow:

  • Ensure you have the required libraries installed: torch and transformers.
  • Import the necessary packages at the beginning of your script:
    import torch
    import torch.nn.functional as F
    from torch import Tensor
    from transformers import AutoTokenizer, AutoModel
  • Define helper functions to manage the encoding process:
    def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
        # Zero out padding tokens, then average over the real tokens only.
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def get_position_ids(input_ids: Tensor, max_original_positions: int=512, encode_max_length: int=4096) -> Tensor:
        position_ids = list(range(input_ids.size(1)))
        factor = max(encode_max_length // max_original_positions, 1)
        # Inputs that fit within the original window have their positions scaled by factor.
        if input_ids.size(1) <= max_original_positions:
            position_ids = [(pid * factor) for pid in position_ids]

        position_ids = torch.tensor(position_ids, dtype=torch.long)
        # One row of position IDs per sequence in the batch.
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        return position_ids
  • Prepare your input texts by using the appropriate prefixes as shown:
    input_texts = [
        "query: how much protein should a female eat",
        "query: summit define",
        "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
        "passage: Definition of summit for English Language Learners: 1 the highest point of a mountain."
    ]
  • Use the model’s tokenizer and load the pretrained model:
    tokenizer = AutoTokenizer.from_pretrained("dwzhu/e5-base-4k")
    model = AutoModel.from_pretrained("dwzhu/e5-base-4k")
  • Tokenize the input texts and manage the position IDs:
    batch_dict = tokenizer(input_texts, max_length=4096, padding=True, truncation=True, return_tensors='pt')
    batch_dict["position_ids"] = get_position_ids(batch_dict["input_ids"], max_original_positions=512, encode_max_length=4096)
  • Run the model to get embeddings and normalize them:
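    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
    embeddings = F.normalize(embeddings, p=2, dim=1)
    # Rows are the two queries, columns the two passages; cosine similarity scaled by 100.
    scores = (embeddings[:2] @ embeddings[2:].T) * 100
    print(scores.tolist())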
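
To build intuition for the two helper functions above, here is a minimal sketch in plain Python, with no torch required. It mirrors only the arithmetic: how the scaling factor is derived, what scaled position IDs look like when the scaling applies, and how padding is excluded from the average. The names scale_factor, scaled_position_ids and masked_mean are hypothetical and exist only for this illustration; they are not part of the model's code.

```python
from typing import List


def scale_factor(max_original_positions: int = 512, encode_max_length: int = 4096) -> int:
    """Integer ratio between the extended window and the original one (never below 1)."""
    return max(encode_max_length // max_original_positions, 1)


def scaled_position_ids(seq_len: int, factor: int) -> List[int]:
    """Spread seq_len positions across the extended window: 0, factor, 2*factor, ..."""
    return [pid * factor for pid in range(seq_len)]


def masked_mean(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average only the vectors whose mask entry is 1, so padding is ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


factor = scale_factor()                  # 4096 // 512
ids = scaled_position_ids(5, factor)     # toy 5-token sequence
# Third vector is padding (mask 0) and must not affect the average.
pooled = masked_mean([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(factor, ids, pooled)
```

With the default settings the factor is 8, so a 512-token input scaled this way ends at position 511 * 8 = 4088, which still fits inside the 4096-position range.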

Benchmark Evaluation

Benchmark evaluations for E5-Base-4k can be viewed online. Comparing them with your own retrieval results helps you judge the performance and reliability of the model on various datasets.

Troubleshooting Tips

While using E5-Base-4k, you may encounter some common issues. Here’s how to address them:

  • Issue: Model not loading or importing errors. Check that torch and transformers are installed and up to date, and that the model identifier passed to from_pretrained is spelled correctly.
  • Issue: Error in tensor dimensions. Make sure position_ids has the same shape as input_ids, and keep truncation=True with max_length=4096 in the tokenizer call so that inputs exceeding the maximum length are cut down.
  • Issue: Low performance metrics. Review your input prompt structure and ensure every text starts with the correct prefix: "query: " for queries and "passage: " for passages.
  • For persistent issues, or for more insights, updates, and collaboration on AI development projects, stay connected with fxis.ai.
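
The prompt-format check from the tips above can be automated before encoding. The sketch below is one possible way to do it; find_unprefixed and VALID_PREFIXES are hypothetical names introduced for this example, and the two prefixes are taken from the sample inputs earlier in this post.

```python
from typing import List

# Prefixes used by the example inputs in this post.
VALID_PREFIXES = ("query: ", "passage: ")


def find_unprefixed(texts: List[str]) -> List[int]:
    """Return the indexes of texts that do not start with 'query: ' or 'passage: '."""
    return [i for i, text in enumerate(texts) if not text.startswith(VALID_PREFIXES)]


texts = [
    "query: summit define",
    "passage: Definition of summit for English Language Learners: 1 the highest point of a mountain.",
    "how much protein should a female eat",  # prefix left out on purpose
]
print(find_unprefixed(texts))
```

Running a check like this on input_texts before tokenizing catches a missing prefix early, rather than leaving it to show up later as unexpectedly low scores.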

