NoInstruct-small Embedding v0

May 6, 2024 | Educational

NoInstruct Embedding: Asymmetric Pooling is All You Need

This model improves retrieval performance over the avsolatorio/GIST-small-Embedding-v0 model. Retrieval is one of the areas where the GIST family of models fell short. We propose a method that improves retrieval performance without depending on hand-crafted instructions for encoding queries, a paradigm that has become common in embedding models for retrieval tasks. Technical details of the model will be published shortly.

Usage

To use the NoInstruct-small Embedding model, follow these steps:

1. Installing Required Libraries

Ensure you have the necessary libraries installed. You need PyTorch and Hugging Face Transformers. You can install them using:

PyTorch: pip install torch
Transformers: pip install transformers
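
Before moving on, it can help to confirm that both packages are importable from the Python environment you plan to use. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of PyTorch or Transformers.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "torch" and "transformers" are the import names of the two libraries installed above.
missing = missing_packages(["torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, rerun the corresponding pip install command in the same environment.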

2. Loading the Model

You can load the model in Python as follows:


from typing import Union
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")
tokenizer = AutoTokenizer.from_pretrained("avsolatorio/NoInstruct-small-Embedding-v0")

3. Function to Get Embeddings

This function allows you to obtain embeddings for text input:


def get_embedding(text: Union[str, list[str]], mode: str = "sentence"):
    model.eval()  
    assert mode in ("query", "sentence"), f"{mode} was passed but only query and sentence are the supported modes."
    
    if isinstance(text, str):
        text = [text]
    inp = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = model(**inp)
    
    # The model is optimized to use mean pooling for queries, while sentence/document embeddings use the [CLS] representation.
    if mode == "query":
        vectors = output.last_hidden_state * inp["attention_mask"].unsqueeze(2)
        vectors = vectors.sum(dim=1) / inp["attention_mask"].sum(dim=-1).view(-1, 1)
    else:
        vectors = output.last_hidden_state[:, 0, :]
    
    return vectors
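
To make the two pooling modes easier to follow, here is a small sketch that applies the same arithmetic to a hand-written tensor instead of real model output. The toy values and the helper names mean_pool and cls_pool are assumptions for illustration only; the logic mirrors the "query" and "sentence" branches of get_embedding.

```python
import torch

# Toy stand-in for output.last_hidden_state: 2 sequences, 3 tokens each, hidden size 2.
last_hidden_state = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # the last token here is padding
])
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])


def mean_pool(hidden, mask):
    """Average token vectors, ignoring positions where the mask is 0 (query mode)."""
    mask = mask.unsqueeze(-1).to(hidden.dtype)  # (batch, tokens, 1)
    summed = (hidden * mask).sum(dim=1)         # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1)       # (batch, 1), guards against division by zero
    return summed / counts


def cls_pool(hidden):
    """Take the vector of the first token, i.e. [CLS] (sentence mode)."""
    return hidden[:, 0, :]


query_vectors = mean_pool(last_hidden_state, attention_mask)
sentence_vectors = cls_pool(last_hidden_state)
print(query_vectors)     # the padded token [9.0, 9.0] does not affect the second row
print(sentence_vectors)
```

The attention mask matters only for mean pooling: without it, padding tokens would be averaged into the query vector.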

4. Example Usage

Now, you can compute embeddings and cosine similarity:


texts = [
    "Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model.",
    "Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread.",
    "As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands."
]

# Compute embeddings
embeddings = get_embedding(texts, mode="sentence")

# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())

# Test the retrieval performance.
query = get_embedding("Which sentence talks about concept on jobs?", mode="query")
scores = F.cosine_similarity(query, embeddings, dim=-1)
print(scores.cpu().numpy())
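
The retrieval scores above are printed in document order. A common next step is to rank the documents by score. The sketch below does this with torch.topk on made-up 2-dimensional vectors so that it runs without downloading the model; doc_embeddings and query_embedding are placeholders for the outputs of get_embedding.

```python
import torch
import torch.nn.functional as F

# Placeholder vectors standing in for get_embedding(...) outputs.
doc_embeddings = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_embedding = torch.tensor([[0.1, 1.0]])

# One similarity score per document; the query row broadcasts across all documents.
scores = F.cosine_similarity(query_embedding, doc_embeddings, dim=-1)

# Keep the k best-matching documents, highest score first.
top = torch.topk(scores, k=2)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"doc {idx}: {score:.3f}")
```

With real embeddings, top.indices can be used to index back into the texts list to recover the matching passages.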

Troubleshooting

If you encounter issues while using the NoInstruct-small Embedding model, consider the following troubleshooting tips:

Error: Modules Not Found – Ensure that both torch and transformers are installed in the Python environment you are running.
Error: Model Not Found – Verify that the model name passed to from_pretrained is exactly avsolatorio/NoInstruct-small-Embedding-v0.
Error: Dimension Mismatch – Check that the tensors passed to F.cosine_similarity have compatible shapes. For pairwise comparisons, add the extra dimensions with unsqueeze as in the example above.
Performance Issues – Ensure you encode queries with mode="query" and documents with mode="sentence", since the model uses a different pooling for each.
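
As an illustration of the dimension mismatch case, the sketch below uses random tensors (not real embeddings) to show a shape combination that fails and the unsqueeze pattern that resolves it. The sizes 3, 2, and 8 are arbitrary.

```python
import torch
import torch.nn.functional as F

a = torch.randn(3, 8)  # e.g. 3 embeddings of size 8
b = torch.randn(2, 8)  # e.g. 2 embeddings of size 8

# Comparing (3, 8) with (2, 8) directly fails: the leading sizes 3 and 2 cannot broadcast.
try:
    F.cosine_similarity(a, b, dim=-1)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("Shape error:", err)

# Adding singleton dimensions yields every pair: (3, 1, 8) against (1, 2, 8) -> (3, 2).
pairwise = F.cosine_similarity(a.unsqueeze(1), b.unsqueeze(0), dim=-1)
print(pairwise.shape)
```

A single query of shape (1, d) needs no unsqueeze, because a leading dimension of 1 broadcasts against any number of documents.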
