How to Implement Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking

Aug 16, 2024 | Educational

With so much content competing for attention, surfacing the right products for a search query is hard. This post shows how to use a model trained with Generalized Contrastive Learning (GCL) to improve the ranking quality of an information retrieval system, particularly for multi-modal retrieval (text, image, etc.).

What is Generalized Contrastive Learning?

Generalized Contrastive Learning is a training method that aims to improve how models learn to relate items across diverse data types. It is particularly useful when you need to match a search query against candidate items, such as products, and rank those items by how well their titles or descriptions fit the query.
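To make the "contrastive" part concrete, here is a minimal sketch of a generic in-batch contrastive loss (often called InfoNCE) in plain Python. It is an illustration of the general idea only and is not claimed to be the exact GCL objective; the function name `info_nce_loss` and the `temperature` value of 0.05 are assumptions chosen for the example.

```python
import math

def info_nce_loss(sim, temperature=0.05):
    """Mean in-batch contrastive loss for a square similarity matrix.

    sim[i][j] is the similarity between query i and document j.
    The matching (positive) document for query i sits on the diagonal.
    NOTE: generic illustration, not the exact GCL formulation.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-softmax of the positive pair.
        total += log_denom - logits[i]
    return total / len(sim)

# Toy similarity matrices (made-up numbers).
aligned = [[0.9, 0.1], [0.2, 0.8]]   # positives score highest
confused = [[0.1, 0.9], [0.8, 0.2]]  # negatives score highest

print(info_nce_loss(aligned) < info_nce_loss(confused))
```

The loss is small when each query is most similar to its own document and large when it prefers someone else's, which is the pressure that pulls matching pairs together in embedding space.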

Why Use GCL for Multi-Modal Retrieval?

Improved Performance: The ranking results obtained with the GCL-trained E5 model show significant improvement over traditional methods.
Versatility: GCL can handle different types of data, such as text and images, making it suitable for real-world applications.

For example, in a product search, the model not only processes the text of the query but also understands context and meaning, leading to better recommendations.

Getting Started: Usage Instructions

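To try a GCL-trained model, load the pretrained checkpoint from Hugging Face, embed your queries and passages, and score each query against each passage: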


import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Each input text should start with "query: " or "passage: "
input_texts = [
    "query: Espresso Pitcher with Handle",
    "query: Women’s designer handbag sale",
    "passage: Dianoo Espresso Steaming Pitcher, Espresso Milk Frothing Pitcher Stainless Steel",
    "passage: Coach Outlet Eliza Shoulder Bag - Black - One Size"
]

tokenizer = AutoTokenizer.from_pretrained("Marqo/marqo-gcl-e5-large-v2-130")
model_new = AutoModel.from_pretrained("Marqo/marqo-gcl-e5-large-v2-130")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=77, padding=True, truncation=True, return_tensors='pt')
outputs = model_new(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

In this code snippet, think of `input_texts` as ingredients for a recipe. Just as ingredients need to be prepared correctly to create a delicious dish, your inputs need to be structured with specific prefixes ("query:" and "passage:") to guide the model on how to interpret them.
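The `scores` matrix printed by the snippet has one row per query and one column per passage. As a rough sketch of how it could be turned into a ranking, the helper below sorts passages by score for each query. The name `rank_passages`, the shortened passage labels, and the score values are all made up for illustration and are not real model output.

```python
def rank_passages(scores, passages):
    """For each query row, return (passage, score) pairs, best first."""
    ranked = []
    for row in scores:
        # Sort passage indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(passages[j], row[j]) for j in order])
    return ranked

# Illustrative numbers only, NOT real model output.
example_scores = [[88.0, 71.5], [70.2, 86.4]]
passages = ["espresso pitcher listing", "shoulder bag listing"]

for query_ranking in rank_passages(example_scores, passages):
    print(query_ranking[0])  # top-ranked passage for each query
```

In practice you would pass `scores.tolist()` from the snippet above together with your own passage list.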

Understanding the Code

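The script begins by importing the necessary libraries and defining `average_pool`, which average-pools the model's hidden states. It zeroes out the vectors at padding positions using the attention mask, sums the remaining token vectors, and divides by the number of real tokens, producing one fixed-size vector per input text. The function is like a blender that takes all the flavor components (hidden states) and mixes them into a unified, delicious smoothie (pooled tensor).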
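To see the arithmetic behind masked average pooling without any tensors, here is a small plain-Python sketch for a single sequence. The helper name `masked_mean` and the toy values are assumptions for illustration; the real `average_pool` does the same thing for a whole batch at once.

```python
def masked_mean(hidden, mask):
    """Average the token vectors of one sequence, skipping padding.

    hidden: list of token vectors, shape (seq_len, dim)
    mask:   list of 1 (real token) or 0 (padding), length seq_len
    """
    dim = len(hidden[0])
    kept = sum(mask)  # number of real tokens
    sums = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for d in range(dim):
                sums[d] += vec[d]
    return [s / kept for s in sums]

# Two real tokens followed by one padding position (toy values).
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(hidden, mask))  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why the mask matters when texts in a batch have different lengths.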

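Next, the input texts are tokenized, essentially preparing the ingredients for cooking. The model processes these inputs and yields hidden states, which are pooled into one embedding per text. L2 normalization then scales every embedding to unit length, so the dot product between a query and a passage equals their cosine similarity, akin to adjusting the seasoning in a dish for balanced flavor. The final line multiplies the first two embeddings (the queries) by the last two (the passages) and scales the result by 100, giving a 2×2 matrix of query-passage scores.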
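The effect of normalization on the score can be checked by hand. The sketch below uses plain Python and made-up two-dimensional vectors; the helper names `l2_normalize` and `dot` are assumptions for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Same direction, different magnitudes (toy values).
query = l2_normalize([3.0, 4.0])
passage = l2_normalize([6.0, 8.0])
# A vector pointing in an orthogonal direction.
other = l2_normalize([-4.0, 3.0])

# After normalization the dot product is the cosine similarity,
# scaled by 100 as in the snippet above.
print(round(dot(query, passage) * 100, 6))
print(round(dot(query, other) * 100, 6))
```

Because both vectors are unit length, the magnitude of the raw embeddings no longer influences the score; only their direction does.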

Troubleshooting

If you encounter difficulties during the implementation, consider the following troubleshooting steps:

Check Dependencies: Ensure you have the right versions of libraries like `torch` and `transformers` installed.
Input Formatting: Verify that your input texts begin with the correct prefixes – missing these can lead to poor model performance.
Shape Mismatch: If you receive errors related to shapes during tensor operations, double-check the batch dimensions and the tensor shapes.
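The first two checks can be automated before running the model. The following is a minimal preflight sketch using only the standard library; the helper names `installed_version` and `missing_prefix` are hypothetical, and the exact library versions you need are not specified here.

```python
from importlib import metadata

# Prefixes used in the example inputs above.
PREFIXES = ("query: ", "passage: ")

def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def missing_prefix(texts):
    """Return the indices of texts that lack a required prefix."""
    return [i for i, t in enumerate(texts) if not t.startswith(PREFIXES)]

for package in ("torch", "transformers"):
    print(package, installed_version(package))

sample = ["query: Espresso Pitcher with Handle", "Coach Outlet Eliza Shoulder Bag"]
print(missing_prefix(sample))  # [1]
```

An empty list from `missing_prefix` means every input carries a prefix; any index it returns points at a text worth fixing before you embed it.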

Conclusion

Using a model trained with Generalized Contrastive Learning in your retrieval pipeline can improve the relevance of ranked results. By applying contrastive principles to multi-modal data, you can build search systems that are efficient and that rank items closer to what users actually want.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
