How to Use Jina CLIP for Multimodal Retrieval

Jun 11, 2024 | Educational

If you’re interested in diving into the world of multimodal embeddings, you’re in for a treat! In this article, we’ll explore how to effectively use Jina CLIP to retrieve both text and images, allowing you to bridge the gap between these two modalities. Let’s walk through the process step by step.

What is Jina CLIP?

jina-clip-v1 is a cutting-edge English multimodal embedding model that handles both text and image inputs efficiently. Traditional models, while strong at their specific task, often fall short at cross-modal retrieval. Jina CLIP, by contrast, performs well on both text-to-text and text-to-image search, making it a versatile tool for applications such as search and recommendation systems.

How to Get Started with Jina CLIP

Getting started with Jina CLIP takes only a few steps.

Installation

  • Begin by installing the necessary packages using pip:
  • pip install transformers einops timm pillow (prefix the command with ! when running it inside a notebook cell)

Using Jina CLIP with Transformers

Next, you can load the model and start processing your text and image data:

from transformers import AutoModel

# Load the model; trust_remote_code=True is required because
# jina-clip-v1 ships its own modeling code on the Hugging Face Hub
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# Example sentences to embed
sentences = ['A blue cat', 'A red cat']

# Public image URLs to embed
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

# Encode text and images into the same embedding space
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls)
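If you want the dot products below to be true cosine similarities regardless of whether the encoder returns unit-length vectors, you can L2-normalize the embeddings yourself. A minimal sketch, using random 768-dimensional arrays as stand-ins for the real `encode_text` / `encode_image` outputs:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit L2 norm so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

# Stand-ins for model.encode_text(...) / model.encode_image(...) outputs
text_embeddings = l2_normalize(np.random.randn(2, 768))
image_embeddings = l2_normalize(np.random.randn(2, 768))

# A dot product between unit vectors is a cosine similarity in [-1, 1]
cosine = text_embeddings[0] @ image_embeddings[0]
```

After normalization, the `@` products in the next section can be read directly as cosine similarities.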

Understanding the Code: The Train Analogy

Imagine running two train lines that must share the same tracks without delays. The text embeddings are one line and the image embeddings are the other; Jina CLIP places both in a single shared vector space, the common tracks. Just as trains from either line can stop at the same stations, text and images land in the same space, which is what makes it possible to compute similarities between them directly.

Compute Similarities

Once you have your embeddings, it’s time to calculate their similarities:

# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T)  # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T)  # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T)  # text-image cross-modal similarity
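To turn these pairwise scores into an actual retrieval step, you can score every text query against every candidate image in one matrix multiply and keep the best match per query. A sketch with synthetic unit vectors standing in for the real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 2 text queries, 3 candidate images, 768-dim unit vectors
text_embeddings = rng.normal(size=(2, 768))
image_embeddings = rng.normal(size=(3, 768))
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)
image_embeddings /= np.linalg.norm(image_embeddings, axis=1, keepdims=True)

# One matrix multiply gives every text-image similarity at once
scores = text_embeddings @ image_embeddings.T  # shape (2, 3)

# Best-matching image index for each text query
best = scores.argmax(axis=1)
```

With real embeddings, `best[i]` is the image Jina CLIP considers most similar to sentence `i`.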

Performance Metrics

This section looks at how Jina CLIP performs relative to other models on text-image and text-text retrieval tasks. The standard metrics here are recall at rank k, such as R@1 and R@5: the fraction of queries whose correct match appears among the top 1 or top 5 retrieved results.
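If you want to compute these recall numbers on your own data, R@k simply asks how often the correct item lands in the top k results. A sketch with a toy score matrix (the numbers are illustrative, not benchmark results):

```python
import numpy as np

def recall_at_k(scores, correct_indices, k):
    """Fraction of queries whose correct item ranks in the top k by score."""
    # argsort on negated scores puts the highest-scoring candidates first
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [correct in row for correct, row in zip(correct_indices, topk)]
    return sum(hits) / len(hits)

# Toy example: 3 queries scored against 4 candidates each
scores = np.array([
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 ranks first
    [0.2, 0.4, 0.8, 0.1],  # correct candidate 1 ranks second
    [0.5, 0.6, 0.1, 0.7],  # correct candidate 2 ranks last
])
correct = [0, 1, 2]

r_at_1 = recall_at_k(scores, correct, 1)  # only the first query hits
r_at_2 = recall_at_k(scores, correct, 2)  # first two queries hit
```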

Troubleshooting Common Issues

While using Jina CLIP, you might encounter a few common issues. Here are some troubleshooting tips:

  • ValueError regarding model class: If you see a ValueError about inconsistent attribute configurations, pin your Transformers library to a version >4.41.2 or <=4.40.0.
  • Merging similarities: If you’re unsure how to combine text-text and text-image similarities, a simple weighted sum works:
  • combined_scores = sim(text, text) + lambda * sim(text, image)
  • Alternatively, apply z-score normalization to each score list before merging, so the two scales are comparable:
  • # pseudo code
    query_document_mean = np.mean(cos_sim_query_documents)
    query_document_std = np.std(cos_sim_query_documents)
    text_image_mean = np.mean(cos_sim_text_images)
    text_image_std = np.std(cos_sim_text_images)
    
    query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
    text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std
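The pseudo code above can be made concrete. Here is a minimal runnable sketch with synthetic score arrays standing in for the real cosine similarities; `lam` is an illustrative weight you would tune on your own data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic similarity scores standing in for the real cosine similarities
cos_sim_query_documents = rng.uniform(0.2, 0.9, size=10)  # text-to-text scores
cos_sim_text_images = rng.uniform(0.1, 0.6, size=10)      # text-to-image scores

def z_normalize(scores):
    """Shift to zero mean and unit variance so the two score scales are comparable."""
    return (scores - scores.mean()) / scores.std()

text_norm = z_normalize(cos_sim_query_documents)
image_norm = z_normalize(cos_sim_text_images)

# Weighted merge of the two normalized score lists
lam = 0.5  # tuning knob: how much the image signal contributes
combined_scores = text_norm + lam * image_norm
```

Without normalization, whichever modality happens to produce larger raw scores would dominate the sum; z-scoring puts both on the same footing before `lam` weights them.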
    

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Happy coding, and may your journeys into multimodal retrieval be as smooth as a well-oiled train service!
