If you’re interested in diving into the world of multimodal embeddings, you’re in for a treat! In this article, we’ll explore how to effectively use Jina CLIP to retrieve both text and images, allowing you to bridge the gap between these two modalities. Let’s walk through the process step by step.
What is Jina CLIP?
Jina-clip-v1 is a cutting-edge English multimodal embedding model that can handle both text and image inputs efficiently. Traditional text embedding models cannot process images at all, while CLIP-style models handle cross-modal retrieval but tend to underperform on text-only tasks. Jina CLIP performs well at both text-to-text and text-to-image search, making it an indispensable tool for applications such as search and recommendation systems.
How to Get Started with Jina CLIP
Getting up and running with Jina CLIP takes just a few steps.
Installation
- Begin by installing the necessary packages using pip (drop the leading ! if you are running outside a notebook):
!pip install transformers einops timm pillow
Using Jina CLIP with Transformers
Next, you can load the model and start processing your text and image data:
from transformers import AutoModel
# Initialize the model (trust_remote_code=True pulls in Jina's custom encoding code)
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
# Example sentences to embed
sentences = ['A blue cat', 'A red cat']
# Public image URLs
image_urls = [
'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]
# Encode text and images (encode_image accepts image URLs or local paths and loads them for you)
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls)
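Both encode calls return NumPy arrays with one row per input. As a quick sanity check, you can print the shapes; the 768-dimensional size below is what jina-clip-v1's model card reports, so treat it as the expected output rather than a guarantee:
# Quick sanity check: one embedding row per input
print(text_embeddings.shape)   # expected: (2, 768)
print(image_embeddings.shape)  # expected: (2, 768)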
Understanding the Code: The Train Analogy
Imagine running a train service where two lines must share the same tracks. The text embeddings are one train and the image embeddings are another; Jina CLIP puts both on the same tracks by projecting text and images into a single shared embedding space. Because both modalities live in the same space, comparing a sentence to an image is as simple as comparing two vectors.
Compute Similarities
Once you have your embeddings, you can compare them directly; the dot product (the @ operator) between two embedding vectors serves as the similarity score:
# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T) # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T) # text-image cross-modal similarity
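To turn these pairwise scores into actual retrieval, rank every candidate image by its score against a query and keep the best matches. Here is a minimal sketch that reuses the model, image_embeddings, and image_urls variables from the snippets above; retrieve_images is a hypothetical helper written for this article, not part of the Jina CLIP API:
import numpy as np

def retrieve_images(query, image_embeddings, image_urls, top_k=2):
    # Embed the query text, then score it against every image embedding
    query_embedding = model.encode_text([query])[0]
    scores = image_embeddings @ query_embedding        # one dot product per image
    ranked = np.argsort(scores)[::-1][:top_k]          # highest score first
    return [(image_urls[i], float(scores[i])) for i in ranked]

print(retrieve_images('A blue cat', image_embeddings, image_urls))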
Performance Metrics
Jina CLIP's model card reports how it compares with other models on various text-image and text-text retrieval tasks. Metrics such as recall@k (R@1 and R@5), which measure how often the correct item appears among the top one or top five results, are a good way to gauge its retrieval quality.
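If you want to reproduce such numbers on your own data, recall@k is simple to compute. The sketch below is a generic implementation, assuming a score matrix with one row per query and the index of each query's ground-truth match:
import numpy as np

def recall_at_k(scores, correct_indices, k):
    # scores: (num_queries, num_candidates); correct_indices: true match per query
    top_k = np.argsort(-scores, axis=1)[:, :k]   # top-k candidates, best first
    hits = [truth in row for truth, row in zip(correct_indices, top_k)]
    return float(np.mean(hits))

# Toy example: both queries rank their true match first, so R@1 is 1.0
print(recall_at_k(np.array([[0.9, 0.1], [0.2, 0.8]]), [0, 1], k=1))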
Troubleshooting Common Issues
While using Jina CLIP, you might encounter a few common issues. Here are some troubleshooting tips:
- ValueError regarding model class: If you face a ValueError about inconsistent attribute configurations, make sure your transformers version is either above 4.41.2 or at most 4.40.0 (for example, pip install "transformers<=4.40.0").
- Merging Similarities: If you're unsure how to combine text-to-text and text-to-image scores, a simple weighted sum works well once both score distributions have been normalized onto a comparable scale, as in the sketch below:
# Z-score normalize each score distribution so the two are comparable
import numpy as np

text_text_mean = np.mean(cos_sim_text_texts)
text_text_std = np.std(cos_sim_text_texts)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)

text_text_sim_normalized = (cos_sim_text_texts - text_text_mean) / text_text_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std

# Weighted sum ('lambda' is a reserved word in Python, hence lambda_)
lambda_ = 0.5  # example weight; tune on validation queries
combined_scores = text_text_sim_normalized + lambda_ * text_image_sim_normalized
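Normalizing first matters because text-text and text-image similarities tend to occupy different numeric ranges, so an unnormalized sum would let one signal drown out the other. Continuing the sketch above, ranking then reduces to a sort over the combined scores:
# Rank candidates from best to worst by combined score
ranking = np.argsort(-combined_scores)
print(ranking[:5])  # indices of the five best candidates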
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Happy coding, and may your journeys into multimodal retrieval be as smooth as a well-oiled train service!