Welcome to your guide on using the NLLB-SigLIP-MRL model for zero-shot image classification! This model combines a multilingual text encoder with an image encoder, extending its reach across 201 languages. In this article, we’ll walk through implementing the model step by step, troubleshooting common issues, and understanding how it works under the hood. Let’s dive right in!
Model Summary
The NLLB-SigLIP-MRL model combines the text encoder from the NLLB model with the image encoder from the SigLIP model, enabling multilingual image and text retrieval and setting new benchmarks in this domain.
Thanks to a training method called Matryoshka Representation Learning, the model produces embeddings of several sizes (32, 64, 128, 256, 512, and the original 1152), letting you trade embedding size for quality. The trade-off is gentle: embeddings of size 256 and 512 retain over 90% of the quality of the full embedding!
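In practice, this means you can truncate a full embedding to its first k dimensions and re-normalize it. The snippet below is a minimal sketch of that idea in plain PyTorch; full_embedding is a random stand-in, not an actual model output:
import torch
full_embedding = torch.randn(1, 1152)  # stand-in for a full-size (1152-dim) embedding
truncated = full_embedding[:, :256]  # keep only the first 256 Matryoshka dimensions
truncated = truncated / truncated.norm(dim=-1, keepdim=True)  # re-normalize before computing similarities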
How to Use the Model
To use the model with variable embedding sizes, follow these steps:
Step 1: Install necessary libraries
!pip install -U transformers open_clip_torch
Step 2: Import Libraries and Load Model
from transformers import AutoModel
from PIL import Image
import requests
import torch
model = AutoModel.from_pretrained('visheratin/nllb-siglip-mrl-large', device='cpu', trust_remote_code=True)
Step 3: Load and Prepare Your Image
image_path = 'https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg'
image = Image.open(requests.get(image_path, stream=True).raw)
Step 4: Define Class Options and Languages
class_options = ['бабочка', 'butterfly', 'kat']
class_langs = ['rus_Cyrl', 'eng_Latn', 'afr_Latn']
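Each entry in class_langs is a FLORES-200 language code and must line up index-for-index with its entry in class_options. As a purely illustrative extension of the lists above, you could add a German option like this:
class_options.append('Schmetterling')  # 'butterfly' in German
class_langs.append('deu_Latn')  # the matching FLORES-200 code for German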
Step 5: Get Logits for Image Classification
image_logits, text_logits = model.get_logits(
images=[image],
texts=class_options,
langs=class_langs,
resolution=512 # set resolution here or set None to use the original resolution
)
print(torch.softmax(image_logits, dim=1))
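The softmax turns the logits into per-class probabilities. To recover the winning label, take the argmax along the class dimension; this short follow-up assumes the single-image call above:
probs = torch.softmax(image_logits, dim=1)
predicted = class_options[probs.argmax(dim=1).item()]  # index of the highest-probability class
print(f'Predicted class: {predicted}')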
Understanding the Code: An Analogy
Imagine you’re running a library where each book represents a class option, and your guests speak different languages. The NLLB-SigLIP-MRL model is like a library assistant who knows the contents of every book and can tell which book matches what a guest is describing, no matter the language spoken.
- The guests (input images) enter the library.
- The books on the shelves (class options) are labeled in different languages (the language codes you supply).
- The assistant (model) compares each guest’s request against every book and hands over the best match (the predicted class), keeping everything running smoothly.
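Under the hood, this matching boils down to cosine similarity between normalized embeddings. The sketch below is purely illustrative, using random tensors in place of real encoder outputs:
import torch
image_emb = torch.randn(1, 512)  # stand-in for one image embedding
text_embs = torch.randn(3, 512)  # stand-ins for the three class-text embeddings
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)  # L2-normalize
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
logits = image_emb @ text_embs.T  # cosine similarities, shape (1, 3)
print(torch.softmax(logits, dim=1))  # probabilities over the three classes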
Troubleshooting Common Issues
If you encounter issues while using the model, here are some steps you can take:
- Ensure that all libraries are correctly installed and up to date.
- Check that the image URL is accessible; a broken or moved link will make Image.open fail (see the sanity check after this list).
- Confirm your device setting (CPU vs. GPU) if you run into performance issues.
- Inspect the class option and language pairs to make sure they are aligned index-for-index before making predictions.
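As a quick sanity check, the illustrative snippet below verifies that the image URL responds and picks an available device; it assumes the image_path variable from Step 3:
import requests
import torch
response = requests.get(image_path, stream=True)
print(response.status_code)  # anything other than 200 means the image is not reachable
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # pick a GPU when one is available
print(f'Using device: {device}')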
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.