How to Use the SigLIP Model for Zero-Shot Image Classification

Jan 22, 2024 | Educational

In the realm of image classification, models continue to evolve, and SigLIP, pre-trained on the WebLI dataset, showcases this evolution with its strong zero-shot capabilities. This guide will help you use the SigLIP model for zero-shot image classification, along with troubleshooting tips to ensure a smooth experience.

Understanding the SigLIP Model

SigLIP (Sigmoid Loss for Language Image Pre-training) pairs a shape-optimized vision transformer, SoViT-400m, with a sigmoid-based training objective. Unlike the softmax contrastive loss used by CLIP, which must normalize over every image-text similarity in the batch, the sigmoid loss treats each image-text pair as an independent binary classification: matching pairs are positives, all other combinations are negatives. Because no global view of the batch is required, the loss scales gracefully to both very large and very small batch sizes, and the resulting model performs strongly on tasks such as zero-shot image classification and image-text retrieval.
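To make the objective concrete, here is a minimal PyTorch sketch of the pairwise sigmoid loss. This is our own illustration based on the SigLIP paper, not the actual training code; the embeddings are assumed to be L2-normalized, and t and b stand for the learnable temperature and bias:

import torch
import torch.nn.functional as F

def sigmoid_loss(img_emb, txt_emb, t, b):
    # img_emb, txt_emb: L2-normalized embeddings of shape (n, d)
    logits = img_emb @ txt_emb.T * t + b                          # (n, n) similarity matrix
    labels = 2 * torch.eye(len(logits), device=logits.device) - 1 # +1 on the diagonal, -1 elsewhere
    # Every pair is scored independently, so no batch-wide softmax is needed
    return -F.logsigmoid(labels * logits).sum() / len(logits)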

Getting Started with the Model

To begin using the SigLIP model for zero-shot image classification, follow the steps below.

Installation

  • Ensure you have the transformers library installed, along with torch, Pillow, and requests, which the examples below rely on. You can install everything via pip:

pip install transformers torch pillow requests
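Since SigLIP support is relatively recent, it is worth confirming the installed version as a quick sanity check (to the best of our knowledge, SigLIP was added in transformers v4.37):

import transformers
print(transformers.__version__)  # SigLIP requires transformers >= 4.37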

Using the Model

Here’s a step-by-step approach:

Step 1: Load Necessary Libraries

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

Step 2: Initialize the Model and Processor

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
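The SoViT-400m backbone is large, so inference is noticeably faster on a GPU when one is available. A small optional addition (the device handling below is our own sketch, assuming a CUDA device; adapt as needed for your hardware):

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # move the weights to the GPU if present
model.eval()              # inference mode

If you do this, remember to also move the processor outputs to the same device later with inputs = inputs.to(device).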

Step 3: Load an Image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
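The example downloads a COCO test image, but PIL can just as easily open a file from disk (the path below is purely illustrative):

image = Image.open("path/to/your/image.jpg").convert("RGB")  # local file; force 3-channel RGB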

Step 4: Prepare Text Labels

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
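Two notes here. First, padding="max_length" matters: the model was trained with max-length padding, and other padding strategies can degrade results. Second, label phrasing affects zero-shot accuracy; wrapping bare class names in a template such as "a photo of ..." usually helps. A small illustrative sketch of templating raw labels:

candidate_labels = ["2 cats", "2 dogs"]
texts = [f"a photo of {label}" for label in candidate_labels]  # template the raw class names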

Step 5: Perform Inference

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")

Alternative Simplified Method

If you prefer a simplified method, you can use the pipeline API which abstracts the complexity:

from PIL import Image
import requests
from transformers import pipeline

# load pipeline
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]
print(outputs)
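To extract just the top prediction from the returned list of score/label dicts, max() keeps you from relying on the output ordering:

best = max(outputs, key=lambda o: o["score"])  # highest-scoring candidate label
print(f"Top prediction: {best['label']} ({best['score']:.1%})")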

Troubleshooting

  • If inference is slow, you are most likely running on CPU; check for a GPU as shown in the snippet below this list. Very large images also add preprocessing overhead, since the processor resizes every input to 384x384.
  • Check that your transformers version supports SigLIP (v4.37 or later) and is compatible with your Python environment.
  • If downloading the checkpoint or the example image fails, retry on a stable connection or test with a different network.
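When troubleshooting, it often helps to print the basic facts about your environment first; a short diagnostic sketch:

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # False means CPU-only inference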

Conclusion

The SigLIP model, with its unique approach to loss functions and architecture, opens up new possibilities for image classification tasks. By following this guide, you should be equipped to explore its capabilities effectively. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
