How to Use MobileCLIP: A Fast Image-Text Model

MobileCLIP is an efficient family of image-text models introduced in the paper MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. This blog walks you step by step through using MobileCLIP in your own projects, focusing on the MobileCLIP-S0 checkpoint.

Understanding MobileCLIP

To put MobileCLIP into context, imagine it as a high-speed train that efficiently transports you between two cities: images and text. Just like a train is designed for speed and comfort, MobileCLIP provides a seamlessly integrated platform that handles the complexities of image-text pair processing, allowing researchers and developers to achieve fantastic results without the need for overly bulky models.

Highlights of MobileCLIP

  • MobileCLIP-S0 matches the zero-shot performance of OpenAI’s ViT-B/16 model while being 4.8x faster and 2.8x smaller.
  • MobileCLIP-S2 surpasses the average zero-shot performance of SigLIP’s ViT-B/16 model while being 2.3x faster and 2.1x smaller, and it was trained with three times fewer seen samples.
  • MobileCLIP-B (LT) achieves an impressive ImageNet zero-shot performance of 77.2%, outpacing models like DFN and SigLIP.

How to Use MobileCLIP

Getting started with MobileCLIP is straightforward and involves the following steps:

  1. First, download the desired checkpoint by visiting the links provided in the checkpoints table and clicking on the “Files and versions” tab.
  2. To download programmatically, make sure huggingface_hub is installed and run: huggingface-cli download pcuenq/MobileCLIP-S0 (a Python alternative is sketched after the inference example below).
  3. Install the ml-mobileclip library by following the instructions provided in the repository. It features an API similar to the open_clip library.
  4. Run the following code snippet for inference:
import torch
from PIL import Image
import mobileclip

# Load the MobileCLIP-S0 model, its image transforms, and the tokenizer.
model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s0', pretrained='path_to_mobileclip_s0.pt')
tokenizer = mobileclip.get_tokenizer('mobileclip_s0')

# Preprocess one image and tokenize the candidate labels.
image = preprocess(Image.open('docs/fig_accuracy_latency.png').convert('RGB')).unsqueeze(0)
text = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# L2-normalize both feature sets before computing cosine similarities.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print('Label probs:', text_probs)
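
If you prefer to do the download from step 2 in Python instead of the CLI, the minimal sketch below uses huggingface_hub. The filename mobileclip_s0.pt is an assumption here; check the repository’s “Files and versions” tab for the exact name.

from huggingface_hub import hf_hub_download
import mobileclip

# Download the checkpoint from the Hub (filename assumed; verify it in the repo).
checkpoint_path = hf_hub_download(repo_id='pcuenq/MobileCLIP-S0', filename='mobileclip_s0.pt')

# Point the loader at the downloaded file rather than a hand-written path.
model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s0', pretrained=checkpoint_path)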

Troubleshooting

While using MobileCLIP, you may encounter challenges. Here are some troubleshooting ideas:

  • Ensure that you have installed all required packages correctly. A missing library can lead to import errors.
  • Verify that the checkpoint files are correctly downloaded. Any interruption during download can corrupt files.
  • If you face issues with image processing, ensure that the images are in an accepted format and the path is correctly specified.
  • For normalization issues with image and text features, make sure both feature tensors are L2-normalized (divided by their norm along the last dimension) and have matching embedding dimensions before computing similarities; a quick sanity check is sketched after this list.
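
For the last bullet, the following quick check assumes the image_features and text_features tensors from the inference snippet above; the shapes in the comments are illustrative.

# After dividing by the norm, every feature vector should have length close to 1.
print(image_features.norm(dim=-1))   # expected: values near 1.0
print(text_features.norm(dim=-1))    # expected: values near 1.0

# The last dimension must match for image_features @ text_features.T to work.
print(image_features.shape, text_features.shape)  # e.g. (1, D) and (3, D), where D is the embedding dimension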

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
