How to Use EVA-CLIP-18B: The Largest Open-Source CLIP Model

Feb 9, 2024 | Educational

This guide covers EVA-CLIP-18B, a contrastive language-image pretraining (CLIP) model with 18 billion parameters. It reports strong zero-shot results across a wide range of image classification benchmarks. Below, we walk through its headline results, usage instructions, and troubleshooting tips so you can put the model to work effectively.

Summary of EVA-CLIP Performance

  • Achieves 80.7% zero-shot top-1 accuracy, averaged across 27 image classification benchmarks.
  • Outperforms earlier, smaller models such as EVA-CLIP-8B (79.4% average accuracy), showing the benefit of scaling.
  • Trains on a refined dataset of 2 billion image-text pairs drawn from the LAION-2B and COYO-700M datasets.

Model Card

EVA-CLIP-8B

Model Name: EVA-CLIP-8B
Total Parameters: 8.1B
Average Accuracy: 79.4%
Download Weights: [Download PyTorch Weights](https://huggingface.co/BAAI/EVA-CLIP-8B)

EVA-CLIP-18B

Model Name: EVA-CLIP-18B
Total Parameters: 18.1B
Average Accuracy: 80.7%
Download Weights: Stay tuned for the release!
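
The parameter counts above give a quick lower bound on how much memory the weights alone will need. The sketch below is back-of-the-envelope arithmetic (parameter count times bytes per parameter), not a measurement; the helper name `weight_memory_gb` is made up for illustration, and activations, buffers, and framework overhead are not included.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# This ignores activations and framework overhead, so treat it as a lower bound.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts taken from the model card above.
for name, params in [("EVA-CLIP-8B", 8.1e9), ("EVA-CLIP-18B", 18.1e9)]:
    fp16 = weight_memory_gb(params, "float16")
    fp32 = weight_memory_gb(params, "float32")
    print(f"{name}: ~{fp16:.1f} GB in float16, ~{fp32:.1f} GB in float32")
```

By this estimate, half precision roughly halves the footprint compared with float32, which is why the usage examples load the model with `torch.float16`.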

Usage Instructions

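You can run EVA-CLIP either through the Hugging Face transformers library or through the project's own PyTorch implementation. Because the EVA-CLIP-18B weights are not yet released (see the model card above), the examples below load the EVA-CLIP-8B checkpoint; the same steps should apply to the 18B model once its weights are available.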

Using Hugging Face Version

python
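from PIL import Image
from transformers import AutoModel, CLIPImageProcessor, CLIPTokenizer
import torch

image_path = "CLIP.png"
model_name_or_path = "BAAI/EVA-CLIP-8B"  # or a local path to the downloaded checkpoint

# Image preprocessing. Assumption: the model repo ships a preprocessor config.
# If it does not, a standard CLIP processor such as "openai/clip-vit-large-patch14"
# can be loaded here instead.
processor = CLIPImageProcessor.from_pretrained(model_name_or_path)
tokenizer = CLIPTokenizer.from_pretrained(model_name_or_path)

# The checkpoint relies on custom modeling code hosted with the weights,
# so AutoModel needs trust_remote_code=True to load it.
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda").eval()

image = Image.open(image_path)
captions = ["a diagram", "a dog", "a cat"]

input_ids = tokenizer(captions, return_tensors="pt", padding=True).input_ids.to("cuda")
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to("cuda")

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(input_pixels)
    text_features = model.encode_text(input_ids)
    # L2-normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scale the similarities and turn them into probabilities over the captions.
    label_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(f"Label probs: {label_probs}")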
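
To make the final scoring step less opaque, here is a minimal pure-Python sketch of what a CLIP-style zero-shot score computes: L2-normalize the image and text embeddings, take their dot products (cosine similarities), multiply by a scale of 100, and apply a softmax over the captions. The 3-dimensional vectors are made-up toy values, and `l2_normalize` and `zero_shot_probs` are hypothetical helper names, not part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N captions."""
    img = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_vecs
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (made up): the image points mostly along the first caption.
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image_vec, text_vecs)
print([round(p, 4) for p in probs])
```

Because of the normalization, only the direction of each embedding matters: rescaling `image_vec` leaves the probabilities unchanged. The large scale factor makes the softmax sharp, so small differences in cosine similarity translate into confident predictions.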

Using PyTorch Version

python
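import torch
from eva_clip import create_model_and_transforms, get_tokenizer
from PIL import Image

model_name = "EVA-CLIP-8B"
pretrained = "eva_clip"  # pretrained tag, or a path to a local checkpoint file
image_path = "CLIP.png"
captions = ["a diagram", "a dog", "a cat"]
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the model together with its matching image preprocessing transform.
model, _, processor = create_model_and_transforms(model_name, pretrained, force_custom_clip=True)
tokenizer = get_tokenizer(model_name)
model = model.to(device).eval()

# Preprocess the image (adding a batch dimension) and tokenize the captions.
image = processor(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenizer(captions).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)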
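
The printed tensor holds one probability per caption, in the same order as `captions`. A small helper can pair them up and sort them so the top prediction is easy to read. This is a sketch with made-up probabilities; `rank_labels` is a hypothetical name, and with the real output you would pass something like `text_probs[0].tolist()` in place of `example_probs`.

```python
def rank_labels(captions, probs):
    """Pair each caption with its probability and sort from most to least likely."""
    if len(captions) != len(probs):
        raise ValueError("captions and probs must have the same length")
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)


captions = ["a diagram", "a dog", "a cat"]
example_probs = [0.92, 0.05, 0.03]  # made-up values standing in for model output

for caption, prob in rank_labels(captions, example_probs):
    print(f"{caption}: {prob:.1%}")
```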

Troubleshooting

If you run into any issues while using EVA-CLIP-18B, here are some troubleshooting steps to consider:

  • Memory Issues: Make sure your GPU has enough memory for a model of this size. If you hit out-of-memory errors, load the weights in half precision as in the examples above, and consider using DeepSpeed to optimize model loading.
  • Import Errors: Make sure all required libraries are installed, especially transformers, torch, and an image library such as Pillow (imported as PIL).
  • Model Download: Check your internet connection when attempting to download model weights. If issues persist, try downloading from a different network.
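
For the import-error case, a quick check can tell you which packages are missing before you try to load a multi-gigabyte model. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list of module names is an assumption based on the imports in the Hugging Face example above.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the Hugging Face example (PIL is provided by Pillow).
required = ["torch", "transformers", "PIL"]
missing = missing_packages(required)

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check only confirms that the modules can be located; it does not verify versions or GPU support.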

Conclusion

EVA-CLIP-18B is a significant step for multimodal AI, showing that scaling CLIP models continues to improve zero-shot performance at the intersection of vision and language. With the usage instructions and troubleshooting tips above, you can start applying these models in your own applications and research.
