Welcome to a journey of unlocking the power of the Japanese CLIP (Contrastive Language-Image Pre-training) model! Developed by LY Corporation, this model is designed for vision-language tasks such as zero-shot image classification and text-to-image retrieval.
Getting Started
To start using the CLIP Japanese base model, we will break down the installation and execution process into simple steps. Let’s gear up!
1. Install Required Packages
First, install the Python packages the model depends on. Open your terminal and run the following command:
```bash
pip install pillow requests sentencepiece transformers torch timm
```
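If you want to confirm the installation before moving on, a quick optional import check is enough (a minimal sketch; the version printout is just for reference):

```python
# Optional sanity check that the installed packages import cleanly.
import PIL
import sentencepiece
import timm
import torch
import transformers

print("torch", torch.__version__, "| transformers", transformers.__version__)
```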
2. Run the Model
Now that we have the prerequisites in place, it’s time to execute the model. Below, you’ll find a Python code snippet to get it started:
```python
import io

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = "line-corporation/clip-japanese-base"

# Load the tokenizer, image processor, and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)

# Run on a GPU if one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

# Download a sample image and prepare the image and text inputs.
image = Image.open(io.BytesIO(requests.get("https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260").content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)  # "dog", "cat", "elephant"

# Encode both modalities and convert similarities into label probabilities.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
This code snippet performs the following steps:
- Imports the required libraries and loads the tokenizer, image processor, and model from the Hugging Face Hub.
- Downloads a sample image from a URL.
- Preprocesses the image and tokenizes the candidate labels (犬 "dog", 猫 "cat", 象 "elephant").
- Encodes both the image and the labels, then computes the probability that the image matches each label; a short follow-up sketch for mapping these probabilities back to their labels appears below.
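The printed `text_probs` tensor contains one probability per candidate label, in the same order as the list passed to the tokenizer. As a small follow-up sketch (reusing `text_probs` from the snippet above; the label list simply mirrors the tokenized strings), you can map the scores back to readable labels:

```python
# Map each probability back to its label; `text_probs` comes from the snippet above.
labels = ["犬", "猫", "象"]
for label, prob in zip(labels, text_probs[0].tolist()):
    print(f"{label}: {prob:.3f}")

print("Best match:", labels[int(text_probs[0].argmax())])
```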
Model Architecture
The architecture of this model consists of an Eva02-B Transformer as the image encoder and a 12-layer BERT for the text encoder. The text encoder has been initialized from rinna/japanese-clip-vit-b-16.
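If you would like a rough, local sanity check of the model size, the following minimal sketch (assuming the `model` object from the earlier snippet is already loaded) counts the total parameters:

```python
# Rough sanity check of model size, assuming `model` from the earlier snippet.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.0f}M")
```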
Evaluation
The performance of the model can be evaluated against several Japanese datasets, such as the following (a minimal sketch of a zero-shot accuracy loop appears after this list):
- STAIR Captions for image-to-text and text-to-image retrieval.
- Recruit Datasets for image classification.
- ImageNet-1K, where all class names are translated into Japanese.
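To illustrate how such an evaluation could be wired up, here is a hypothetical sketch of a zero-shot top-1 accuracy loop. It reuses `model`, `processor`, `tokenizer`, and `device` from the earlier snippet and assumes a hypothetical list `samples` of `(PIL.Image, label_index)` pairs plus a list `class_names` of Japanese class names; it is not the exact evaluation code behind the reported numbers.

```python
import torch

def zero_shot_top1_accuracy(samples, class_names):
    """Hypothetical acc@1 loop; `samples` is a list of (PIL.Image, label_index) pairs."""
    with torch.no_grad():
        # Encode all candidate class names once.
        text_inputs = tokenizer(class_names).to(device)
        text_features = model.get_text_features(**text_inputs)

        correct = 0
        for pil_image, label_index in samples:
            image_inputs = processor(pil_image, return_tensors="pt").to(device)
            image_features = model.get_image_features(**image_inputs)
            # The predicted class is the text label with the highest similarity.
            predicted = (image_features @ text_features.T).argmax(dim=-1).item()
            correct += int(predicted == label_index)

    return correct / len(samples)
```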
Results
Here’s a quick glance at our model’s performance compared to others:
Model | Image Encoder Params | Text Encoder Params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1)
-------------------|---------------------|---------------------|-----------------------|---------------------------|---------------------
Ours | 86M (Eva02-B) | 100M (BERT) | 0.30 | 0.89 | 0.58
Stable-ja-clip | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | 0.68
Rinna-ja-clip | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56
Laion-clip | 632M (ViT-H) | 561M (XLM-RoBERTa) | 0.30 | 0.83 | 0.58
Hakuhodo-ja-clip | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46
Troubleshooting
If you encounter any issues while using the CLIP Japanese model, here are some tips to consider:
- Ensure that all packages are correctly installed; re-running the `pip install` command is safe and will simply skip requirements that are already satisfied.
- Check your internet connection if the image fails to download (see the defensive download sketch after this list).
- Verify that the correct model path (`line-corporation/clip-japanese-base`) is being used in your code.
- If a memory error occurs, consider reducing batch sizes or image dimensions.
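For the download issue in particular, a small defensive-download sketch (using the same `requests` and Pillow calls as the main snippet; the `timeout` value is just an illustrative choice) makes network failures easier to diagnose:

```python
import io

import requests
from PIL import Image

def fetch_image(url, timeout=10):
    """Download an image and fail loudly on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
    return Image.open(io.BytesIO(response.content)).convert("RGB")
```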
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
