The latest developments in computer vision offer fascinating tools for image analysis. One that stands out is the Vision Transformer (ViT), specifically the vit_base_patch16_224.dino variant, which uses the self-supervised DINO (self-distillation with no labels) method for feature learning. In this guide, we walk you through using this model both for image classification and for extracting image embeddings.
Model Details
Before we dive into implementation, let’s take a moment to appreciate the capabilities of the ViT base model (a quick way to verify these numbers follows the list):
- Model Type: Image classification feature backbone
- Parameters (M): 85.8
- GMACs: 16.9
- Activations (M): 16.5
- Image Size: 224 x 224
- Pretrain Dataset: ImageNet-1k
- Papers: Emerging Properties in Self-Supervised Vision Transformers (https://arxiv.org/abs/2104.14294); An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (https://arxiv.org/abs/2010.11929)
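If you’d like to double-check the parameter count above, a quick sanity check (a minimal sketch, assuming only that timm is installed) is to instantiate the architecture and count its parameters:

```python
import timm

# Build the architecture without downloading pretrained weights
model = timm.create_model('vit_base_patch16_224.dino', pretrained=False)

# Count all parameters; should print roughly 85.8M
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.1f}M parameters')
```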
Using the Model for Image Classification
Let’s explore how to classify an image using the ViT model. Imagine you are an artist choosing the perfect colors for your new masterpiece; this model helps you find the best labels for your images!
```python
from urllib.request import urlopen
from PIL import Image
import timm
import torch
# Load the image
img = Image.open(urlopen('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'))
# Create model
model = timm.create_model('vit_base_patch16_224.dino', pretrained=True)
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
# Forward pass of the image through the model
output = model(transforms(img).unsqueeze(0)) # unsqueeze adds a batch dimension
# Get the top 5 predictions
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```
In this snippet we load an image, build the model with its matching transforms, and run a forward pass. torch.topk then returns the five highest-scoring classes, much like an artist deciding on the five most prominent colors for a painting.
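To inspect what came back, you can loop over the returned tensors. Keep in mind that DINO is trained without labels, so mapping indices to ImageNet-1k class names only makes sense when your checkpoint carries a trained classification head; the loop below (a minimal sketch) just prints the raw class IDs and their scores:

```python
# Print the top-5 class IDs with their softmax scores (in percent).
# Map IDs to names with an ImageNet-1k label list if your model
# has a trained 1000-class head.
for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
    print(f'class {idx.item()}: {prob.item():.2f}%')
```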
Using the Model for Image Embeddings
Beyond classification, the ViT can also help you extract unique features from the image—think of it as an artist capturing the essence of their subject in a detailed sketch.
```python
from urllib.request import urlopen
from PIL import Image
import timm
# Load the image
img = Image.open(urlopen('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'))
# Create model for embeddings
model = timm.create_model('vit_base_patch16_224.dino', pretrained=True, num_classes=0) # Removing classifier
model = model.eval()
# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)
# Extract features
output = model.forward_features(transforms(img).unsqueeze(0))  # shape (1, 197, 768): class token + 196 patch tokens
```
In this snippet we create the model with num_classes=0, which removes the classification head entirely. forward_features then returns the full token sequence: one class token plus 196 patch tokens, each a 768-dimensional feature vector—an intricate sketch rather than a handful of chosen colors!
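Often you want a single vector per image rather than the full token sequence. timm models expose forward_head with pre_logits=True, which pools the tokens (the class token, for this ViT) into a (1, 768) embedding. A minimal sketch of pooling and comparing embeddings; the helper name cosine_sim is our own, and comparing two different images assumes you produce both embeddings the same way:

```python
import torch
import torch.nn.functional as F

# Pool the (1, 197, 768) token sequence down to a single (1, 768) vector
embedding = model.forward_head(output, pre_logits=True)

# Compare two images by the cosine similarity of their embeddings;
# emb_a and emb_b are assumed to come from the pipeline above
def cosine_sim(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(emb_a, emb_b, dim=-1)

print(cosine_sim(embedding, embedding))  # identical inputs score 1.0
```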
Troubleshooting
While using the ViT model, you might encounter some challenges. Here are some common issues and solutions:
- Python environment issues: Make sure the necessary packages are installed: timm, Pillow (which provides PIL), and torch. You can install them with pip:
pip install timm Pillow torch
- Image not loading: Check that the image URL is correct and accessible, and remember to pass it as a string. If the URL still fails, try a different image or a local file.
- Model errors: Ensure that the model name is correctly typed and corresponds to an existing model in the timm library.
- Batch size errors: Make sure to unsqueeze your input tensor when passing a single image, since the model expects a batch dimension; see the sketch after this list for batching several images at once.
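As a minimal illustration of batching (the file names below are hypothetical placeholders), stack several transformed images into one tensor before the forward pass:

```python
import torch
from PIL import Image

# Hypothetical local files; substitute your own images
paths = ['image_one.jpg', 'image_two.jpg']
batch = torch.stack([transforms(Image.open(p).convert('RGB')) for p in paths])
print(batch.shape)  # (2, 3, 224, 224)

with torch.no_grad():
    output = model(batch)  # one row of outputs per image
```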
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In conclusion, the ViT model proves to be a powerful ally for image classification and feature extraction. Pretrained with the self-supervised DINO method, it can not only classify images but also provide detailed embeddings. The analogy of an artist choosing colors and capturing essence neatly summarizes what this model does.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

