How to Use the MViT-v2 Model for Image Classification

Feb 14, 2024 | Educational

In the world of artificial intelligence, image classification plays a crucial role in many applications. One of the advanced models available for this purpose is MViT-v2. The variant used in this post, mvitv2_base_cls.fb_inw21k, is pretrained on the large ImageNet-22k dataset, making it a strong starting point for both classification and feature extraction. In this blog, we will explore how to use this powerhouse model for classifying images and for generating image embeddings.

Model Details

Before diving into the code, it helps to know what we are working with: MViT-v2 (Improved Multiscale Vision Transformer) is a hierarchical vision transformer designed for classification and detection, and the checkpoint used throughout this post, mvitv2_base_cls.fb_inw21k, is available through the timm library.
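
Rather than quoting fixed figures that vary between variants, you can query the key specifications (parameter count, expected input size, normalization statistics) directly from timm. A minimal sketch:

python
import timm

# Instantiate the architecture without downloading weights, just to inspect it
model = timm.create_model('mvitv2_base_cls.fb_inw21k', pretrained=False)

# Parameter count in millions
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")

# Expected input size, normalization mean/std, interpolation, and more
print(timm.data.resolve_model_data_config(model))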

How to Use the MViT-v2 Model

Let’s break down the two main functionalities: image classification and image embeddings. Think of this process like using a car for different types of journeys: sometimes you need a smooth ride (classification), while other times you just want to take pictures of the beautiful scenery (embeddings).

1. Image Classification

To classify images using the MViT-v2 model, you will need to execute the following code:

python
from urllib.request import urlopen
from PIL import Image
import timm
import torch

# Load image from URL
img = Image.open(urlopen('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'))

# Load pretrained MViT-v2 model
model = timm.create_model('mvitv2_base_cls.fb_inw21k', pretrained=True)
model = model.eval()

# Get model-specific transformations for normalization and resizing
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

# Run inference: unsqueeze the single image into a batch of 1
output = model(transforms(img).unsqueeze(0))

# Convert logits to percentages and take the top-5 predictions
top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)

Explanation of the Code

Imagine you’re tuning a musical instrument before a concert (the model preparation). The first few steps load your image and the MViT-v2 model and put the model in the right state (evaluation mode). The transformation step is like warming up the instrument: it resizes and normalizes the image to meet the model’s specific requirements. Finally, you’re ready to play (make predictions) and see which tunes come out on top (the top-5 predictions).
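
To inspect the result, you can print the top-5 entries directly. A minimal sketch (note that this checkpoint predicts over the ImageNet-22k label space, so the indices are not the familiar ImageNet-1k ones):

python
# Print the top-5 predictions as (class index, probability in %)
for prob, idx in zip(top5_probabilities[0], top5_class_indices[0]):
    print(f"class {idx.item()}: {prob.item():.2f}%")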

2. Image Embeddings

If you want to extract features from images rather than classify them, follow these steps:

python
from urllib.request import urlopen
from PIL import Image
import timm

# Load image from URL
img = Image.open(urlopen('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'))

# Load pretrained MViT-v2 model with adjusted classifier
model = timm.create_model('mvitv2_base_cls.fb_inw21k', pretrained=True, num_classes=0)  # Remove classifier
model = model.eval()

# Get model-specific transformations for normalization and resizing
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

# Pooled image features: a (1, num_features) shaped tensor
output = model(transforms(img).unsqueeze(0))

# Alternatively, in two explicit steps:
output = model.forward_features(transforms(img).unsqueeze(0))  # unpooled token features
output = model.forward_head(output, pre_logits=True)  # pooled (1, num_features) feature tensor

Understanding Image Embeddings

Extracting image embeddings can be compared to analyzing the nuances of an art piece without putting it on display. The model distills the essential attributes of the image into a feature vector that you can reuse for downstream machine learning tasks, such as similarity search or clustering, without committing to a fixed set of class labels (classification).
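
As a concrete example of what such a vector is good for, the sketch below compares two images by the cosine similarity of their embeddings. It reuses the model and transforms defined above; the second image URL is a hypothetical placeholder, so substitute any valid image link:

python
import torch
import torch.nn.functional as F

def embed(image):
    # Return the pooled (1, num_features) embedding for a PIL image
    with torch.no_grad():
        return model(transforms(image).unsqueeze(0))

# img is the beignets image loaded above; img2 can be any second image
img2 = Image.open(urlopen('https://example.com/another-image.jpg'))  # placeholder URL

similarity = F.cosine_similarity(embed(img), embed(img2))
print(f"Cosine similarity: {similarity.item():.3f}")  # values near 1.0 mean very similar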

Troubleshooting

If you encounter issues during implementation, here are some troubleshooting ideas:

  • Ensure that your environment has the necessary packages installed, such as timm, torch, and Pillow (the package that provides PIL).
  • Check the image URL: loading will fail unless it points to a valid, reachable image.
  • For any unexpected errors, re-check the transformations being applied to the image; they should match the model’s expectations (see the sketch below).
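
As a quick sanity check on that last point, you can compare the transformed tensor’s shape against the input size recorded in the resolved data config. A minimal sketch, reusing model, transforms, and data_config from above:

python
# The resolved config records the input size the checkpoint expects
print(data_config['input_size'])   # e.g. (3, 224, 224)

# The transformed image, once batched, should match that size
x = transforms(img).unsqueeze(0)
print(x.shape)                     # expected: torch.Size([1, 3, 224, 224])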

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By leveraging the MViT-v2 model, you can perform efficient image classification and feature extraction with ease. Remember, understanding how the model works and adapting it to your needs empowers you to unlock its full potential.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
