MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Welcome to our deep dive into MambaVision, a groundbreaking hybrid model for computer vision that cleverly combines the strengths of Mamba and Transformers. In this guide, we’ll explore how to use MambaVision for image classification and feature extraction, complete with examples and troubleshooting tips. Grab your coding hat, and let’s get to work!

Model Overview

MambaVision is designed to enhance visual feature modeling. Imagine a chef blending two distinct techniques to create a new and exquisite dish; likewise, MambaVision combines Mamba formulations with the self-attention of Vision Transformers (ViT). A comprehensive set of ablation studies shows that equipping the final layers of the Mamba architecture with self-attention blocks markedly improves its ability to capture long-range spatial dependencies. The result is a family of hierarchical MambaVision models tailored to different design criteria.
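
To make the hybrid design concrete, here is a minimal PyTorch sketch of a stage that runs Mamba-style mixer blocks first and self-attention blocks last. This is not the official implementation: MixerStub is a hypothetical placeholder for the actual SSM-based MambaVision mixer, and HybridStage only illustrates the block layout described above.

import torch
import torch.nn as nn

class MixerStub(nn.Module):
    """Hypothetical stand-in for the SSM-based MambaVision mixer block."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C)
        return x + self.proj(self.norm(x))     # pre-norm residual block

class SelfAttentionBlock(nn.Module):
    """Standard pre-norm multi-head self-attention over the token sequence."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, C)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out

class HybridStage(nn.Module):
    """Illustrative stage layout: mixer blocks first, attention blocks last,
    so the final layers can capture long-range spatial dependencies."""
    def __init__(self, dim, depth, num_heads=8):
        super().__init__()
        half = depth // 2
        self.blocks = nn.ModuleList(
            [MixerStub(dim) for _ in range(half)] +
            [SelfAttentionBlock(dim, num_heads) for _ in range(depth - half)]
        )

    def forward(self, x):                      # x: (B, N, C)
        for blk in self.blocks:
            x = blk(x)
        return x

# quick shape check: 2 images, 196 tokens (14x14), 128 channels
stage = HybridStage(dim=128, depth=4)
print(stage(torch.randn(2, 196, 128)).shape)   # torch.Size([2, 196, 128])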

Model Performance

MambaVision shines in performance metrics, achieving a new state-of-the-art (SOTA) Pareto front on ImageNet-1K in terms of Top-1 accuracy and image throughput. The full benchmark numbers for each model variant are available in the MambaVision paper and on the official model cards.

Model Usage

To get started with MambaVision, the first step is to install the required packages. You can do so by executing the command below:

pip install mambavision
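
The examples below also rely on torch, transformers, timm, and requests, and they move the model to the GPU with .cuda(), so a CUDA-capable device is assumed. Here is a quick sanity check of the environment (a minimal sketch, nothing MambaVision-specific):

import torch
import transformers
import timm

# The snippets below call .cuda(), so CUDA must be available.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| timm:", timm.__version__)

If CUDA is unavailable, you can adapt the snippets to run on CPU by dropping the .cuda() calls.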

Image Classification

We’ll showcase how MambaVision classifies an image from the COCO validation set into one of the 1,000 ImageNet-1K classes the model was trained on. Below, you’ll find a simple code snippet:

import torch
from transformers import AutoModelForImageClassification
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModelForImageClassification.from_pretrained("nvidia/MambaVision-L-1K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)

input_resolution = (3, 224, 224)  # MambaVision supports arbitrary input resolutions; 224x224 is used here
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()

# model inference (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(inputs)
logits = outputs['logits']
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

Once you run this snippet, the predicted label should be “brown bear, bruin, Ursus arctos”.
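
If you want more than the single top prediction, a small follow-on sketch (reusing logits and model from the snippet above) converts the logits to probabilities and prints the five most likely classes:

# continuing from the snippet above
probs = logits.softmax(dim=-1)               # (1, 1000) class probabilities
top_probs, top_idxs = probs.topk(5, dim=-1)
for p, idx in zip(top_probs[0].tolist(), top_idxs[0].tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")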

Feature Extraction

MambaVision also serves as a remarkable feature extractor. Similar to an artist drawing various colors from their palette, MambaVision can return the output of each of its four stages alongside the final average-pooled features. Below is a code snippet for feature extraction:

import torch
from transformers import AutoModel
from PIL import Image
from timm.data.transforms_factory import create_transform
import requests

model = AutoModel.from_pretrained("nvidia/MambaVision-L-1K", trust_remote_code=True)

# eval mode for inference
model.cuda().eval()

# prepare image for the model
url = 'http://images.cocodataset.org/val2017/000000020247.jpg'
image = Image.open(requests.get(url, stream=True).raw)

input_resolution = (3, 224, 224)  # MambaVision supports arbitrary input resolutions; 224x224 is used here
transform = create_transform(input_size=input_resolution,
                             is_training=False,
                             mean=model.config.mean,
                             std=model.config.std,
                             crop_mode=model.config.crop_mode,
                             crop_pct=model.config.crop_pct)

inputs = transform(image).unsqueeze(0).cuda()

# model inference (gradients are not needed for inference)
with torch.no_grad():
    out_avg_pool, features = model(inputs)
print("Size of the averaged pool features:", out_avg_pool.size())  # torch.Size([1, 640])
print("Number of stages in extracted features:", len(features)) # 4 stages
print("Size of extracted features in stage 1:", features[0].size()) # torch.Size([1, 80, 56, 56])
print("Size of extracted features in stage 4:", features[3].size()) # torch.Size([1, 640, 7, 7])

Troubleshooting

If you run into any hiccups while integrating MambaVision into your projects, here are a few troubleshooting tips:

  • Dependency Issues: Ensure that all required libraries are installed. Re-run the installation command if necessary.
  • Model Loading Errors: Double-check the model name and your network connection; a model unavailable error usually indicates a typo or connectivity issue.
  • Image Processing Problems: Ensure your input images are 3-channel RGB and in a format PIL can read; see the sketch after this list.
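
For that last point, a common fix is converting the image to RGB before applying the transform, since grayscale or RGBA inputs will trip up the three-channel normalization. A minimal sketch, reusing the transform from the snippets above (my_image.jpg is a hypothetical path):

from PIL import Image

# force 3-channel RGB so the normalization in `transform` applies cleanly
image = Image.open("my_image.jpg").convert("RGB")   # hypothetical local file
inputs = transform(image).unsqueeze(0).cuda()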

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

With MambaVision, the possibilities in computer vision have expanded, and we encourage you to explore the depths of this incredible model!
