How to Use InternVL 2.0 for Multimodal Tasks

Oct 28, 2024 | Educational

Welcome to your ultimate guide on using InternVL 2.0, a powerful multimodal language model that leverages both vision and language for impressive AI capabilities. With the release of the InternVL2-8B model, you can unlock cutting-edge performance for various tasks, including image and video understanding. Let’s dive into how you can get started.

Getting Started with InternVL 2.0

Before you begin utilizing this AI marvel, ensure you have the required Python environment set up. You’ll need PyTorch and the transformers library, version 4.37.2 (the version the model card pins). Here’s how to install it:

pip install transformers==4.37.2
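
Before loading the model, it’s worth confirming the environment matches what the examples below assume; a quick check:

import torch
import transformers

print(transformers.__version__)   # expect 4.37.2
print(torch.cuda.is_available())  # the examples below assume a CUDA GPU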

Loading the Model

You have various options for loading the InternVL2-8B model, depending on your hardware setup. Let’s explore these methods:

1. 16-bit and 8-bit Quantization: A Simple Approach

For those using compatible GPUs, here’s how to load the model in 16-bit (bfloat16):

import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL2-8B"
# Load the model in bfloat16 on a single GPU; use_flash_attn=True enables
# FlashAttention when the flash-attn package is available.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
# The tokenizer is needed for the chat examples later on.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
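
The heading also promises 8-bit loading. Per the model card, bitsandbytes 8-bit quantization is a one-flag change (install the bitsandbytes package first):

import torch
from transformers import AutoModel

path = "OpenGVLab/InternVL2-8B"
# load_in_8bit quantizes the weights and places them on the GPU,
# so the trailing .cuda() call is omitted here.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()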

2. Multi-GPU Setup: Avoid Errors

If you’re using multiple GPUs, a per-layer device map avoids device-mismatch errors during generation. The helper below (adapted from the model card) spreads the language-model layers across GPUs while pinning the vision encoder and the embedding/output modules to GPU 0:

import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {'InternVL2-8B': 32}[model_name]

    # Spread the language-model layers evenly across the available GPUs.
    num_layers_per_gpu = math.ceil(num_layers / world_size)
    for i in range(world_size):
        for j in range(num_layers_per_gpu):
            layer_idx = i * num_layers_per_gpu + j
            if layer_idx < num_layers:
                device_map[f'language_model.model.layers.{layer_idx}'] = i

    # Keep the vision encoder, projector, embeddings, and output head on
    # GPU 0 so image features and logits end up on the same device.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    return device_map

path = "OpenGVLab/InternVL2-8B"
device_map = split_model('InternVL2-8B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map).eval()
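
When a device_map is passed, transformers records the final placement on the loaded model; a quick way to sanity-check where each module landed:

# Print which GPU each mapped module was assigned to.
for module, device in model.hf_device_map.items():
    print(f"{module} -> cuda:{device}")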

Performing Inference with InternVL 2.0

Now that you have the model loaded, it’s time to perform some inference! Here we offer a few examples to illustrate how you can interact with the model:

Single-Image Conversation

Want to have a conversation about an image? Here’s how:

import torch

# `model` and `tokenizer` come from the loading step above; `load_image`
# is the preprocessing helper sketched after this example.
image_path = 'examples/image1.jpg'
pixel_values = load_image(image_path).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)
question = '<image>\nPlease describe the image shortly.'

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f"User: {question}\nAssistant: {response}")

Multi-Image Dialogue

What if you have multiple images to analyze? The example below shows how to do that:

pixel_values1 = load_image('examples/image1.jpg').to(torch.bfloat16).cuda()
pixel_values2 = load_image('examples/image2.jpg').to(torch.bfloat16).cuda()
# Concatenate the tile batches; the combined set is treated as one image input.
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = '<image>\nDescribe the two images in detail.'

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f"User: {question}\nAssistant: {response}")

Understanding the Code with an Analogy

Think of loading and interacting with the InternVL model like working in a multi-level library. The model’s weights and layers are the reference materials, stored in various compartments (GPUs). Instead of wandering the stacks looking for books (tensors), you have a librarian (the device map) who routes each request to the correct section, ensuring that every interaction is seamless.

Troubleshooting

While working with the InternVL 2.0 model, you may encounter a few hiccups. Here are some common issues and solutions:

  • Import Errors: If you face any import-related issues, ensure that all required libraries are installed. Check and install missing packages with:
    pip install -r requirements.txt
  • Out-of-Memory Errors: If you’re running out of GPU memory, consider 8-bit quantization via load_in_8bit=True (shown earlier) or splitting the model across multiple GPUs with the device map above.
  • Chat Performance: If generation seems slow, check GPU utilization, keep use_flash_attn=True enabled where supported, and consider lowering max_new_tokens in generation_config.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With InternVL 2.0, you can harness powerful multimodal capabilities that bridge vision and language seamlessly. Whether you’re describing images, answering questions, or conducting complex analyses, the model stands ready to assist. Keep experimenting and leverage the features for your advanced AI projects!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
