Welcome to the world of InternVL-Chat-V1-2! This multimodal large language model (MLLM) is designed to bridge the gap between visual and text data, providing an efficient means of handling image-text tasks. In this guide, we will walk you through the process of running this powerful AI model and help you troubleshoot any issues you may encounter.
Quick Start
To begin, ensure you have the appropriate version of `transformers` installed. We recommend `transformers==4.37.2` for optimal performance. Here’s how to load and run the model:
Model Loading
- Using 16-bit (bf16 / fp16):
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL-Chat-V1-2"
# Load the weights in bfloat16; swap in torch.float16 if your GPU lacks bf16 support.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()
```
- Using 8-bit quantization (requires the bitsandbytes package):

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL-Chat-V1-2"
# 8-bit loading keeps the weights quantized on the GPU, roughly halving memory versus bf16.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()
```
- Using multiple GPUs:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    # Spread the language-model layers across all visible GPUs.
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 60  # InternVL-Chat-V1-2 uses a 60-layer LLM; replace if your model differs
    # GPU 0 also hosts the vision encoder, so treat it as half a GPU when dividing layers.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder and the projector on GPU 0 alongside the first LLM layers.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    return device_map

path = "OpenGVLab/InternVL-Chat-V1-2"
device_map = split_model('InternVL-Chat-V1-2')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
```
Using the Model for Inference
Now that you’ve loaded the model, here are some examples of how you can interact with it:
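The chat examples below assume a tokenizer, an image processor, and a generation config have already been created. A minimal setup sketch follows; the processor class and the decoding parameters are reasonable assumptions based on the model card's usual pattern, not fixed requirements:

```python
from transformers import AutoTokenizer, CLIPImageProcessor

# Reuse the same `path` as in the loading snippets above.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Plain dict of decoding options passed to every model.chat() call.
generation_config = dict(max_new_tokens=512, do_sample=False)
```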
Text Conversation
```python
# Pure-text chat: pass None for pixel_values when no image is involved.
question = "Hello, who are you?"
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f"User: {question}")
print(f"Assistant: {response}")
```
Image Interaction
To analyze an image, you will need to preprocess it:
```python
from PIL import Image

# The model expects 448x448 inputs; resize before running the image processor.
image = Image.open('./examples/image2.jpg').resize((448, 448))
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values.to(torch.bfloat16).cuda()

question = "Please describe the image briefly."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f"User: {question}")
print(f"Assistant: {response}")
```
Understanding the Code
Let’s think of the code setup as arranging a team of chefs in a large kitchen:
- Load the Ingredients: You first import the necessary tools (libraries) like `torch` and `transformers`, which are like pots and pans.
- Set up the Chefs: Each chef (model component) is assigned a workstation (device). You need to arrange them wisely across the kitchen (GPUs) so everyone works efficiently together; the sketch after this list shows how to check the final seating plan.
- Cook the Dish: When you run the model (i.e., when the chefs start cooking), you give them instructions (questions and images) to produce a meal (responses).
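If you want to see where each chef ended up, models loaded with a `device_map` expose the final placement through `hf_device_map`. A quick, optional check, assuming the multi-GPU loading snippet above, could look like this:

```python
# Print which device each mapped module or layer was assigned to.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")
```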
Troubleshooting Tips
If you encounter issues while implementing the model, try the following:
- Ensure that you are using the recommended version of the `transformers` library; the snippet after this list shows a quick way to verify your environment.
- Double-check the compatibility of your GPU setup when using multiple GPUs.
- For common out-of-memory errors, try the 8-bit loading option or adjust the `low_cpu_mem_usage` argument during model loading.
- For assisted troubleshooting, explore community insights or collaborate with experts by visiting **[fxis.ai](https://fxis.ai)**.
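As a starting point for that first bullet, a small self-contained check of the library version and the visible GPU memory (nothing assumed beyond the recommended `transformers==4.37.2`) might look like this:

```python
import torch
import transformers

# Confirm the installed transformers version matches the recommended one.
print("transformers:", transformers.__version__)  # expected: 4.37.2

# List the visible GPUs and their total memory, useful when planning bf16 / 8-bit / multi-GPU loading.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} {props.name} {props.total_memory / 1024**3:.1f} GiB")
```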
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

