How to Use InternVL-Chat-V1-2-Plus for Multimodal Interaction

Welcome to a guide on using the powerful InternVL-Chat-V1-2-Plus, a multimodal large language model that excels in image-text interactions. Get ready to explore its features by following our user-friendly instructions!

Understanding InternVL-Chat-V1-2-Plus

Before diving into practical applications, let’s understand what makes this model tick. Imagine a smartphone assistant that can not only respond to your text but also understand the photos you show it. InternVL-Chat-V1-2-Plus works in much the same way: it pairs text and imagery and responds meaningfully to both!

  • Model Type: Multimodal large language model (MLLM)
  • Parameters: 40 billion
  • Image Size: 448 x 448
  • SFT Dataset: approximately 12 million SFT samples

Quick Start Guide

Here’s how to load and run InternVL-Chat-V1-2-Plus for various tasks:

Model Loading

Using Python, you can easily load the model:

import torch
from transformers import AutoTokenizer, AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-2-Plus'

# Load the weights in bfloat16 to cut memory use; trust_remote_code is required
# because the chat interface lives in the repository's custom modeling code.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()
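
Note that a 40-billion-parameter model in bfloat16 needs on the order of 80 GB of GPU memory, so it may not fit on a single card. The sketch below shows two common alternatives using standard transformers loading options (sharding across GPUs with device_map='auto', or 8-bit quantization via bitsandbytes); treat it as a starting point rather than the model card's official recipe, and adjust to your hardware.

# Alternative 1 (assumes several GPUs are visible): shard the weights automatically.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto'
).eval()

# Alternative 2 (assumes bitsandbytes is installed): load the weights in 8-bit.
model = AutoModel.from_pretrained(
    path,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()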

Conducting Inference

Let’s dive into some real use cases:

Text Conversation

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Decoding settings expected by the repository's chat() helper.
generation_config = dict(max_new_tokens=512, do_sample=False)

question = "Hello, who are you?"
# Pass None for pixel_values to run a text-only conversation.
response, history = model.chat(tokenizer, None, question, generation_config, return_history=True)
print(f'User: {question}')
print(f'Assistant: {response}')
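
Because return_history=True hands back the running conversation, you can continue chatting by passing history into the next call. A minimal follow-up sketch, assuming the chat() helper accepts a history keyword as in the snippet above:

follow_up = "Can you summarize your previous answer in one sentence?"
response, history = model.chat(tokenizer, None, follow_up, generation_config,
                               history=history, return_history=True)
print(f'User: {follow_up}')
print(f'Assistant: {response}')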

Image Interaction

  • Single Image: ask the model to describe or answer questions about one image.
  • Multi Image: ask about several images at once, for example their similarities and differences (see the sketch after the single-image example below).

Here’s how you can do it with a single image:

from PIL import Image
from transformers import CLIPImageProcessor

# The image processor handles normalization; resize to the model's 448 x 448 input size.
image_processor = CLIPImageProcessor.from_pretrained(path)
image = Image.open('examples/image.jpg').resize((448, 448))
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values.to(torch.bfloat16).cuda()

question = "Please describe the image shortly."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')
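
For multi-image questions, a common pattern with this model family is to concatenate the per-image pixel tensors along the batch dimension before calling chat(). A minimal sketch, assuming two hypothetical files examples/image1.jpg and examples/image2.jpg:

image1 = Image.open('examples/image1.jpg').resize((448, 448))
image2 = Image.open('examples/image2.jpg').resize((448, 448))
pixel_values1 = image_processor(images=image1, return_tensors='pt').pixel_values.to(torch.bfloat16).cuda()
pixel_values2 = image_processor(images=image2, return_tensors='pt').pixel_values.to(torch.bfloat16).cuda()
# Stack both images into one batch so the model sees them in a single turn.
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = "Describe the similarities and differences between the two images."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}')
print(f'Assistant: {response}')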

Troubleshooting

If you encounter any issues while using InternVL-Chat-V1-2-Plus, consider the following:

  • Ensure you have the right version of transformers installed. We recommend transformers==4.37.2.
  • Verify that your input images are correctly formatted and accessible.
  • During multi-GPU usage, make sure input tensors are on the same device as the model to prevent device-mismatch errors (see the sketch below).
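
A minimal device-alignment sketch, assuming a single process; next(model.parameters()).device simply reports where the (first) model weights live:

# Move inputs to whatever device the model's parameters are on.
device = next(model.parameters()).device
pixel_values = pixel_values.to(device)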

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
