How to Use InternVL-14B for Image-to-Text and Text-to-Image Retrieval

Mar 10, 2024 | Educational

In artificial intelligence, models that can handle image and text data together are quickly becoming essential tools. One such model is **InternVL-14B**, a vision-language foundation model well suited to image-to-text and text-to-image retrieval tasks. In this blog, we will walk you through how to use the InternVL-14B model effectively in your projects. Let’s dive in!

What You Need to Know

InternVL-14B is characterized by:

  • Model Type: A fine-tuned retrieval model
  • Supported Tasks: Image-text retrieval
  • Parameters: 14 billion
  • Image Size: 364 x 364

To better understand the model’s capabilities, think of it like a highly skilled librarian capable of finding exact books and images based on input descriptions. Just as a librarian can quickly navigate rows of shelves using keywords, InternVL-14B matches images and text efficiently.
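The librarian analogy comes down to comparing vectors: a retrieval model maps images and texts into a shared embedding space and ranks candidates by how similar they are to the query. The sketch below illustrates that idea in plain Python with made-up three-dimensional vectors. The numbers and the `cosine` and `rank_by_cosine` helpers are hypothetical and are not produced by, or part of, InternVL-14B.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Made-up embeddings for illustration only (real embeddings are far larger).
text_query = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],  # image 0
    [1.0, 0.2, 0.0],  # image 1
    [0.5, 0.5, 0.5],  # image 2
]

ranking = rank_by_cosine(text_query, image_embeddings)
print(ranking)
```

Text-to-image retrieval ranks images against a text query as above; image-to-text retrieval swaps the roles and ranks texts against an image query.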

Setting Up InternVL-14B

Follow these steps to set up the InternVL-14B model:

  • Ensure you have InternVL’s GitHub repository cloned to access model files.
  • Install the required packages using pip (the PIL module is provided by the pillow package, so install pillow rather than PIL):
    pip install transformers torch pillow
  • Download the model weights.
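Before loading a 14-billion-parameter model, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the list of module names is an assumption based on the imports used later in this post.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names used by the code in this post; Pillow is imported as "PIL".
required = ["torch", "transformers", "PIL"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

This only checks that the modules can be found. It does not verify versions or GPU support.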

Using the Model

The model usage involves a few key coding steps. Here’s how to go about it:

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL-14B-Flickr30K-FT-364px",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternVL-14B-Flickr30K-FT-364px")
tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL-14B-Flickr30K-FT-364px",
    use_fast=False,
    add_eos_token=True
)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open("./examples/image1.jpg").convert("RGB"),
    Image.open("./examples/image2.jpg").convert("RGB"),
    Image.open("./examples/image3.jpg").convert("RGB"),
]
prefix = "summarize:"
texts = [
    prefix + "a photo of a red panda",  # English
    prefix + "",  # Chinese
    prefix + ""   # Japanese
]

pixel_values = image_processor(images=images, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

input_ids = tokenizer(texts, return_tensors="pt", max_length=80, truncation=True, padding="max_length").input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(image=pixel_values, text=input_ids, mode="InternVL-C")
probs = logits_per_image.softmax(dim=-1)

# InternVL-G
logits_per_image, logits_per_text = model(image=pixel_values, text=input_ids, mode="InternVL-G")
probs = logits_per_image.softmax(dim=-1)

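In this code snippet, you load the model, image processor, and tokenizer, prepare three images and three text prompts, and then run the model in two modes, InternVL-C and InternVL-G. Each call returns similarity logits between every image and every text. Applying softmax along the last dimension of logits_per_image gives, for each image, a probability for each candidate text.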
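To turn the scores into retrieval results, each row of `logits_per_image` can be read as one image scored against every text, and each column as one text scored against every image. The following is a minimal sketch of that reading in plain Python. The 3 x 3 score matrix is invented for illustration, and `softmax`, `best_match_per_row`, and `transpose` are hypothetical helpers, not part of the InternVL API. With real outputs, a nested list could be obtained with `logits_per_image.float().tolist()`.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def best_match_per_row(scores):
    """For each row, return (index of the highest score, its softmax probability)."""
    results = []
    for row in scores:
        probs = softmax(row)
        best = max(range(len(row)), key=lambda i: probs[i])
        results.append((best, probs[best]))
    return results

def transpose(matrix):
    """Swap rows and columns so texts become rows and images become columns."""
    return [list(col) for col in zip(*matrix)]

# Invented scores: rows are images, columns are texts.
logits_per_image = [
    [5.0, 1.0, 0.5],
    [0.2, 4.0, 1.0],
    [0.1, 0.3, 3.0],
]

image_to_text = best_match_per_row(logits_per_image)             # best text per image
text_to_image = best_match_per_row(transpose(logits_per_image))  # best image per text
print([idx for idx, _ in image_to_text])
print([idx for idx, _ in text_to_image])
```

Sorting each row instead of taking only the maximum would give a full top-k ranking rather than a single best match.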

Troubleshooting Tips

If you run into issues, consider the following checks:

  • Ensure all libraries are up to date and installed correctly.
  • Make sure that image paths are accurate and the images exist.
  • Confirm that you set tokenizer.pad_token_id = 0 right after loading the tokenizer, as omitting this might yield unexpected results.
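The image-path check above can be automated before any image is opened. This is a small sketch using the standard library; `find_missing_files` is a hypothetical helper, and the paths are placeholders to replace with your own.

```python
from pathlib import Path

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Hypothetical paths; replace them with the images you intend to load.
image_paths = [
    "./examples/image1.jpg",
    "./examples/image2.jpg",
    "./examples/image3.jpg",
]
missing = find_missing_files(image_paths)
for path in missing:
    print(f"Image not found: {path}")
```

Running this first gives a clear message per missing file instead of an exception partway through loading.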


Final Thoughts

Using the InternVL-14B model can strengthen your image-text retrieval pipeline. Whether you are fetching relevant images for a text-based query or finding the best-matching text for an image, InternVL provides a capable and efficient solution.
