How to Use the InternVL-14B-FlickrCN-FT-364px Model for Image-to-Text and Text-to-Image Retrieval

Mar 12, 2024 | Educational

As the landscape of AI continues to evolve, image-text alignment models like InternVL-14B-FlickrCN-FT-364px are paving the way for advanced visual-linguistic tasks. With 14 billion parameters and state-of-the-art retrieval performance, this model is designed to handle cross-modal retrieval with remarkable efficiency. In this guide, we’ll walk you through everything you need to get started with InternVL, along with some troubleshooting tips for your journey.

Understanding InternVL

To grasp what InternVL can do, think of it as a highly skilled multilingual librarian in an expansive library. Because it understands both images and text across languages, it can fetch exactly what you’re looking for: a photo matching a description, or a description matching a photo. Instead of books, though, it deals with millions of images and texts, finding and connecting whichever ones you need.
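Under the hood, this kind of retrieval boils down to embedding images and texts into a shared vector space and ranking candidates by similarity. Here is a purely illustrative sketch with toy 4-dimensional vectors; the embeddings and filenames below are made up, and real InternVL embeddings are learned and far higher-dimensional:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical image embeddings (stand-ins for what the model would produce)
image_embeddings = {
    "red_panda.jpg": [0.9, 0.1, 0.0, 0.2],
    "two_cats.jpg":  [0.1, 0.8, 0.3, 0.0],
}

# Hypothetical embedding of the text query "a photo of a red panda"
query = [0.85, 0.15, 0.05, 0.1]

# Retrieval = pick the image whose embedding is most similar to the query
best = max(image_embeddings,
           key=lambda name: cosine_similarity(query, image_embeddings[name]))
print(best)  # red_panda.jpg
```

The real model does exactly this ranking step, only with embeddings it computes itself from pixels and tokens.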

Model Details

  • Model Type: Vision-language model fine-tuned for image-text retrieval
  • Parameters: 14B
  • Image Size: 364 x 364
  • Fine-tune Dataset: FlickrCN

Getting Started with the Model

Here’s how to set up the InternVL-14B-FlickrCN-FT-364px model in your Python environment:

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor, AutoTokenizer

# Load the model
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL-14B-FlickrCN-FT-364px",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).cuda().eval()

# Load the image processor
image_processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternVL-14B-FlickrCN-FT-364px")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL-14B-FlickrCN-FT-364px",
    use_fast=False,
    add_eos_token=True
)
tokenizer.pad_token_id = 0  # Set pad_token_id to 0

images = [
    Image.open("./examples/image1.jpg").convert("RGB"),
    Image.open("./examples/image2.jpg").convert("RGB"),
    Image.open("./examples/image3.jpg").convert("RGB")
]

prefix = "summarize:"
texts = [
    prefix + "a photo of a red panda",  # English
    prefix + "一张熊猫的照片",  # Chinese: "a photo of a panda"
    prefix + "二匹の猫の写真"  # Japanese: "a photo of two cats"
]

pixel_values = image_processor(images=images, return_tensors="pt").pixel_values
input_ids = tokenizer(texts, return_tensors="pt", max_length=80, truncation=True, padding="max_length").input_ids.cuda()

# Perform image-text retrieval
logits_per_image, logits_per_text = model(image=pixel_values, text=input_ids, mode="InternVL-C")
probs = logits_per_image.softmax(dim=-1)
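The resulting logits_per_image is an image-by-text score matrix; applying softmax over the text dimension turns each row into a probability distribution, and the argmax of each row picks the best-matching caption for that image. Here is a dependency-free sketch of that interpretation step, using made-up logits as stand-ins for real model output:

```python
import math

def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for 3 images x 3 texts (NOT real model output)
logits_per_image = [
    [9.0, 1.0, 0.5],
    [0.5, 8.5, 1.0],
    [1.0, 0.5, 9.5],
]

probs = [softmax(row) for row in logits_per_image]     # each row sums to 1
matches = [row.index(max(row)) for row in probs]       # best text per image
print(matches)  # [0, 1, 2]: image i best matches text i
```

With the real model you would do the same with `probs.argmax(dim=-1)` on the returned tensor.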

Breaking Down the Code with an Analogy

Consider this code like a chef preparing a special meal in a five-star restaurant:

  • Importing Ingredients: Just as a chef gathers necessary ingredients (images and texts), you begin by importing libraries such as torch, PIL, and transformers.
  • Assembling the Kitchen: Loading the model sets up your kitchen—ensuring everything is ready to go for the cooking process.
  • Preparing the Ingredients: The images and texts represent the raw ingredients which need to be properly processed to yield a delicious outcome.
  • Cooking: The model combines the processed images and texts to produce the finished dish—here, a matrix of probabilities telling you how well each image aligns with each description.

Troubleshooting Tips

If you encounter any issues while using the model, here are some troubleshooting ideas:

  • Ensure that each text carries the summarize: prefix and that tokenizer.pad_token_id is set to 0. Omitting either can produce abnormal retrieval results.
  • Check the format and path of your image files. Ensure they are accessible and correctly formatted (e.g., RGB).
  • If you run into out-of-memory errors, consider processing fewer images per batch, using smaller images, or adjusting resource allocation in your environment.
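For the file-path issue in particular, a quick pre-flight check can save a confusing PIL traceback. A small sketch using only the standard library; the paths are the hypothetical ones from the snippet above:

```python
from pathlib import Path

def check_images(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).is_file()]

# Hypothetical example paths, matching the loading snippet above
paths = ["./examples/image1.jpg", "./examples/image2.jpg", "./examples/image3.jpg"]

missing = check_images(paths)
if missing:
    print("Missing files:", missing)
```

Running this before Image.open makes a bad path fail loudly and legibly instead of deep inside the pipeline.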

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

This guide should empower you to dive into the InternVL model with relative ease. Connecting images and text opens up a wide range of possibilities for multimedia interaction. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
