Welcome to this detailed guide on using the PG-InstructBLIP model, a fine-tuned version of InstructBLIP that builds on the Flan-T5-XXL language model for physically grounded vision-language tasks. Developed for robotic manipulation research, this model classifies objects by their physical properties (for example, transparency), helping you reason about common household items through visual interaction.
What is PG-InstructBLIP?
PG-InstructBLIP is designed to improve the understanding of physical object concepts using the PhysObjects dataset, which contains over 36.9K crowd-sourced annotations and 417K automated annotations of common household objects, making it a valuable resource for training vision-language models on physical reasoning.
Installing PG-InstructBLIP
To use PG-InstructBLIP effectively, you’ll need to install the LAVIS library. Follow these steps for installation:
Install the LAVIS library from source (the salesforce-lavis package); a typical source install is sketched below.
Download the PG-InstructBLIP model weights via git-lfs or directly.
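As a rough sketch, installing LAVIS from source usually looks like the following (the repository URL is Salesforce's official one; adjust paths and environments to your setup):

git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .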
Example Usage
Below is a simple example demonstrating how to use PG-InstructBLIP to classify the transparency of a ceramic bowl.
import torch
from PIL import Image
from omegaconf import OmegaConf
from lavis.models import load_model, load_preprocess
from lavis.common.registry import registry
import requests
from generate import generate  # generate() is a helper from the local generate.py distributed with the PG-InstructBLIP example code
# Load image
url = "https://iliad.stanford.edu/pg-vlm/example_images/ceramic_bowl.jpg"
example_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# Load model
vlm = load_model(
    name="blip2_t5_instruct",
    model_type="flant5xxl",
    checkpoint="pgvlm_weights.bin",  # replace with location of downloaded weights
    is_eval=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
# Optionally disable qformer text input
vlm.qformer_text_input = False
# Preprocess image
model_cls = registry.get_model_class("blip2_t5_instruct")
model_type = "flant5xxl"
preprocess_cfg = OmegaConf.load(model_cls.default_config_path(model_type)).preprocess
vis_processors, _ = load_preprocess(preprocess_cfg)
processor = vis_processors["eval"]
# Prepare question
question_samples = {
    "prompt": "Question: Classify this object as transparent, translucent, or opaque? Respond unknown if you are not sure. Short answer:",
    "image": torch.stack([processor(example_image)], dim=0).to(vlm.device)
}
# Generate answers
answers, scores = generate(vlm, question_samples, length_penalty=0, repetition_penalty=1, num_captions=3)  # num_captions=3 requests three candidate answers with scores
print(answers, scores)
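The generate call returns a list of candidate answers alongside their scores. Assuming scores comes back as a torch tensor of log-likelihoods (higher is better), a minimal way to keep only the top-scoring answer is:

best_answer = answers[scores.argmax().item()]  # index of the highest-scoring candidate
print(best_answer)  # e.g. "opaque" for the ceramic bowl

You can probe other physical properties the same way by swapping the prompt string in question_samples.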
This code snippet can be thought of as a chef preparing a dish. Here’s how:
Ingredients Gathering: The image of the ceramic bowl is the main ingredient, gathered and prepared for evaluation.
Kitchen Setup: The model is loaded like setting up the kitchen, ensuring all tools (like the GPU for computation) are ready.
Cooking Process: The model processes the image, akin to a chef mixing the ingredients and asking the right questions to refine the dish (in this case, identifying the transparency of the bowl).
Tasting: Finally, the answers and their confidence scores are the results of our dish, ready to be evaluated for quality!
Troubleshooting
If you encounter issues while setting up or using PG-InstructBLIP, consider the following troubleshooting tips:
Ensure you have installed all required dependencies from the LAVIS library.
Double-check that the model weights have been correctly downloaded and the file path is accurate in your code.
If the model produces unexpected outputs, try re-enabling the Q-Former text input (vlm.qformer_text_input = True) to see if it alters the results.
Make sure your input images are cropped to focus solely on the object for the best classification results; a minimal cropping sketch follows below.
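For the cropping tip above, here is a minimal sketch using PIL (the file name and center-square crop are illustrative; pick a box that isolates your object):

from PIL import Image

image = Image.open("my_object.jpg").convert("RGB")  # hypothetical local image
width, height = image.size
side = min(width, height)
left, top = (width - side) // 2, (height - side) // 2
cropped = image.crop((left, top, left + side, top + side))  # center square crop

The cropped image can then be passed to processor(cropped) in place of the original.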
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In this guide, you’ve learned how to install and use the PG-InstructBLIP model for physically grounded object classification. By grounding vision-language models in physical object concepts, it is a valuable asset for any AI developer or researcher.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
