How to Use BLIP-2 with Flan T5-xxl for Image Captioning and VQA

Apr 1, 2024 | Educational

Welcome to the world of AI-powered image analysis! Today, we’ll dive into BLIP-2, a powerful model that utilizes the large language model Flan T5-xxl. Let’s explore how you can effectively leverage this technology for tasks such as image captioning, visual question answering (VQA), and creating chat-like conversations based on images.

Understanding BLIP-2

BLIP-2 is a sophisticated model that consists of three components:

  • A CLIP-like image encoder
  • A Querying Transformer (Q-Former)
  • A large language model (Flan T5-xxl)

Think of the image encoder as an artist who interprets the picture and produces visual features. The Q-Former acts as a translator: it distills those features into a small, fixed set of query embeddings that the language model can understand. The language model, in turn, plays the role of the author, generating the accompanying text. Together, these components predict the next text token given the image and any preceding text.
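
If you want to see these three pieces in code, the Hugging Face implementation of BLIP-2 exposes them as submodules of the loaded model. The attribute names below (vision_model, qformer, language_model) reflect the transformers implementation at the time of writing; treat this as a quick sketch rather than a stable API:

from transformers import Blip2ForConditionalGeneration

# Loading the checkpoint downloads a very large set of weights on first use
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

print(type(model.vision_model).__name__)    # the image encoder
print(type(model.qformer).__name__)         # the Querying Transformer (Q-Former)
print(type(model.language_model).__name__)  # the Flan T5-xxl language model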

How to Use BLIP-2

Here’s a step-by-step guide to using the BLIP-2 model for image captioning and visual question answering:

1. Setting Up the Environment

First, ensure you have the required libraries installed. You can install them using pip:

pip install torch transformers Pillow requests
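
A quick way to confirm the setup before downloading any model weights is to print the library versions:

import torch
import transformers
import PIL
import requests

# If any of these imports fail, the corresponding package is missing
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("Pillow:", PIL.__version__)
print("requests:", requests.__version__)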

2. Running the Model on a CPU

To run the model on a CPU, use the following Python code. Keep in mind that the flan-t5-xxl checkpoint is very large, so CPU inference requires a substantial amount of RAM and will be slow:

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the BLIP-2 processor and model (the checkpoint downloads on first use)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

# Download a demo image
img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Visual question answering: pass the image together with a question
question = "How many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
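
The snippet above performs visual question answering. For plain image captioning, you can pass only the image to the processor and let the model generate an unconditional caption. A minimal sketch that reuses the processor, model, and raw_image defined above (max_new_tokens is an optional, illustrative setting):

# Captioning: no text prompt, only the image
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))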

3. Running the Model on a GPU

When using a GPU, there are several precision options you can choose from:

Full Precision

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# pip install accelerate
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", device_map="auto")

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

question = "How many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
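
By default, generate() keeps its output short. If you want longer or higher-quality answers, you can pass standard generation arguments; the values below are purely illustrative and reuse the inputs from the block above:

# Longer, beam-searched output (illustrative settings)
out = model.generate(**inputs, max_new_tokens=50, num_beams=5)
print(processor.decode(out[0], skip_special_tokens=True))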

Half Precision

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# pip install accelerate
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto")

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

question = "How many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
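
Half precision roughly halves the memory required for the weights compared to full precision. To check how much memory the loaded model actually occupies, transformers provides a footprint helper; this is a rough estimate of parameters and buffers, not a full profile:

# Approximate size of the model's parameters and buffers, in gigabytes
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")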

8-bit Precision

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# pip install accelerate bitsandbytes
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

question = "How many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
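
The introduction also mentioned chat-like conversations based on images. With the Flan T5-xxl variant you can approximate a conversation by folding earlier turns into the text prompt. The "Question: ... Answer:" format below follows the style commonly used in BLIP-2 examples, but the exact wording and the hard-coded earlier answer are only illustrative:

# Fold the previous exchange into the next prompt to keep context
context = "Question: How many dogs are in the picture? Answer: 1."
follow_up = "Question: What is the dog doing? Answer:"
prompt = context + " " + follow_up

inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))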

Troubleshooting Common Issues

  • Issue: Environment Setup Problem – Ensure that you have the correct libraries installed and that your Python version is compatible.
  • Issue: Model Loading Failure – Verify the model path and check for internet connectivity issues if loading from the Hugging Face Hub.
  • Issue: Image URL Not Working – Make sure the image URL is correct and publicly accessible.
  • Issue: Runtime Errors – Read the error messages carefully; they often indicate what went wrong, such as memory issues or incorrect tensor shapes. A quick GPU sanity check is sketched after this list.
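
For GPU-related runtime errors, a quick check of the CUDA setup and available memory often narrows the problem down. A minimal sketch:

import torch

# Confirm that PyTorch can see a GPU at all
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Free and total memory on the current device, in bytes
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")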

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following the steps outlined above, you can effectively utilize BLIP-2 combined with Flan T5-xxl for powerful image captioning and visual question answering tasks. As with any AI model, do remember to evaluate its outputs carefully and ensure it operates fairly and safely.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
