How to Use BLIP-2 for Image Captioning and Visual Question Answering

Mar 23, 2024 | Educational

Welcome, AI enthusiasts! Today, we’re diving into the fascinating world of the BLIP-2 model, paired here with the 2.7-billion-parameter OPT language model. Whether you want to generate captions for images or answer questions based on visual content, BLIP-2 is your ticket to the exciting domain of image-to-text interaction. Let’s break down how to use BLIP-2 and troubleshoot any potential hiccups along the way.

What is BLIP-2?

BLIP-2, short for Bootstrapping Language-Image Pre-training, is a model designed to bridge the gap between language and visual content. It comprises three key components: an image encoder, a Querying Transformer (Q-Former), and a large language model. The trick is that the image encoder and the language model are kept frozen; only the lightweight Q-Former is trained to connect them, turning visual features into query embeddings that the language model conditions on when predicting the next text token from the image and any preceding text.

BLIP-2 Architecture
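
To make these three components concrete, you can peek at the submodules of the Hugging Face implementation. This is just an illustrative sketch: it assumes the vision_model, qformer, and language_model attribute names exposed by Blip2ForConditionalGeneration in recent transformers releases, and it downloads the full checkpoint.

from transformers import Blip2ForConditionalGeneration

# Downloads the full BLIP-2 + OPT-2.7B checkpoint (several GB).
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def count_params(module):
    # Total parameter count of one submodule.
    return sum(p.numel() for p in module.parameters())

print(f"Image encoder (ViT):       {count_params(model.vision_model):,}")
print(f"Querying Transformer:      {count_params(model.qformer):,}")
print(f"Language model (OPT-2.7B): {count_params(model.language_model):,}")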

Steps to Get Started

Let’s walk through how to implement BLIP-2 in different scenarios.

1. Running the Model on CPU

To run BLIP-2 on your CPU, load the processor and model, fetch an image, and generate an answer:

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
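
The same pipeline also covers plain image captioning: pass only the image, with no question, and the model generates a caption on its own. A short variation that reuses the processor, model, and raw_image from the block above:

# Captioning: no text prompt, just the image.
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())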

2. Running the Model on GPU

When utilizing a GPU, the code is almost identical; the main changes are loading the model onto the GPU with device_map="auto" and, optionally, lowering the precision to reduce memory usage:

In Full Precision

# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
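
If you want to confirm where accelerate actually placed the weights when you pass device_map="auto", the loaded model carries an hf_device_map attribute in recent transformers versions. This is a quick optional check, not a required step:

# Maps each top-level module to the device it was dispatched to.
print(model.hf_device_map)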

In Half Precision (`float16`)

# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

In 8-bit Precision (`int8`)

# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
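
To see how much the lower-precision variants actually save, you can compare memory usage directly. A minimal sketch that reuses whichever model you loaded above; get_memory_footprint() is available on transformers models in recent releases:

import torch

# Approximate size of the model's weights and buffers in memory.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

# Peak GPU memory actually allocated by PyTorch so far.
if torch.cuda.is_available():
    print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")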

Understanding the Code through Analogy

Think of the BLIP-2 model as a highly skilled chef in a kitchen. In this kitchen, the raw ingredients (images) are processed with precision by the chef (the model) using carefully tuned tools (the processor and model). The chef predicts the dish (text output) based on the ingredients provided (the image) and the recipe (previous text). Just as an experienced chef may need various pans (memory configurations) to create complex dishes, our model may adapt its requirements depending on the presentation style (precision) chosen—be it a simple platter (CPU) or an intricate dish on fine china (GPU).

Troubleshooting Common Issues

As with any tech endeavor, you may encounter issues along your journey. Here are some troubleshooting tips:

  • Problem: Model fails to produce output. Solution: Ensure the image URL is correct and that the image downloads in a format PIL can open (the examples convert it to RGB for this reason).
  • Problem: Memory limitations when running on a GPU. Solution: Reduce the model precision (to `float16` or `int8`) to lessen the memory load, as shown in the sections above.
  • Problem: Unexpected results or errors in outputs. Solution: Check the input data for consistency and make sure the question matches the image content; a more explicit prompt format can also help, as shown in the sketch after this list.
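
If answers come back empty or off-topic, formatting the question the way the OPT-based checkpoints expect often helps. A small sketch reusing the processor, model, and raw_image from the earlier examples; the explicit "Question: ... Answer:" prompt style and the max_new_tokens value are suggestions rather than requirements:

# A more explicit prompt in the "Question: ... Answer:" style.
prompt = "Question: how many dogs are in the picture? Answer:"
inputs = processor(raw_image, prompt, return_tensors="pt")
# If your model runs on the GPU, move the inputs as in the GPU examples above.
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True).strip())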

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Now go ahead and unleash your creativity with BLIP-2! Happy coding!
