The PaliGemma model is a versatile and lightweight vision-language model that can handle tasks involving both images and text. In this blog post, we’ll guide you through using PaliGemma, explain the underlying code, and provide troubleshooting tips to ensure your experience is smooth and fruitful.
Accessing PaliGemma on Hugging Face
Before diving in, you need to request access to PaliGemma on Hugging Face, because the model is released under a license you must accept first. To do this:
- Make sure you’re logged into Hugging Face.
- Review Google’s usage license on the model page.
- Click the button on the model page to acknowledge the license.
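Once access is granted, your scripts also need to authenticate before they can download the weights. The following is a minimal sketch, assuming you store an access token in the `HF_TOKEN` environment variable; the helper names `find_hf_token` and `login_if_possible` are hypothetical and only for illustration.

```python
import os


def find_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_if_possible():
    """Log in to the Hugging Face Hub when a token is available.

    Returns True if a login was attempted, False otherwise.
    """
    token = find_hf_token()
    if token is None:
        print("No token found. Set HF_TOKEN or run `huggingface-cli login`.")
        return False
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_if_possible()` once at the start of a script is enough; running `huggingface-cli login` in a terminal is an alternative that stores the token on disk instead.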
Understanding PaliGemma: An Analogy
Imagine you have a powerful chef (the PaliGemma model) in your kitchen. This chef specializes in a unique cuisine that combines elements from various cultures (image and text). They can prepare a dish based on two inputs: a recipe (text) and ingredients (image). However, this chef is not suited for casual dinner conversations (not meant for multi-turn interaction). If you want to get the best out of this chef, you need to teach them specific recipes (fine-tuning for tasks) tailored to your guests’ preferences (your project requirements).
How to Use PaliGemma
PaliGemma can be employed for various tasks including image captioning, visual question answering, and object detection. Let’s cover the steps to utilize this model.
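Each task is selected through a short text prefix in the prompt. The sketch below builds such prompts with a hypothetical helper, `build_prompt`. The `caption es` form is the one used in this post; the `answer` and `detect` prefixes follow the task-prefix convention described on the model card and should be treated as assumptions to verify there.

```python
def build_prompt(task, text="", lang="en"):
    """Build a task-prefixed prompt string for a PaliGemma mix checkpoint.

    task: "caption", "answer", or "detect"
    text: the question (for "answer") or the object name (for "detect")
    lang: language code used by "caption" and "answer"
    """
    task = task.lower().strip()
    if task == "caption":
        return f"caption {lang}"
    if task in ("answer", "detect") and not text.strip():
        raise ValueError(f"task '{task}' needs a non-empty text argument")
    if task == "answer":
        # Assumed prefix for visual question answering.
        return f"answer {lang} {text.strip()}"
    if task == "detect":
        # Assumed prefix for object detection.
        return f"detect {text.strip()}"
    raise ValueError(f"unsupported task: {task}")
```

For example, `build_prompt("caption", lang="es")` returns the same `caption es` prompt used in the snippets in this post.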
1. Basic Setup for CPU (Float32 Precision)
If you are running on a CPU, the following snippet loads the model in the default float32 precision and generates a Spanish caption for a sample image:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)
# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
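The slice `generation[0][input_len:]` is there because `generate` returns the prompt tokens followed by the newly generated ones, so the prompt has to be dropped before decoding. The toy example below shows the same idea with plain Python lists; the token IDs are made up and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(sequence, prompt_len):
    """Return only the tokens that come after the first prompt_len entries."""
    return sequence[prompt_len:]


# Made-up token IDs, standing in for the processed prompt and the model output.
prompt_ids = [101, 102, 103]
new_ids = [7, 8, 9]
full_output = prompt_ids + new_ids  # what generate() would hand back

caption_ids = strip_prompt(full_output, len(prompt_ids))
print(caption_ids)
```

Without this step, the decoded string would start with the prompt text rather than with the caption.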
2. Using CUDA for bfloat16 Precision
If you have a CUDA-capable NVIDIA GPU and want to reduce memory usage, load the model in bfloat16 with the following code:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16"
).eval()
processor = AutoProcessor.from_pretrained(model_id)
# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
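To see why bfloat16 helps, it is enough to compare the bytes needed per parameter. The estimate below is a rough sketch that assumes about 3 billion parameters (inferred from the "3b" in the model name) and counts weights only, ignoring activations and other runtime overhead.

```python
# Storage size of one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: roughly 3 billion parameters, based on the "3b" in the model name.
approx_params = 3_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gib(approx_params, dtype):.1f} GiB")
```

Under these assumptions the weights need roughly 11 GiB in float32 and about half that in bfloat16.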
Troubleshooting Tips
If you encounter issues while using PaliGemma, here are some troubleshooting ideas:
- Ensure that you have installed all the necessary packages: transformers, torch, Pillow (which provides the PIL module), and requests.
- If the model fails to load, confirm that you have accepted the license and are logged in, then double-check the Hugging Face model page for any updates or changes.
- For any unexpected outputs or errors during inference, ensure your input data (both text and image) is properly formatted.
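A quick way to rule out missing dependencies is to check which of the imported modules can be found in the current environment. This is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names chosen for illustration.

```python
import importlib.util

# Maps the module name used in `import` statements to the name used with pip.
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "PIL": "Pillow",
    "requests": "requests",
}


def missing_packages(required=None):
    """Return the pip names of required modules that cannot be found."""
    required = REQUIRED if required is None else required
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

If `missing_packages()` returns a non-empty list, install the listed packages with pip before running the snippets in this post.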

