Welcome to a step-by-step guide to creating captivating image captions with the fine-tuned PaliGemma model. This model is configured to generate mid-length captions (200 to 350 characters) and is less prone to hallucinations, making its output more reliable and useful.
What You Will Need
- Python (3.8 or higher, as required by recent transformers releases)
- Access to the Internet
- Understanding of basic Python libraries
- Visual Studio Code or any IDE of your choice
Installation Steps
To begin your journey, install the necessary libraries. The model needs a recent transformers build, along with torch, Pillow, and requests. Execute the following commands in your terminal:
pip install git+https://github.com/huggingface/transformers
pip install torch pillow requests
Importing Libraries
Once installed, you can start writing your code. Begin by importing the required libraries:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
Load the Model and Processor
Next, you’ll need to load the model and processor. Think of the model as a chef and the processor as their sous-chef who prepares all the ingredients:
model_id = "gokaygokay/paligemma-rich-captions"
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to(device).eval()
processor = AutoProcessor.from_pretrained(model_id)
Fetch and Prepare Your Image
Now, it’s time to add an image that you want to caption:
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
Generate Captions
Now, you’re ready to create the captivating caption! You will formulate your caption prompt and process the input. Let’s break this down:
prompt = "caption en"  # PaliGemma task prefix followed by a language code
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
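PaliGemma prompts follow a simple pattern: a task prefix followed by a language code, so "caption en" asks for an English caption. A small helper, sketched here with the hypothetical name `build_prompt`, makes that convention explicit and reusable (whether this particular fine-tune responds well to languages other than English is an assumption not tested here):

```python
def build_prompt(task: str = "caption", lang: str = "en") -> str:
    # PaliGemma-style prompts are "<task> <language-code>", e.g. "caption en".
    return f"{task} {lang}"

print(build_prompt())           # caption en
print(build_prompt(lang="es"))  # caption es
```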
In this step, the processor blends the image and prompt into model inputs, much like a chef mixing ingredients before cooking. The model then serves the final dish: your caption!
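Since this fine-tune targets captions of roughly 200 to 350 characters, a quick sanity check on the decoded output can flag captions that fall outside that range. This is a sketch: the bounds come from that stated range, and `caption_in_range` is a name invented for this example:

```python
def caption_in_range(caption: str, lo: int = 200, hi: int = 350) -> bool:
    # Bounds mirror the 200-350 character range this fine-tune targets.
    n = len(caption.strip())
    return lo <= n <= hi

print(caption_in_range("x" * 250))    # True
print(caption_in_range("too short"))  # False
```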
Troubleshooting Tips
If you encounter issues, here are some troubleshooting ideas:
- Ensure that you have a CUDA-compatible GPU and a CUDA-enabled PyTorch build if you want GPU acceleration.
- Check that all libraries are correctly installed and updated.
- If the model or processor doesn’t load correctly, verify the model ID and try pulling the latest version from Hugging Face.
- Confirm that the image URL used is accessible and valid.
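The steps above depend on four libraries, so a quick standard-library-only check using `importlib.util.find_spec` can confirm they are all importable before you debug anything deeper:

```python
import importlib.util

# The import names used in this guide (note: Pillow installs as "PIL").
required = ["transformers", "PIL", "requests", "torch"]
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```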
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
You’ve now learned how to generate image captions effectively with the PaliGemma model! Use this process to enhance your image applications, whether for social media, blogs, or other creative projects.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

