How to Generate In-Depth Image Descriptions Using AI

Jun 28, 2024 | Educational

Describing images in detail is a practical application of deep learning. In this guide, we use a pre-trained model, gokaygokay/Florence-2-SD3-Captioner, to generate descriptive captions for images with Python. We’ll break it down step by step so that each part is easy to follow.

What You’ll Need

Python 3.x
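Libraries: torch, transformers, datasets, flash_attn, timm, einops, Pillow, requests
Access to a CUDA-enabled GPU (optional, but recommended for better performance)

1. Installing the Required Libraries

First, install the required libraries. The list includes torch, Pillow, and requests because the code below imports them. flash_attn usually expects torch to be present when it is installed, so install torch first. Open your terminal or command prompt and run:

pip install -q torch transformers pillow requests
pip install -q datasets flash_attn timm einops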
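
Before moving on, it can help to confirm that the packages can actually be found. The sketch below is not part of the original tutorial: `missing_packages` is a hypothetical helper that uses the standard library's `importlib.util.find_spec` to report which import names are missing, without importing them. The `REQUIRED` list is an assumption based on the imports used later in this guide (`PIL` is the import name for Pillow).

```python
from importlib.util import find_spec

# Import names used later in this tutorial (assumed; adjust to your setup).
REQUIRED = ["torch", "transformers", "datasets", "flash_attn",
            "timm", "einops", "PIL", "requests"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If anything is reported as missing, rerun the matching pip command before continuing.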

2. Setting Up the Model

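Next, let’s load the model and its processor with the Transformers library. The processor prepares the text and image inputs, and the model generates the caption. The first run downloads the model weights. Note that trust_remote_code=True runs Python code shipped in the model repository, so only use it with repositories you trust.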

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("gokaygokay/Florence-2-SD3-Captioner", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("gokaygokay/Florence-2-SD3-Captioner", trust_remote_code=True)

3. Running an Example

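Now, we will create a function that runs the model on an example image. It takes a task prompt, an input text, and an image. It returns the output of the processor's post-processing step, which for a captioning prompt is expected to be a dictionary that maps the task prompt to the generated caption.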

Think of this process like asking a friend to describe a painting. You give them context (the prompt), they look at the painting (the image), and then they articulate their thoughts (the generated caption).

def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer
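
Because `run_example` returns the post-processed result rather than a bare string, you may want a small helper to pull out the caption. The following is a minimal sketch, assuming the result is a dictionary keyed by the task prompt. `extract_caption` is a hypothetical helper, not part of the Transformers API.

```python
def extract_caption(parsed_answer, task_prompt):
    """Return the caption text from run_example's result.

    Assumes parsed_answer is a dict like {task_prompt: "caption text"};
    any other shape is simply converted with str().
    """
    if isinstance(parsed_answer, dict):
        value = parsed_answer.get(task_prompt)
        if value is None and len(parsed_answer) == 1:
            # The key did not match; fall back to the only value present.
            value = next(iter(parsed_answer.values()))
        if value is None:
            raise KeyError(
                f"No entry for task {task_prompt!r} in {list(parsed_answer)}"
            )
        return str(value).strip()
    return str(parsed_answer).strip()
```

With this in place, `extract_caption(run_example(task, text, image), task)` would give you the caption as plain text, where `task` and `text` are the two strings you pass to `run_example`.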

4. Testing with an Image

For this example, we will use an image hosted online. Here’s how you fetch an image and use your function to generate a description:

from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
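# Fail early on a broken link instead of handing a bad response to PIL
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()
image = Image.open(response.raw)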
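# Assumption: the empty task prompt in this listing looks like a stripped token;
# "<DESCRIPTION>" is assumed here, so confirm it against the model card.
print(run_example("<DESCRIPTION>", "Describe this image in great detail.", image))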
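
If you would rather download the image bytes first and check them before decoding, the sketch below shows one way to do it. It swaps `requests` for the standard library's `urllib.request`, which is a substitution for illustration and not what the tutorial code uses. `fetch_image_bytes` and its 10-second default timeout are hypothetical choices.

```python
import urllib.request


def fetch_image_bytes(url, timeout=10.0):
    """Download raw bytes from url, failing loudly on errors or empty bodies."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    # urllib.error.URLError for connection problems.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not data:
        raise ValueError(f"No data received from {url}")
    return data
```

The bytes can then be opened with `Image.open(io.BytesIO(fetch_image_bytes(url)))` after `import io`, since `Image.open` accepts a file-like object.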

Troubleshooting

Encountering issues while running your code? Here are a few troubleshooting tips:

Ensure that all the libraries are installed correctly. If an import fails, run the installation commands again.
Check whether PyTorch can see your GPU with torch.cuda.is_available(). If you don’t have a GPU, the code will run on the CPU, only more slowly.
Make sure the image URL is accessible; broken links can lead to errors when the image is opened.
If the model fails to load, verify your internet connection and ensure the model’s name is spelled correctly.
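
The GPU check in the tips above can be sketched as a short script. `cuda_status` is a hypothetical helper. It only uses `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it tolerates a missing torch installation so that it can be run before everything is set up.

```python
def cuda_status():
    """Summarize whether PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Report the name of the first visible GPU.
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed, but no CUDA device is visible (running on CPU)"


print(cuda_status())
```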


Conclusion

By following this guide, you should now be able to generate detailed descriptions for images using a pre-trained captioning model.
