How to Generate Descriptive Text for Images Using Transformers

Oct 28, 2024 | Educational


In the world of artificial intelligence, bridging the gap between text and images has always been a fascinating challenge. Luckily, the Transformers library offers tools built for exactly this task. In this article, we’ll explore how to use the transformers library, its LlavaForConditionalGeneration class, and the Pixtral model to describe images.

Getting Started with the Pixtral Model

Before diving into coding, ensure you have the right version of Transformers installed. You’ll need v4.45 or later, or an install from source. Here’s how to set up your coding environment:

Install Required Libraries: You need Python along with Pillow (imported as PIL), transformers, and PyTorch, which the example relies on for return_tensors="pt".
Use the Mistral Community Model: This example uses the model ID mistral-community/pixtral-12b.
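
Before loading the model, it can help to confirm that the installed version is new enough. The sketch below is one way to do that. The helper names meets_minimum and check_transformers are illustrative assumptions, not part of the Transformers API, and the version parsing is deliberately simple.

```python
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, minimum: tuple = (4, 45)) -> bool:
    """Return True if a version string such as '4.45.0' is at least `minimum`.

    Hypothetical helper: only the leading major/minor numbers are compared,
    so suffixes such as '.dev0' are ignored.
    """
    parts = []
    for piece in installed.split(".")[: len(minimum)]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "4" so the tuples compare cleanly.
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= tuple(minimum)


def check_transformers() -> str:
    """Describe whether the installed transformers package looks recent enough."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} meets the 4.45 minimum"
    return f"transformers {installed} is older than 4.45; upgrade or install from source"


print(check_transformers())
```

If the check reports an older version, upgrading the package or installing from source, as described above, is the usual next step.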

Your First Code Snippet

Here’s a sample snippet that generates descriptive text for images:

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
    "https://picsum.photos/id/2374/100/300",
    "https://picsum.photos/id/2312/100/300",
    "https://picsum.photos/id/2750/500",
    "https://picsum.photos/id/1715/060",
]
# One [IMG] placeholder per entry in IMG_URLS
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

# Keep the inputs on the same device as the model to avoid a device mismatch
inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)

Understanding the Code

Imagine you’re a guide showing a group of tourists through a bustling art gallery. Each artwork you pass by needs a detailed explanation, and this code does just that, but for images on a digital platform!

The Tour Begins: It starts with importing the essential libraries, like bringing your map and camera before setting off on an adventure.
The Gallery Curator: The model and processor act like your knowledgeable guide, trained to provide insights into each artwork (or image).
Images on Display: The URL list of images represents the gallery exhibitions. Just as every exhibition is carefully curated, these image URLs are the subjects of your descriptions.
Prompting the Guide: The input prompt is akin to your guiding questions that encourage the curator (model) to share what each artwork is about.
Generating Insights: The process of generating text outputs is where the magic happens—like the guide painting a vivid picture of the scene before you!
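
The prompt is where the images get tied to the text. Here is a minimal sketch of building such a prompt programmatically, assuming the Pixtral processor expects one [IMG] placeholder per image inside the [INST] block. The build_prompt helper is hypothetical, and the placeholder convention is an assumption that should be checked against the model card.

```python
def build_prompt(instruction: str, num_images: int) -> str:
    """Build an instruction prompt with one [IMG] slot per image.

    Assumption: the processor matches each [IMG] placeholder to one image,
    in order. This helper is illustrative and not part of transformers.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = "[IMG]" * num_images
    return f"<s>[INST]{instruction}\n{placeholders}[/INST]"


prompt = build_prompt("Describe the images.", 4)
print(prompt)
```

Generating the placeholders from the image count keeps the prompt and the image list in step when the list changes.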

Example Output

Your output might look something like this:

Image 1: A black dog on a wooden floor.
Image 2: A stunning mountain view.
Image 3: Sunset at the beach.
Image 4: A lush garden path.
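
With decoder-style models, generate usually returns the prompt tokens followed by the newly generated ones, so the decoded string may begin with the prompt itself. The sketch below shows the idea of trimming the prompt using plain Python lists. The token ids are made up for illustration, and with real tensors the equivalent slice would be generate_ids[:, prompt_len:].

```python
from typing import List


def strip_prompt_tokens(generated: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first `prompt_len` tokens from every generated sequence."""
    return [row[prompt_len:] for row in generated]


# Made-up token ids standing in for a tokenized prompt and its continuation.
prompt_ids = [1, 3, 5578, 4]
generated_ids = [prompt_ids + [1098, 4492, 2]]

new_tokens = strip_prompt_tokens(generated_ids, len(prompt_ids))
print(new_tokens)
```

Decoding only the trimmed ids would leave just the model's description, without the echoed instruction.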

Formatting with a Chat Template

In case you want to enrich your interactions or simulate a chat interface, here’s how to structure the chat template:

url_dog = "https://picsum.photos/id/2372/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

chat = [
    {"role": "user", "content": [
        {"type": "text", "content": "Can this animal"},
        {"type": "image"},
        {"type": "text", "content": "live here?"},
        {"type": "image"}
    ]}
]
prompt = processor.apply_chat_template(chat)

inputs = processor(text=prompt, images=[url_dog, url_mountain], return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
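
A common slip with chat-style inputs is passing a different number of images than the chat has image entries. The sketch below counts the {"type": "image"} entries so the two can be compared before calling the processor. The helper names are hypothetical and the check is a convenience, not something the library requires you to write.

```python
def count_image_slots(chat: list) -> int:
    """Count the {"type": "image"} entries across all messages in a chat."""
    return sum(
        1
        for message in chat
        for part in message.get("content", [])
        if isinstance(part, dict) and part.get("type") == "image"
    )


def check_images_match(chat: list, images: list) -> None:
    """Raise ValueError if the image list does not match the chat's image entries."""
    slots = count_image_slots(chat)
    if slots != len(images):
        raise ValueError(f"chat has {slots} image entries but {len(images)} images were given")


chat = [
    {"role": "user", "content": [
        {"type": "text", "content": "Can this animal"},
        {"type": "image"},
        {"type": "text", "content": "live here?"},
        {"type": "image"},
    ]}
]

print(count_image_slots(chat))
```

Calling check_images_match(chat, [url_dog, url_mountain]) before building the inputs would surface a mismatch early, with a readable message.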

Troubleshooting & Tips

If you encounter issues during the process, consider the following troubleshooting steps:

Model Not Loading: Ensure the model ID is exactly mistral-community/pixtral-12b and that your Transformers installation is v4.45 or later (or installed from source).
Images Not Loading: Confirm that each image URL is valid, properly formatted, and reachable from your machine.
Text Not Generated Properly: Check your prompt’s structure and syntax; sometimes small errors can affect output.
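
As a first pass on the URL check, the sketch below tests only whether each URL is well formed, without making any network request. The is_well_formed_url helper is a hypothetical name, and a well-formed URL can still point to an image that does not exist.

```python
from urllib.parse import urlparse


def is_well_formed_url(url: str) -> bool:
    """Return True if the string has an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


candidates = [
    "https://picsum.photos/seed/picsum/200/300",
    "picsum.photos/seed/picsum/200/300",  # missing scheme
]
for url in candidates:
    print(url, is_well_formed_url(url))
```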

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Congratulations! You’ve just navigated the nuances of generating textual descriptions for images using the powerful transformers library and Pixtral model. Embrace this new skill, as it opens up countless possibilities in AI!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
