How to Generate Image Descriptions with Moonline

May 10, 2024 | Educational


In this post, we look at Moonline, a tool that turns images into descriptive text that follows a predefined template. Moonline is a fork of the Moondream2 model that incorporates features from the Outlines API, so you can ask for output in a structure you define. Here is how to set up Moonline and use it to generate detailed descriptions from images.

Getting Started with Moonline

To start using Moonline, follow these simple steps:

Clone the Moonline repository from Hugging Face.
Set up a virtual environment for clean dependencies.
Install the required dependencies from requirements.txt.

Next, run the example.py script, which generates a description and a mood for an image.

Understanding the Code

The core of Moonline can be explained with an analogy. Imagine Moonline as a librarian in a vast library filled with unique books (images). Each book has its own story (the text description), but the story can only be told in a form dictated by a specific genre (the Pydantic model). Here is how the librarian (Moonline) does its job:

The librarian (Moonline) first identifies the book (image) you handed over.
It reads the book (encodes the image) to understand its content.
Finally, based on your instructions (the prompt), it tells you the story in a specific format (JSON). This ensures every description matches your requirements (ExampleModel).

from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline

def main():
    # Allowed values for the "mood" field of the answer.
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    # Template that the generated JSON has to follow.
    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    # The prompt embeds the field names and types of the template.
    prompt = f'Your job is to describe the image. Please answer in json with the following format: {ExampleModel.__annotations__}'
    image_path = "example.png"
    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"
    device = "cpu"  # set to "cuda" to run on a GPU

    # Load the tokenizer and the model weights at the pinned revision.
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(model_id, revision=revision).to(device)
    moonline.eval()  # inference mode

    # Encode the image once, then answer the prompt under the template's constraints.
    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)
    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)

    print(f'answer: {answer}')

if __name__ == "__main__":
    main()
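
Judging by its name, generate_fsm builds a finite state machine from the template, which is then used to restrict what the model may emit at each step. The toy sketch below is not Moonline's or Outlines' implementation. It only illustrates the general idea on a single field: at every step, characters that would take the output outside the allowed mood values are masked out, and the best remaining one is chosen. The helper names allowed_next_chars and constrained_decode, and the score callback that stands in for the model, are hypothetical.

```python
from enum import Enum


class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"


ALLOWED = [m.value for m in Mood]


def allowed_next_chars(prefix):
    """Characters that keep `prefix` a valid start of some allowed value."""
    return {
        value[len(prefix)]
        for value in ALLOWED
        if value.startswith(prefix) and len(value) > len(prefix)
    }


def constrained_decode(score):
    """Greedy decoding in which disallowed characters are never considered.

    `score(prefix, char)` is a stand-in for the model: higher means the
    model would rather emit `char` after `prefix`.
    """
    out = ""
    while out not in ALLOWED:
        candidates = allowed_next_chars(out)
        out += max(candidates, key=lambda c: score(out, c))
    return out


if __name__ == "__main__":
    # A stand-in "model" that simply prefers later letters of the alphabet.
    print(constrained_decode(lambda prefix, char: ord(char)))
```

Whatever the scoring function prefers, the result is always one of the four mood values, which is the property the template is meant to guarantee for the real model's output.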

Example Output

When executed, the script will generate a JSON response that describes the image. For example:

{
    "description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
    "mood": "happy"
}
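
If you want to use the answer in a program rather than just print it, it helps to parse and check it first. The sketch below uses only the standard library and assumes the answer arrives as a JSON string shaped like the example above. The function name parse_answer is hypothetical and not part of Moonline.

```python
import json
from enum import Enum


class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"


def parse_answer(answer):
    """Parse the JSON answer and check it against the expected fields.

    Raises ValueError if the text is not valid JSON, is not an object,
    lacks a field, or carries a mood outside the Mood enum.
    """
    data = json.loads(answer)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"description", "mood"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {
        "description": str(data["description"]),
        "mood": Mood(data["mood"]),  # ValueError for unknown moods
    }


if __name__ == "__main__":
    raw = '{"description": "A cartoon house on a dirt road.", "mood": "happy"}'
    print(parse_answer(raw))
```

A failed check is a signal to retry the generation or adjust the prompt instead of passing a malformed answer further along.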

Troubleshooting

While using Moonline, you may encounter some challenges:

Model Hallucination: Moonline may sometimes fill a field with details that are not present in the image. To mitigate this, try adjusting the prompt or offering an explicit option such as None for fields that may not apply.
Limitations in JSON Output: Moonline is not specifically trained to produce JSON, so consider fine-tuning the model on JSON descriptions for better results.
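
One way to offer such an option is to add an explicit "none" member to the enum and mention it in the prompt. This is a minimal sketch of that idea, not a tested recipe for Moonline. The none member and the build_prompt helper are assumptions made for illustration.

```python
from enum import Enum


class Mood(Enum):
    sad = "sad"
    happy = "happy"
    angry = "angry"
    neutral = "neutral"
    none = "none"  # assumed escape hatch for images with no clear mood


def build_prompt(enum_cls):
    """Build a prompt that lists the allowed values, including the escape hatch."""
    options = ", ".join(member.value for member in enum_cls)
    return (
        "Your job is to describe the image. "
        f"For the mood field, answer with one of: {options}. "
        "Use none if the image does not show a clear mood."
    )


if __name__ == "__main__":
    print(build_prompt(Mood))
```

The same enum can then be used in the Pydantic template, so the model has a valid way to decline instead of being forced to pick a mood.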

Conclusion

Moonline is an image-to-text generation tool that returns descriptions in a structure you define, which makes its output easier to use in downstream code. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions.

With the capability to summarize and describe images, Moonline empowers developers to harness the storytelling aspects of visual data. Start experimenting with this fantastic tool today!
