Getting Started with Molmo 7B-D: A Step-by-Step Guide

Oct 28, 2024 | Educational

Welcome to the world of Molmo, where vision and language converge seamlessly! Developed by the Allen Institute for AI, the Molmo family of models, specifically Molmo 7B-D, is designed for captivating image-text interactions.

Step-by-Step Installation and Setup

Before diving into the Molmo 7B-D model, let’s set the stage by installing the necessary dependencies and then walk through the quick start guide.

Installing Dependencies

Begin by installing the required libraries. The quick start below also imports transformers, torch, Pillow, and requests, so make sure those are installed in your environment as well.

pip install einops torchvision

Run the Molmo Model

Now let’s see how you can load the model and generate text based on an image input. Below is a detailed breakdown:

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# load the processor
processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# load the model
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)

# process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate output; maximum 200 new tokens; stop when the <|endoftext|> token is generated
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# only get generated tokens; decode them to text
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# print the generated text
print(generated_text)

Understanding the Code: An Analogy

Think of your computer as a chef preparing a gourmet dish. The ingredients are represented by the various libraries and dependencies you installed, which go into creating the dish—the Molmo model.

  • The processor is like the sous-chef helping to chop and prepare the ingredients (images and text).
  • The model is the main chef putting all the ingredients together to create a delightful meal (the generated text).
  • The final output is the beautifully plated dish—ready to be served before the guests (users)!

Enhancing Efficiency with Autocast

To make your GPU’s memory usage more efficient, you can run inference under autocast so that computation happens in bfloat16. This requires a CUDA-capable GPU and an `import torch` at the top of your script:

import torch

with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )

Troubleshooting Tips

Here are some common problems you may encounter while working with Molmo, along with solutions:

  • Broadcast Error: If you encounter this error while processing images, it is usually because the image is not in RGB mode. Convert it before passing it to the processor:

    from PIL import Image

    image = Image.open(...)
    if image.mode != "RGB":
        image = image.convert("RGB")
  • Transparent Images: Molmo may not work well with transparent (RGBA) images. Composite the image onto a solid background with PIL before passing it to the model.
  • Background Addition Example: A simple approach is to choose the background color based on the image’s overall brightness, so the subject keeps its contrast against the new backdrop.
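As a sketch of that background-addition idea, here is one way to flatten an RGBA image onto a solid backdrop chosen from its brightness. The helper name and the brightness threshold of 128 are assumptions for illustration, not part of the Molmo API:

```python
from PIL import Image
import numpy as np

def add_background(image: Image.Image) -> Image.Image:
    """Flatten a transparent image onto a solid background.

    Dark images get a white backdrop and bright images a black one,
    so the subject keeps contrast. (Heuristic threshold of 128 is an
    assumption, not part of Molmo.)
    """
    if image.mode != "RGBA":
        return image.convert("RGB")
    # Average brightness, weighted by alpha so transparent pixels don't count
    gray = np.asarray(image.convert("L"), dtype=np.float32)
    alpha = np.asarray(image.getchannel("A"), dtype=np.float32) / 255.0
    brightness = float((gray * alpha).sum() / max(alpha.sum(), 1.0))
    bg_color = (255, 255, 255) if brightness < 128 else (0, 0, 0)
    # Composite the image onto the chosen background using its alpha channel
    background = Image.new("RGB", image.size, bg_color)
    background.paste(image, mask=image.getchannel("A"))
    return background
```

The returned image is plain RGB, so it can be fed to `processor.process` without triggering the transparency issues above.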

For persistent issues, you can reach out for more help or check online forums. For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
