Getting Started with Molmo 72B: Your Guide to Vision-Language Models

Oct 28, 2024 | Educational

Molmo 72B is an impressive open-source vision-language model from the Allen Institute for AI, built to process and reason over paired images and text. In this article, we’ll walk you through how to quickly set up and run Molmo, along with some troubleshooting ideas for common issues.

What is Molmo 72B?

Molmo is part of a family of multimodal models trained on PixMo, a dataset containing over a million curated image-text pairs. This model is designed to achieve state-of-the-art performance while remaining fully open-source. To visualize it, think of Molmo 72B as a highly skilled chef who can create dishes (outputs) based on the ingredients (image-text pairs) presented to them. The model’s ability to perfectly blend and interpret these ingredients results in a delicious final dish, or in this case, meaningful outputs.

Quick Start with Molmo 72B

To run Molmo, follow these simple steps:

Step 1: Install Dependencies

pip install einops torchvision
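
The code in the following steps also imports transformers, Pillow, requests, and torch, and device_map="auto" relies on Accelerate. If those packages aren't already in your environment, something like this should cover them:

pip install transformers accelerate pillow requests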

Step 2: Load the Model

Now, you can use the following Python code to initialize the model:

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import torch

# load the processor
processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-72B-0924",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# load the model
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-72B-0924",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
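
With device_map="auto", Accelerate decides where to place each layer of the 72B checkpoint across your available GPUs (spilling to CPU memory if needed). To check the resulting placement, you can print the hf_device_map attribute that Accelerate attaches to the model:

# inspect how the model was sharded across devices
print(model.hf_device_map)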

Step 3: Processing Images and Text

The following code processes an image and a piece of text:

inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)
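
The processor returns single-example tensors. As in the Molmo model card's example, move them to the model's device and add a batch dimension, since generate_from_batch expects batched inputs:

# move the inputs to the model's device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}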

Step 4: Generate Output

Now, let’s generate the output. Use the following code:

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)
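
The returned tensor contains the prompt tokens followed by the newly generated ones. As in the model card's example, slice off the prompt and decode the rest to get readable text:

# keep only the newly generated tokens and decode them to text
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)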

Troubleshooting Common Issues

Here are some troubleshooting ideas for issues you might encounter while using the Molmo model:

  • Broadcast Error with Images: If you encounter a broadcast error when processing images, your image might not be in RGB mode. Convert it first:

    from PIL import Image

    image = Image.open(...)
    if image.mode != "RGB":
        image = image.convert("RGB")

  • Transparent Images: Molmo may struggle with transparent images. Flatten them onto a white or dark background first (a combined helper covering both fixes follows this list):

    # Load the image
    url = ...
    image = Image.open(requests.get(url, stream=True).raw)

    # Convert the image to grayscale to calculate brightness
    gray_image = image.convert("L")

    # Calculate the average brightness
    from PIL import ImageStat
    stat = ImageStat.Stat(gray_image)
    average_brightness = stat.mean[0]

    # Define the background color based on brightness
    bg_color = (0, 0, 0) if average_brightness < 127 else (255, 255, 255)

    # Create a new image with the background color and paste the original on top
    new_image = Image.new("RGB", image.size, bg_color)
    new_image.paste(image, (0, 0), image if image.mode == "RGBA" else None)

    # Pass the new image to the Molmo processor
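
If you'd rather apply both fixes in one place, here is a minimal sketch of a preprocessing helper; the function name prepare_image and the 127 brightness threshold are illustrative choices rather than anything defined by Molmo. Pass the returned image to processor.process as usual.

from PIL import Image, ImageStat

def prepare_image(image: Image.Image) -> Image.Image:
    """Flatten transparency onto a background and ensure RGB mode."""
    if image.mode == "RGBA":
        # Choose a background color based on the image's average brightness
        average_brightness = ImageStat.Stat(image.convert("L")).mean[0]
        bg_color = (0, 0, 0) if average_brightness < 127 else (255, 255, 255)
        background = Image.new("RGB", image.size, bg_color)
        background.paste(image, (0, 0), image)  # the alpha channel acts as the paste mask
        return background
    if image.mode != "RGB":
        return image.convert("RGB")
    return image

# Example usage (with a hypothetical local file):
# inputs = processor.process(images=[prepare_image(Image.open("photo.png"))], text="Describe this image.")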

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Molmo 72B opens new avenues in vision-language modeling, giving the community a powerful, fully open tool for research and application. By following the steps outlined above, you'll be well on your way to harnessing its full potential.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
