Molmo 72B is an impressive open-source vision-language model developed by the Allen Institute for AI, jointly processing images and text to produce meaningful answers about visual content. In this article, we'll walk through how to quickly set up and run Molmo, along with some troubleshooting ideas to help you overcome common issues.
What is Molmo 72B?
Molmo is part of a family of multimodal models trained on PixMo, a dataset containing over a million curated image-text pairs. This model is designed to achieve state-of-the-art performance while remaining fully open-source. To visualize it, think of Molmo 72B as a highly skilled chef who can create dishes (outputs) based on the ingredients (image-text pairs) presented to them. The model’s ability to perfectly blend and interpret these ingredients results in a delicious final dish, or in this case, meaningful outputs.
Quick Start with Molmo 72B
To run Molmo, follow these simple steps:
Step 1: Install Dependencies
pip install einops torchvision
The code below also imports transformers, torch, Pillow, and requests, so make sure those are installed in your environment as well.
Step 2: Load the Model
Now, you can use the following Python code to initialize the model:
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import torch
# load the processor
processor = AutoProcessor.from_pretrained(
"allenai/Molmo-72B-0924",
trust_remote_code=True,
torch_dtype=torch.float16,
device_map="auto",
)
# load the model
model = AutoModelForCausalLM.from_pretrained(
"allenai/Molmo-72B-0924",
trust_remote_code=True,
torch_dtype=torch.float16,
device_map="auto",
)
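With 72 billion parameters, the weights alone are far too large for a single consumer GPU, which is why device_map="auto" is used to shard them across all visible devices. A quick back-of-envelope estimate (a sketch; real usage also needs memory for activations and the KV cache) shows the scale:

```python
# Rough weight-memory estimate for a 72B-parameter model in float16.
num_params = 72e9          # 72 billion parameters
bytes_per_param = 2        # float16 = 2 bytes per parameter

weight_bytes = num_params * bytes_per_param
weight_gb = weight_bytes / 1e9  # decimal gigabytes

print(f"Approximate weight memory: {weight_gb:.0f} GB")  # ~144 GB for weights alone
```

If that budget exceeds your hardware, consider the smaller models in the Molmo family before reaching for the 72B variant.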
Step 3: Processing Images and Text
The following code downloads an example image and builds model inputs from the image and a text prompt, then moves the inputs onto the model's device with a batch dimension added:
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)
# move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
Step 4: Generate Output
Now, let's generate the output and decode it back to text. Use the following code:
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)
# decode only the newly generated tokens (everything after the prompt)
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
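The generated sequence contains the prompt tokens followed by the new tokens, so decoding typically slices the prompt off first. That slicing pattern can be sketched with a plain tensor (torch only; no model required):

```python
import torch

# Pretend output: batch of 1, prompt tokens [1, 2, 3] followed by generated [4, 5, 6]
output = torch.tensor([[1, 2, 3, 4, 5, 6]])
prompt_len = 3  # in the real pipeline this is inputs["input_ids"].size(1)

# Keep only the newly generated tokens
generated_tokens = output[0, prompt_len:]
print(generated_tokens.tolist())  # [4, 5, 6]
```

In the real pipeline, the kept tokens are then passed to processor.tokenizer.decode to recover the text.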
Troubleshooting Common Issues
Here are some troubleshooting ideas for issues you might encounter while using the Molmo model:
- Broadcast Error with Images: If you encounter a broadcast error when processing images, your image might not be in RGB format. Convert it using the following code snippet:
from PIL import Image
image = Image.open(...)
if image.mode != "RGB":
    image = image.convert("RGB")
- Transparent (RGBA) Images: Images with an alpha channel can also trip up preprocessing. One approach is to composite the image onto a solid background, choosing black or white based on the image's average brightness:
from PIL import Image, ImageStat
import requests
# Load the image
url = ...
image = Image.open(requests.get(url, stream=True).raw)
# Convert to grayscale to estimate brightness
gray_image = image.convert("L")
stat = ImageStat.Stat(gray_image)
average_brightness = stat.mean[0]
# Pick a black or white background depending on brightness
bg_color = (0, 0, 0) if average_brightness < 127 else (255, 255, 255)
# Composite onto the background, using the alpha channel as a mask if present
new_image = Image.new("RGB", image.size, bg_color)
new_image.paste(image, (0, 0), image if image.mode == "RGBA" else None)
# Pass new_image to the Molmo processor
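The brightness check and compositing steps above can be exercised end-to-end on a synthetic image, no downloads required (a minimal sketch using only Pillow):

```python
from PIL import Image, ImageStat

# A dark, semi-transparent RGBA test image
image = Image.new("RGBA", (64, 64), (10, 10, 10, 128))

# Estimate brightness from the grayscale version
gray = image.convert("L")
average_brightness = ImageStat.Stat(gray).mean[0]

# Dark image -> black background, bright image -> white background
bg_color = (0, 0, 0) if average_brightness < 127 else (255, 255, 255)

# Composite onto an opaque background, using alpha as the paste mask
flattened = Image.new("RGB", image.size, bg_color)
flattened.paste(image, (0, 0), image if image.mode == "RGBA" else None)

print(flattened.mode)  # RGB
```

The resulting flattened image has no alpha channel and can be passed to the Molmo processor like any other RGB image.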
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Molmo 72B opens new avenues in the understanding of vision-language models, providing the community with a powerful tool for research and application. By following the steps outlined above, you'll be well on your way to harnessing its full potential.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.