How to Use LLaVA-Critic-7B for Evaluating Multimodal Models

Oct 28, 2024 | Educational

Welcome to the world of LLaVA-Critic-7B! This large multimodal model (LMM) is designed to evaluate the performance of other models across a wide range of multimodal scenarios. In this guide, we walk you through the steps to use it effectively and integrate it into your projects.

Model Overview

LLaVA-Critic-7B stands out as an open-source evaluator for multimodal models. Built on llava-onevision-7b-ov, it is fine-tuned on the LLaVA-Critic-113k dataset to sharpen its evaluation capabilities.

Key Features

  • LMM-as-a-Judge: It provides judgments closely aligned with human perceptions, grounding its assessments in concrete, image-based evidence.
  • Preference Learning: It enhances visual chat capabilities with reliable reward signals, powering models like LLaVA-OV-Chat.

Quick Start Guide

Follow these steps to get LLaVA-Critic-7B up and running:


# Installation
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

# Import necessary libraries
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
import copy
import requests
import torch
from PIL import Image

# Load pre-trained model
pretrained = "lmms-labllava-critic-7b"
model_name = "llava_qwen"
device = "cuda"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device)

# Set to evaluation mode
model.eval()

# Load an image
url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
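
Before moving on, it can be worth confirming that everything loaded as expected. The following is a minimal sanity check, assuming the setup code above ran without errors:

# Optional sanity check: processed images should be float16 tensors on the target device
print("Number of image tensors:", len(image_tensor))
print("dtype / device:", image_tensor[0].dtype, image_tensor[0].device)
print("Model context length:", max_length)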

How It Works: An Analogy

Imagine entering a culinary competition where chefs present their dishes to a panel of judges. Each dish (a model's response) is scrutinized not just for taste (the evaluation criteria) but also for presentation, creativity, and the origin of its ingredients (image-grounded reasoning).

Similarly, LLaVA-Critic-7B acts as the judging panel, evaluating dishes prepared by other models. For each dish, LLaVA-Critic scores the quality and provides detailed feedback, akin to judges who taste each dish and offer constructive criticism.

Evaluation Settings

LLaVA-Critic primarily operates in two evaluation modes:

  • Pointwise Scoring: Assigns a score to a single candidate response.
  • Pairwise Ranking: Compares two candidate responses to identify which one excels.

Example Usage

After loading the model and preparing an image, you can run a pairwise evaluation like this:


# Pairwise Ranking: ask LLaVA-Critic to compare two candidate answers to the same question
critic_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge "
    "to evaluate the quality of the answers provided by a multimodal model. "
    "Determine which answer is better and explain your reasoning with specific details."
)

# Define your question and the two candidate responses
question = "What does this image present?"
response1 = "The image is a black and white sketch of a cross."
response2 = "This is a handwritten number seven."

# Assemble the full evaluation prompt, prepending the image token
eval_prompt = (
    DEFAULT_IMAGE_TOKEN + "\n" + critic_prompt
    + f"\nQuestion: [{question}]"
    + f"\nThe first response: [{response1}]"
    + f"\nThe second response: [{response2}]"
)

# Set up the conversation template and build the prompt
conv_template = "qwen_1_5"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], eval_prompt)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Generate the evaluation
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes, do_sample=False, temperature=0, max_new_tokens=4096)

# Decode and print the critic's judgment
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])
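
The example above covers pairwise ranking. For the pointwise mode described under Evaluation Settings, only the prompt changes; the conversation template, tokenization, and generation steps stay the same. The wording below is illustrative rather than the exact instruction the model was trained with, and it reuses the variables defined above:

# Pointwise Scoring (illustrative prompt; the exact training wording may differ)
pointwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge "
    "to evaluate the quality of the answer provided by a multimodal model. "
    "Score the answer on a scale of 1 to 10 and explain your reasoning with specific details."
)
response = "This is a handwritten number seven."
eval_prompt = (
    DEFAULT_IMAGE_TOKEN + "\n" + pointwise_prompt
    + f"\nQuestion: [{question}]"
    + f"\nThe response: [{response}]"
)

# Reuse the same conversation template and generation settings as above
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], eval_prompt)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes, do_sample=False, temperature=0, max_new_tokens=4096)
print(tokenizer.batch_decode(cont, skip_special_tokens=True)[0])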

Troubleshooting Your Setup

If you encounter issues while using LLaVA-Critic-7B, work through the following troubleshooting tips; a quick diagnostic sketch follows the list.

  • Check that all libraries are installed correctly to avoid `ImportError` messages.
  • Ensure your model is set to evaluation mode with `model.eval()` to get the correct output.
  • Verify that the images are being processed successfully; otherwise, revisit the image loading step.
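
If the problem persists, a quick diagnostic along these lines can help narrow down whether the installation or the environment is at fault. This is a minimal sketch using standard-library and PyTorch checks, not part of the LLaVA API:

# Minimal environment check (assumes only Python, PyTorch, and the pip install step above)
import importlib.util
import torch

# Is the llava package from the LLaVA-NeXT repository importable?
print("llava installed:", importlib.util.find_spec("llava") is not None)

# Is a CUDA device available for float16 inference?
print("CUDA available:", torch.cuda.is_available())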

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With LLaVA-Critic-7B, evaluating multimodal models has never been easier. By following this guide, you’re well on your way to harnessing the full potential of this groundbreaking technology.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
