The LLaVA (Large Language and Vision Assistant) model is a cutting-edge multimodal chatbot trained to generate text from images and conversational prompts. In this article, we’ll explore how to use the LLaVA model, troubleshoot common issues, and optimize its performance.

Understanding LLaVA
The LLaVA model is an open-source chatbot built as an auto-regressive language model on top of the transformer architecture. It specializes in image-to-text tasks by combining information from visual inputs with textual prompts.
How to Use the LLaVA Model
To get started with the LLaVA model, follow the steps below:
1. Setup Requirements
- Make sure you have `transformers >= 4.35.3` installed.
- This model supports multi-image and multi-prompt generation (a batched example is sketched after the pure `transformers` walkthrough below).
2. Choosing the Right Template
Use the correct prompt format when querying the model: `USER: xxx\nASSISTANT:`. Place the `<image>` token at the spot where you want the model to analyze an image.
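For reference, a fully formatted single-image prompt built with this template might look like the sketch below (the question text is just a placeholder):

```python
# A manually written LLaVA-1.5 style prompt; the <image> token marks
# where the image is injected. The question is only an example.
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
```

In practice you rarely need to build this string by hand: `processor.apply_chat_template`, used in the examples below, produces it for you from a chat-style conversation.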
3. Example Usage with Pipeline
Here’s how to use the LLaVA model with the `pipeline`:
```python
from transformers import pipeline, AutoProcessor
from PIL import Image
import requests

model_id = "llava-hf/llava-1.5-13b-hf"
pipe = pipeline("image-to-text", model=model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Load the demo image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the prompt from a chat-style conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
```
In this example, you’re getting the model to interpret an image and respond to a question based on it.
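The pipeline returns a list of dictionaries; assuming the usual `image-to-text` output format, the generated string can be pulled out like this:

```python
# Each entry holds the full generated text (prompt plus the model's answer)
answer = outputs[0]["generated_text"]
print(answer)
```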
4. Using Pure Transformers
If you want to run the model with more control, use the pure `transformers` approach. Here’s an example:
```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"

# Load the model in half precision and move it to GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# Build the prompt from a chat-style conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Load the input image
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# Preprocess and generate
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```
In this example, the model is instructed to provide textual responses based on the provided image.
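Because the model supports multi-image and multi-prompt generation, you can also batch several image/prompt pairs in a single call. The following is a minimal sketch that assumes two PIL images, `image1` and `image2` (hypothetical variables loaded the same way as `raw_image` above), and uses padding so the prompts can differ in length:

```python
# Two single-image prompts processed as one batch
prompts = [
    "USER: <image>\nWhat are the things I should be cautious about when I visit this place?\nASSISTANT:",
    "USER: <image>\nPlease describe this image.\nASSISTANT:",
]

inputs = processor(
    images=[image1, image2],  # image1 / image2: PIL images loaded beforehand
    text=prompts,
    padding=True,
    return_tensors="pt",
).to(0, torch.float16)

outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
for text in processor.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```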
Optimizing Model Performance
Better performance can be achieved by implementing the following optimization techniques:
1. 4-bit Quantization
First, ensure you’ve installed `bitsandbytes` by running `pip install bitsandbytes`, then modify your model loading call as shown below:
```python
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
)
```
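Note that recent `transformers` releases deprecate passing `load_in_4bit` directly in favor of a `BitsAndBytesConfig`; if your installed version supports it, a roughly equivalent sketch looks like this:

```python
from transformers import BitsAndBytesConfig

# 4-bit weights with fp16 compute; the quantized model is dispatched to the GPU automatically
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
)
```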
2. Use Flash Attention 2
To speed up generation further, you can use `flash-attn`. Refer to the original Flash Attention repository for installation instructions, then update your model loading call as shown:
```python
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_flash_attention_2=True,
).to(0)
```
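In newer `transformers` versions the `use_flash_attention_2` argument is deprecated in favor of `attn_implementation`; if your version supports it, the equivalent call looks roughly like this:

```python
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",  # still requires flash-attn to be installed
).to(0)
```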
Troubleshooting
Here are some common issues and their troubleshooting tips:
- If you encounter errors related to image loading, ensure the URLs are correct and that you have internet access (a loading helper is sketched after this list).
- For performance issues, verify that the correct GPU and dependencies are set up.
- If the model isn’t responding as expected, double-check your prompt format and make sure an `<image>` token is included for every image you pass.
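As a sanity check for the first point, a small helper like this sketch (not part of the LLaVA API) makes URL problems fail loudly instead of surfacing as confusing downstream errors:

```python
import requests
from PIL import Image

def load_image(url: str) -> Image.Image:
    # Raise immediately on bad URLs, timeouts, or missing internet access
    response = requests.get(url, stream=True, timeout=10)
    response.raise_for_status()
    return Image.open(response.raw).convert("RGB")
```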
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
The LLaVA model is a powerful tool for image-to-text generation. By following the steps outlined in this article, you can harness its capabilities efficiently while troubleshooting any potential issues that arise along the way. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.