The Phi-3-Vision-128K-Instruct is a powerful multimodal model designed to analyze both visual and textual inputs. It is a strong fit for researchers and developers who want to integrate multimodal AI capabilities into their applications. This guide outlines the steps needed to get the model running and provides troubleshooting tips along the way.
Model Overview
The Phi-3-Vision-128K-Instruct model stands out for its ability to handle up to 128K tokens of context and was trained on a blend of high-quality datasets. It is particularly useful in applications that require efficient joint processing of text and images.
Installing the Model
Follow these steps to get started with the Phi-3-Vision-128K-Instruct model:
- Ensure you have the development version (4.40.2) of transformers installed; you can verify the installed version with the snippet below. If necessary, update your local transformers package by running:
pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
- Load the model, making sure to pass trust_remote_code=True:
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-vision-128k-instruct", device_map="cuda", trust_remote_code=True)
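A quick way to confirm which transformers build is active in your environment is the check below; this is a minimal sketch, and the exact version string on your machine may differ.
import transformers

# Expect the development build (4.40.2 or later) when installed from GitHub.
print(transformers.__version__)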
How to Interact with the Model
The model is primarily designed for chat-style interactions involving a single image. Format your input as follows:
<|user|>
<|image_1|>
{prompt}
<|end|>
<|assistant|>
For multi-turn conversations, format your prompts as follows:
<|user|>
<|image_1|>
{prompt_1}
<|end|>
<|assistant|>
{response_1}
<|end|>
<|user|>
{prompt_2}
<|end|>
<|assistant|>
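Rather than assembling these tags by hand, you can let the processor's chat template build the prompt from a list of messages. The sketch below assumes the processor has been loaded as shown in the next section; the questions and the assistant reply are placeholder strings.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The image shows a bar chart of survey results."},
    {"role": "user", "content": "Summarize the main takeaway in one sentence."},
]
# apply_chat_template wraps each turn in <|user|>/<|assistant|> ... <|end|> tags;
# add_generation_prompt=True appends a trailing <|assistant|> so the model responds next.
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)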
Sample Inference Code
To get started with running the model on a GPU, here’s a sample code snippet:
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor
model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Example message sequence; the conversation must end with a user turn so the
# model has something to answer. Earlier assistant turns can be appended for
# multi-turn chats, as shown after this snippet.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
# Load image
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
}
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
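To continue the conversation, one option is to append the generated answer as an assistant turn, add the next question, and rebuild the prompt. This is a minimal sketch that assumes the variables from the snippet above are still in scope; the follow-up question is only a placeholder.
# Feed the model's answer back in as an assistant turn, then ask a follow-up question.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "What colors appear in the image?"})

# Rebuild the prompt and run generation exactly as above.
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")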
Understanding the Code: An Analogy
Imagine you are setting up a smart home system. Each device (like lights or speakers) needs to receive instructions and data to function properly. In this scenario:
- Devices: Represent different parts of the Phi-3 model (language and image processing).
- Instructions: Are like the code snippets provided here; they tell the devices how to respond to various commands.
- Input Data: Like the images and messages you send to your smart home system, this is what helps the model understand what to process and respond with.
- Smart Home App: Similar to the processor in our code, it helps bridge the gap between user commands (input) and device actions (output).
Troubleshooting Tips
Should you encounter issues while using the Phi-3-Vision-128K-Instruct model, consider the following:
- Ensure your GPU supports flash attention.
- If you need to run without flash attention (for example, on an older GPU), load the model with the _attn_implementation="eager" option instead of the flash-attention implementation; see the sketch after this list.
- If you receive errors about dependencies, verify that you have the required packages installed, such as torch and transformers.
- Check the model's context length and ensure your inputs stay within the 128K-token limit.
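For reference, here is one way to load the model without flash attention. This is a minimal sketch that assumes the checkpoint's remote code accepts the _attn_implementation argument (as documented on the model card); adjust it to your environment.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# _attn_implementation="eager" skips the flash-attention kernels, which helps
# on older GPUs or when the flash-attn package is not installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)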
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

