The Phi-3-Vision-128K-Instruct is a powerful multimodal model designed to analyze both visual and textual inputs. It is a strong fit for researchers and developers who want to integrate multimodal AI capabilities into their applications. This guide outlines the steps needed to get the model running and provides troubleshooting tips along the way.
Model Overview
The Phi-3-Vision-128K-Instruct model stands out for its ability to handle up to 128K tokens of context and was trained on a blend of high-quality datasets. It is particularly useful in applications that require efficient joint processing of text and images.
Installing the Model
Follow these steps to get started with the Phi-3-Vision-128K-Instruct model:
- Ensure you have the development version (4.40.2) of transformers installed; you can verify the installed version with the snippet below. If necessary, update your local transformers package by running:
pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
- Load the model, making sure to pass trust_remote_code=True:
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-vision-128k-instruct", device_map="cuda", trust_remote_code=True)
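A quick way to confirm which transformers build is active in your environment is the check below; this is a minimal sketch, and the exact version string on your machine may differ.
import transformers

# Expect the development build (4.40.2 or later) when installed from GitHub.
print(transformers.__version__)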
How to Interact with the Model
The model is primarily designed for chat-style interactions involving a single image. Format your input as follows:
<|user|>
<|image_1|>
{prompt}
<|end|>
<|assistant|>
For multi-turn conversations, format your prompts as follows:
<|user|>
<|image_1|>
{prompt_1}
<|end|>
<|assistant|>
{response_1}
<|end|>
<|user|>
{prompt_2}
<|end|>
<|assistant|>
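Rather than assembling these tags by hand, you can let the processor's chat template build the prompt from a list of messages. The sketch below assumes the processor has been loaded as shown in the next section; the questions and the assistant reply are placeholder strings.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The image shows a bar chart of survey results."},
    {"role": "user", "content": "Summarize the main takeaway in one sentence."},
]
# apply_chat_template wraps each turn in <|user|>/<|assistant|> ... <|end|> tags;
# add_generation_prompt=True appends a trailing <|assistant|> so the model responds next.
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)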
Sample Inference Code
To get started with running the model on a GPU, here’s a sample code snippet:
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor
model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Example message sequence; the conversation must end with a user turn so the
# model has something to answer. Earlier assistant turns can be appended for
# multi-turn chats, as shown after this snippet.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
# Load image
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
}
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
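To continue the conversation, one option is to append the generated answer as an assistant turn, add the next question, and rebuild the prompt. This is a minimal sketch that assumes the variables from the snippet above are still in scope; the follow-up question is only a placeholder.
# Feed the model's answer back in as an assistant turn, then ask a follow-up question.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "What colors appear in the image?"})

# Rebuild the prompt and run generation exactly as above.
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")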
Understanding the Code: An Analogy
Imagine you are setting up a smart home system. Each device (like lights or speakers) needs to receive instructions and data to function properly. In this scenario:
- Devices: Represent different parts of the Phi-3 model (language and image processing).
- Instructions: Are like the code snippets provided here; they tell the devices how to respond to various commands.
- Input Data: Like the images and messages you send to your smart home system, this is what helps the model understand what to process and respond with.
- Smart Home App: Similar to the processor in our code, it helps bridge the gap between user commands (input) and device actions (output).
Troubleshooting Tips
Should you encounter issues while using the Phi-3-Vision-128K-Instruct model, consider the following:
- Ensure your GPU supports flash attention.
- If you need to run without flash attention (for example, on an older GPU), load the model with the _attn_implementation="eager" option instead of the flash-attention implementation; see the sketch after this list.
- If you receive errors about dependencies, verify that you have the required packages installed, such as torch and transformers.
- Check the model's context length and ensure your inputs stay within the 128K-token limit.
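For reference, here is one way to load the model without flash attention. This is a minimal sketch that assumes the checkpoint's remote code accepts the _attn_implementation argument (as documented on the model card); adjust it to your environment.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# _attn_implementation="eager" skips the flash-attention kernels, which helps
# on older GPUs or when the flash-attn package is not installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)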
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

