How to Use the Phi-3 Vision 128K Instruct Model

Jul 19, 2024 | Educational

The Phi-3-Vision-128K-Instruct model is a compelling new player in the realm of artificial intelligence, particularly in processing visual and textual information. Think of it as a well-trained bilingual guide, capable of interpreting images while providing insightful textual descriptions. In this article, we’ll explore how to harness the power of this model, troubleshoot common issues, and embrace its potential for various applications.

Model Overview

The Phi-3-Vision-128K-Instruct is a state-of-the-art multimodal model that understands both text and images. It’s built upon curated datasets aimed at providing high-quality and reasoning-dense data. Imagine it as a library filled with well-organized books (data), all ready to answer your questions—be it an image interpretation or generating text responses.

How to Get Started

Prerequisites

To use the Phi-3 model, you need `transformers` at version 4.40.2 or newer — at the time of writing, that means installing the development build from source. Just like a tool that requires sharpening before use, you must prepare your setup:

1. Uninstall the Old Version:
```bash
pip uninstall -y transformers
```

2. Install the Development Version:
```bash
pip install git+https://github.com/huggingface/transformers
```

3. Verify Installation:
```bash
pip list | grep transformers
```
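If you prefer to check the version programmatically rather than eyeballing `pip list` output, a small comparison helper does the job. This is a sketch of our own (`version_at_least` is not part of any library); dev builds of `transformers` report versions like `4.41.0.dev0`, which the helper handles by ignoring non-numeric suffixes:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Numerically compare dotted version strings, ignoring suffixes like 'dev0'."""
    def parts(v: str) -> list:
        nums = []
        for piece in v.split("."):
            if not piece.isdigit():
                break  # stop at suffixes such as "dev0" or "rc1"
            nums.append(int(piece))
        return nums
    return parts(installed) >= parts(required)

print(version_at_least("4.41.0.dev0", "4.40.2"))  # True
print(version_at_least("4.39.3", "4.40.2"))       # False
```

Feed it the value of `transformers.__version__` to confirm your environment meets the 4.40.2 floor before loading the model.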

Example Inference Code

Here’s a code snippet that demonstrates how to leverage the model using Python. Think of it as a recipe to bake a delicious cake—ensure you have the right ingredients!


```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and its processor (trust_remote_code is required for Phi-3-Vision)
model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Prepare inputs: download an example image
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)

# The <|image_1|> placeholder tells the processor where the image belongs;
# earlier conversation turns can be appended here before generating
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args,
)

# Strip the prompt tokens so only the newly generated answer is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```
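FlashAttention 2 requires a fairly recent NVIDIA GPU. When it is unavailable, the load call can fall back to `_attn_implementation='eager'` instead. A small helper that builds the `from_pretrained` keyword arguments either way — a sketch of our own (`phi3_load_kwargs` is not part of `transformers`):

```python
def phi3_load_kwargs(flash_attention: bool = True) -> dict:
    """Keyword arguments for AutoModelForCausalLM.from_pretrained.

    Falls back to the eager attention path when FlashAttention 2 is
    unavailable (e.g. on older GPUs).
    """
    return {
        "device_map": "cuda",
        "trust_remote_code": True,
        "torch_dtype": "auto",
        "_attn_implementation": "flash_attention_2" if flash_attention else "eager",
    }

print(phi3_load_kwargs(flash_attention=False)["_attn_implementation"])  # eager
```

Usage would then look like `model = AutoModelForCausalLM.from_pretrained(model_id, **phi3_load_kwargs(False))`.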

Analogous Explanation

Think of using this model like visiting a high-tech museum. The images are exhibits, and you can ask questions about them. The model acts like a knowledgeable curator. You provide an image (exhibit) and a question (query), similar to how you would inquire about an artwork, and it responds with insightful information, enriching your experience without needing a physical guide.

Troubleshooting Common Issues

Even the best models can encounter bumps along the road. Here are some troubleshooting tips if you run into issues when using the Phi-3 model:

1. Model Not Loading: If the model fails to load, double-check that you’ve set `trust_remote_code=True` when calling `from_pretrained()`.

2. Dependency Errors: Ensure that all required packages are installed. Here are the necessary packages:
```bash
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.40.2
```
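To sanity-check an environment against pins like the ones above, you can parse the list and compare each entry against what `importlib.metadata.version()` reports. Here is a minimal parser — our own helper, not part of pip:

```python
def parse_pins(requirements: str) -> dict:
    """Parse 'name==version' lines into a {name: version} mapping."""
    pins = {}
    for line in requirements.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, _, ver = line.partition("==")
        pins[name.strip()] = ver.strip()
    return pins

pins = parse_pins("""
torch==2.3.0
transformers==4.40.2
""")
print(pins)  # {'torch': '2.3.0', 'transformers': '4.40.2'}
```

Looping over the result and comparing against `importlib.metadata.version(name)` surfaces missing or mismatched packages before you spend time loading the model.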

3. Unexpected Outputs: If the responses are inconsistent or erroneous, this may stem from limitations in the model’s training data. Review the kinds of data it was trained on and adjust your prompts accordingly.

For more troubleshooting questions/issues, contact our fxis.ai data scientist expert team.

Responsible AI Considerations

As powerful as the Phi-3 model is, it is not without its responsibilities. Users must exercise caution, especially in sensitive applications, as it can unintentionally produce biased or potentially harmful content. Always ensure that your use case adheres to relevant laws and regulations, particularly in high-stakes situations.

Conclusion

With its impressive capabilities, the Phi-3-Vision-128K-Instruct model serves as a powerful tool for both commercial and research purposes. By following the setup instructions, leveraging the insightful code examples, and being mindful of responsible AI practices, you can unlock the potential of this remarkable model and contribute to the ever-evolving field of artificial intelligence. Happy coding!
