How to Get Started with Florence-2: A Unified Vision Representation Model

Jul 22, 2024 | Educational

Florence-2 is a vision foundation model developed by Microsoft. It uses a prompt-based approach to handle a wide range of vision and vision-language tasks, such as captioning, object detection, and segmentation. In this article, we walk through how to get started with Florence-2, what it can do, and a few troubleshooting tips for common setup problems.

Model Summary

Florence-2 is trained on FLD-5B, a dataset of 5.4 billion annotations across 126 million images, which underpins its multi-task learning. Its sequence-to-sequence architecture adapts to different tasks in both zero-shot and fine-tuned settings, making it a competitive option among vision foundation models.

How to Get Started with Florence-2

Ready to harness the power of Florence-2? Here’s how you can get started easily:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Use the first GPU with half precision when CUDA is available; otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# trust_remote_code=True is required because the model code lives in the model repository.
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# The task prompt selects the task; "<OD>" requests object detection.
prompt = "<OD>"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Convert the prompt and image to tensors on the model's device and dtype.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

# Deterministic beam search (no sampling).
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)

# Keep special tokens: the post-processor needs them to recover the structured result.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# The task passed here must match the prompt used above.
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))

print(parsed_answer)
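
For the object-detection prompt `<OD>`, the Florence-2 model card shows the parsed answer as a dictionary keyed by the task prompt, holding parallel `bboxes` and `labels` lists. The sketch below assumes that shape and pairs each label with its box. The helper name `pair_detections` and the sample values are illustrative only, not real model output.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


def pair_detections(parsed_answer: Dict, task: str = "<OD>") -> List[Tuple[str, Box]]:
    """Pair each label with its (x1, y1, x2, y2) box.

    Assumes parsed_answer looks like {task: {"bboxes": [...], "labels": [...]}}.
    """
    result = parsed_answer.get(task, {})
    bboxes = result.get("bboxes", [])
    labels = result.get("labels", [])
    if len(bboxes) != len(labels):
        raise ValueError("bboxes and labels must have the same length")
    return [(label, tuple(box)) for label, box in zip(labels, bboxes)]


# Made-up sample in the assumed shape (not real model output).
sample = {"<OD>": {"bboxes": [[34.0, 160.0, 597.0, 371.0]], "labels": ["car"]}}
for label, (x1, y1, x2, y2) in pair_detections(sample):
    print(f"{label}: ({x1}, {y1}) to ({x2}, {y2})")
```

In practice you would pass the `parsed_answer` from the example above in place of `sample`.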

Understanding the Code Through Analogy

Imagine you’re a painter (the model) who needs to describe scenes on a canvas (the images) based on the instructions you are given (the prompts). In this scenario, the code provides you with all the necessary tools:

Importing Libraries: Like a painter gathering brushes and colors, you start by importing essential libraries.
Setting the Device: Choosing the right studio (GPU or CPU) ensures you can work efficiently.
Loading the Model: This is akin to taking out your favorite paint set and preparing it for use.
Getting the Input: Here, you download the scene you want to describe (the input image) and lay out your materials (the processor turns the image and prompt into tensors).
Generating the Output: Finally, you apply your skills to produce the finished work (the model generates text, which is then decoded and parsed into the answer).

Tasks You Can Perform with Florence-2

Florence-2 can perform various tasks simply by changing the prompt. Here are a few examples:

Captioning: Use the prompt “<CAPTION>” to generate a caption for the image.
Object Detection: Use “<OD>” to identify objects in the image and return their bounding boxes.
OCR (Optical Character Recognition): Use “<OCR>” to extract text present in the image.
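
Because only the prompt changes between tasks, the generation steps can be wrapped in one function. The following is a minimal sketch, assuming the task prompts `<CAPTION>`, `<OD>`, and `<OCR>` from the Florence-2 model card and a `model` and `processor` loaded as in the getting-started example. The name `run_task` and its parameters are hypothetical and not part of the Transformers API.

```python
from typing import Any, Optional


def run_task(model: Any, processor: Any, image: Any, task_prompt: str,
             device: str = "cpu", dtype: Optional[Any] = None,
             max_new_tokens: int = 1024) -> Any:
    """Run one task prompt on one image and return the parsed answer.

    `model`, `processor`, and `image` are expected to be the objects created
    in the getting-started example (hypothetical helper, not a library call).
    """
    inputs = processor(text=task_prompt, images=image, return_tensors="pt")
    # Move tensors to the target device; cast only when a dtype is given.
    inputs = inputs.to(device, dtype) if dtype is not None else inputs.to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=3,
    )
    # Special tokens are kept because post-processing relies on them.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The same prompt is passed as the task so that the parser matches the output.
    return processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
```

With the objects from the getting-started example, a call such as `run_task(model, processor, image, "<CAPTION>", device, torch_dtype)` should return the caption result, and swapping in `"<OD>"` or `"<OCR>"` switches the task without any other change.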

Troubleshooting Tips

If you encounter issues while working with Florence-2, here are some troubleshooting ideas:

Check for Compatibility: Ensure your environment supports CUDA if you’re working with GPUs. The example code falls back to the CPU (with float32) automatically when torch.cuda.is_available() returns False.
Library Versions: Verify that torch, transformers, Pillow, and requests are installed and up to date, as outdated packages may cause conflicts.
Image Download Error: If an image does not load, check that its URL is correct and reachable from your machine.
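
The first two checks can be scripted. The sketch below uses only the standard library's `importlib.metadata` to report installed versions and imports `torch` lazily so it still runs when PyTorch is missing. The function names and the package list are assumptions made for illustration.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available() -> Optional[bool]:
    """Return torch.cuda.is_available(), or None when torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


if __name__ == "__main__":
    # Package list assumed from the imports in the getting-started example.
    report = installed_versions(["torch", "transformers", "Pillow", "requests"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
    print(f"CUDA available: {cuda_available()}")
```

A `None` for any package means it needs to be installed, and a `False` or `None` for CUDA means the example will run on the CPU.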
