How to Use the Fine-Tuned Microsoft Florence-2-Large Model for Image Captioning

Jul 31, 2024 | Educational

In this guide, we’ll explore how to use thwri/CogFlorence-2.1-Large, a fine-tuned version of the microsoft/Florence-2-large model designed specifically for generating detailed captions from images. The checkpoint was fine-tuned on a curated image–caption dataset to improve the quality and detail of its output.

Understanding the Model and Dataset

The model we are working with is thwri/CogFlorence-2.1-Large, a version of microsoft/Florence-2-large fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset. The dataset covers diverse subjects, making it an excellent training ground for improving the model’s captioning capabilities. The training captions were generated with the THUDM/cogvlm2-llama3-chat-19B model.

Training Details

  • Vision Encoder: Frozen during training.
  • Batch Size: 64
  • Gradient Accumulation Steps: 16
  • Learning Rate: 5.12e-05
  • Optimizer: AdamW
  • Scheduler: Polynomial
  • Epochs: 7.37
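
To make these numbers concrete, here is a minimal sketch of how they could be expressed with Hugging Face TrainingArguments. This is not the authors’ actual training script: the output directory is a placeholder, and the commented-out vision_tower attribute name is an assumption about the Florence-2 implementation.

from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="cogflorence-2-large-ft",  # placeholder output directory
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,
    learning_rate=5.12e-5,
    optim="adamw_torch",
    lr_scheduler_type="polynomial",
    num_train_epochs=7.37,
)

# The vision encoder was kept frozen during training; one way to do that
# (the attribute name is assumed and may differ in the Florence-2 code):
# for param in model.vision_tower.parameters():
#     param.requires_grad = False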

Understanding the Code Through an Analogy

Imagine that your task is to train a talented artist (the model) to create beautiful captions (artworks) based on various photographs (images). You meticulously select a range of diverse photographs and provide clear instructions (prompts) on how to observe and describe those images. The photos you choose are not only varied but also rich in details, allowing your artist to learn and interpret a wide spectrum of scenes. The training process is akin to teaching your artist with a guide that explains the intricacies of the given photographs step-by-step.

Now, let’s dive into the actual code you’ll use to put this model to work:

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2.1-Large", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2.1-Large", trust_remote_code=True)

# Function to run the model on an example
def run_example(task_prompt, image):
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result) # {'<MORE_DETAILED_CAPTION>': 'A vivid, close-up photograph of a classic car, specifically a Volkswagen Beetle, parked on a cobblestone street...'}
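
Florence-2 also recognizes shorter captioning task prompts such as <CAPTION> and <DETAILED_CAPTION>. This checkpoint is demonstrated above with <MORE_DETAILED_CAPTION>, so results with the shorter prompts may vary; treat the loop below as a quick experiment rather than a documented feature.

# Compare caption styles across Florence-2 captioning task prompts.
# <MORE_DETAILED_CAPTION> is the prompt used in the example above; the shorter
# prompts come from the base Florence-2 model and may behave differently here.
for task in ("<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>"):
    print(task, run_example(task, image))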

Running the Model

To use this model, simply follow these steps:

  1. Install the necessary packages: transformers, torch, Pillow, and requests.
  2. Load the model and processor from the Hugging Face Model Hub into your environment.
  3. Prepare your image, converting it to RGB if it is not already in that mode.
  4. Choose a task prompt (for example, <MORE_DETAILED_CAPTION>) that tells the model what kind of caption to generate.
  5. Run the model using the run_example function defined above (a compact end-to-end sketch for a local image follows this list).
  6. Print the result to see the generated caption.
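
As a recap, here is a minimal end-to-end sketch that follows the steps above for an image stored on disk. The file name my_photo.jpg is just a placeholder; everything else reuses the model, processor, and run_example function defined earlier.

from PIL import Image

# Placeholder path; replace with your own image file.
local_image = Image.open("my_photo.jpg")

# Reuse the run_example helper with the detailed-caption task prompt.
caption = run_example("<MORE_DETAILED_CAPTION>", local_image)
print(caption)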

Troubleshooting

If you encounter any issues while implementing the model, consider the following solutions:

  • Ensure that your runtime environment has the necessary libraries installed (transformers, torch, Pillow, requests) and reasonably up to date.
  • If an image does not load correctly, check the URL or file path and ensure the image format is supported by Pillow.
  • Review the available memory in your environment; large models can exceed GPU limits. The half-precision sketch after this list is one way to reduce the footprint.
  • Update the model and processor from the Hub to pick up any recent bug fixes or enhancements.
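
If memory is the bottleneck, one common workaround is to load the weights in half precision. This is a sketch under the assumption that you are running on a CUDA GPU; note that the processor’s inputs must be cast to the same dtype before calling generate.

import torch
from transformers import AutoModelForCausalLM

# Load the weights in float16 to roughly halve GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "thwri/CogFlorence-2.1-Large",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda").eval()

# Cast the processor outputs to the same dtype before generation, e.g.:
# inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

# Check how much GPU memory the model currently occupies.
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")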

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
