How to Use Microsoft Florence-2 with Transformers.js

Oct 28, 2024 | Educational

If you’re looking to explore the capabilities of image captioning using the Microsoft Florence-2 model with Transformers.js, you’ve come to the right place! In this guide, we’ll walk you step-by-step through the process of setting up and using this powerful model.

Getting Started

The Microsoft Florence-2 model is designed for generating text from images, and it runs in Transformers.js via ONNX weights. Support for Florence-2 is currently experimental, so let’s delve into how you can make it work!

Installation Requirements

  • Ensure you have Node.js installed on your machine.
  • You will need the Transformers.js library. At the time of writing, Florence-2 support requires version 3, installed directly from the GitHub source.

Installing Transformers.js

To install the required version of Transformers.js, execute the following command in your terminal:

npm install xenova/transformers.js#v3
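Note that the example code in this guide uses ES module `import` syntax and top-level `await`. If you run it with Node.js, your project should be configured as an ES module (or save the script with an `.mjs` extension). A minimal `package.json` for this would contain:

```json
{
  "type": "module"
}
```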

Example: Performing Image Captioning

Now that you have everything set up, let’s move on to the code that performs image captioning with the Florence-2 model.

Think of the process like a chef preparing a gourmet meal. Each ingredient (or component) is essential for the final dish. Here’s how the ingredients interact:

  • The model is like the chef, responsible for creating the final caption based on the input image.
  • The processor prepares the ingredients (image data) in a way the chef can work with.
  • The tokenizer converts the text prompt into tokens the model understands, and decodes the model’s output back into readable text, similar to a waiter relaying your order and then presenting your meal.

Here’s how the code looks:

import {
    Florence2ForConditionalGeneration,
    AutoProcessor,
    AutoTokenizer,
    RawImage,
} from '@xenova/transformers';

// Load model, processor, and tokenizer
const model_id = 'onnx-community/Florence-2-base-ft';
const model = await Florence2ForConditionalGeneration.from_pretrained(model_id, { dtype: 'fp32' });
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);

// Load image and prepare vision inputs
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Specify task and prepare text inputs
const task = '<MORE_DETAILED_CAPTION>';
const prompts = processor.construct_prompts(task);
const text_inputs = tokenizer(prompts);

// Generate text (the inputs are spread into a single options object)
const generated_ids = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    max_new_tokens: 100,
});

// Decode generated text (special tokens are kept for post-processing)
const generated_text = tokenizer.batch_decode(generated_ids, { skip_special_tokens: false })[0];

// Post-process the generated text
const result = processor.post_process_generation(generated_text, task, image.size);
console.log(result);
// { '<MORE_DETAILED_CAPTION>': 'A green car is parked in front of a tan building. There is a brown door on the building behind the car. There are two windows on the front of the building.' }
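One detail worth noting: `post_process_generation` returns an object keyed by the task token rather than a bare string, so you typically index into it to get the caption itself. The sketch below is plain JavaScript with no library needed; the result value is illustrative, taken from the example output above:

```javascript
// Illustrative result shape: an object keyed by the task token.
const task = '<MORE_DETAILED_CAPTION>';
const result = { '<MORE_DETAILED_CAPTION>': 'A green car is parked in front of a tan building.' };

// Index by the task token to get the caption text itself:
const caption = result[task];
console.log(caption); // → 'A green car is parked in front of a tan building.'
```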

Understanding the Code

In the code above, we start by importing the necessary components. We then load the model, processor, and tokenizer, akin to gathering all the ingredients. The image is loaded from a URL, just like sourcing fresh produce. We specify the task of generating a more detailed caption and prepare the inputs. The model generates the output, our finished dish, and finally we decode and post-process it into a readable result!

Troubleshooting

While this guide covers the basics, you might run into issues. Here are a few troubleshooting tips:

  • Model not loading: Ensure that the model ID is correctly typed and that your internet connection is stable.
  • Errors during installation: Verify that Node.js is installed correctly, and that you are using the proper command for installation.
  • Image not loading: Check that the image URL is correct and that it is accessible.
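As a concrete example of coping with flaky downloads, you could wrap the `from_pretrained` calls in a small retry helper. This helper is not part of Transformers.js; it is a generic sketch you can adapt:

```javascript
// Generic retry helper (hypothetical, not part of Transformers.js):
// retries an async function a few times before giving up, which can help
// with intermittent network failures while downloading model weights.
async function withRetry(fn, attempts = 3, delayMs = 1000) {
    let lastErr;
    for (let i = 0; i < attempts; i++) {
        try {
            return await fn();
        } catch (err) {
            lastErr = err;
            // Wait briefly before the next attempt (skip after the last one).
            if (i < attempts - 1) {
                await new Promise((resolve) => setTimeout(resolve, delayMs));
            }
        }
    }
    throw lastErr;
}

// Hypothetical usage with the loading code from the example above:
// const model = await withRetry(() =>
//     Florence2ForConditionalGeneration.from_pretrained(model_id, { dtype: 'fp32' }));
```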

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Additional Notes

It’s worth noting that having a separate repository for ONNX weights is intended to be a temporary solution. For web-ready models, consider converting to ONNX using Optimum and structuring your repository accordingly.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
