How to Use Florence-2 for Image Captioning with Transformers.js

Jul 1, 2024 | Educational

Are you ready to unlock the power of image captioning using the Florence-2 model and the Transformers.js library? This guide will walk you through the steps to set up your environment and generate detailed captions from images. Let’s dive in!

What You Will Need

  • A JavaScript environment (Node.js recommended).
  • The Transformers.js library, installed from source (the v3 branch on GitHub).

Setting Up Transformers.js

Florence-2 support is experimental and requires installing Transformers.js from source, in this case the v3 branch of the GitHub repository.

Run the following command in your project directory:

npm install xenova/transformers.js#v3
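Before moving on, it can help to confirm that your Node.js runtime is recent enough. The snippet below is a minimal sketch rather than an official check: the helper name checkNodeVersion is hypothetical, and the minimum major version of 18 is an assumption that you should verify against the Transformers.js release you install.

```javascript
// Hypothetical helper: parses a Node.js version string and compares its
// major version against an assumed minimum (18 is an assumption for
// illustration, not a requirement stated in this guide).
function checkNodeVersion(version, minMajor = 18) {
    const major = Number.parseInt(version.replace(/^v/, '').split('.')[0], 10);
    return { major, ok: Number.isInteger(major) && major >= minMajor };
}

// Check the runtime that is executing this script
const runtime = checkNodeVersion(process.versions.node);
console.log(runtime.ok
    ? `Node.js ${runtime.major} detected.`
    : `Node.js ${runtime.major} detected; consider upgrading before installing.`);
```

Note that the main example in this guide uses import statements and top-level await, so it has to run as an ES module: save it with an .mjs extension or set "type": "module" in package.json.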

Using Florence-2 for Image Captioning

Once you have Transformers.js installed, you’re ready to implement image captioning! Here’s a step-by-step breakdown using our code example:

import {
    Florence2ForConditionalGeneration,
    AutoProcessor,
    AutoTokenizer,
    RawImage,
} from '@xenova/transformers';

// Load model, processor, and tokenizer
const model_id = 'onnx-community/Florence-2-base-ft';
const model = await Florence2ForConditionalGeneration.from_pretrained(model_id, { dtype: 'fp32' });
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);

// Load image and prepare vision inputs
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Specify task and prepare text inputs
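// The task must be a Florence-2 task token; '<CAPTION>' and '<DETAILED_CAPTION>' produce shorter captions
const task = '<MORE_DETAILED_CAPTION>';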
const prompts = processor.construct_prompts(task);
const text_inputs = tokenizer(prompts);

// Generate text
const generated_ids = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    max_new_tokens: 100,
});

// Decode generated text
const generated_text = tokenizer.batch_decode(generated_ids, { skip_special_tokens: false })[0];

// Post-process the generated text
const result = processor.post_process_generation(generated_text, task, image.size);
console.log(result);
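The steps above can also be wrapped in one reusable function. The following is a minimal sketch and not part of the original example: captionImage and its loadImage parameter are hypothetical names, and the function receives already-loaded components so that the model is loaded only once and the flow can be exercised with stand-in objects. It also assumes that the post-processed result is an object keyed by the task token.

```javascript
// Hypothetical wrapper around the steps above. `components` is assumed to hold
// the objects loaded earlier: { model, processor, tokenizer }. `loadImage` is
// an assumed callback, e.g. (url) => RawImage.fromURL(url).
async function captionImage(components, loadImage, url, task = '<MORE_DETAILED_CAPTION>') {
    const { model, processor, tokenizer } = components;

    // Prepare vision inputs from the image
    const image = await loadImage(url);
    const vision_inputs = await processor(image);

    // Prepare text inputs from the task prompt
    const prompts = processor.construct_prompts(task);
    const text_inputs = tokenizer(prompts);

    // Generate, decode (keeping special tokens), and post-process
    const generated_ids = await model.generate({
        ...text_inputs,
        ...vision_inputs,
        max_new_tokens: 100,
    });
    const generated_text = tokenizer.batch_decode(generated_ids, { skip_special_tokens: false })[0];
    const result = processor.post_process_generation(generated_text, task, image.size);

    // Assumption: the result is an object keyed by the task token
    return result[task];
}
```

With the objects from the example above, this could be called as: await captionImage({ model, processor, tokenizer }, (u) => RawImage.fromURL(u), url). Passing the components in, rather than loading them inside the function, keeps repeated calls cheap.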

Understanding the Code: A Creative Analogy

Imagine you’re an artist preparing to paint a masterpiece. First, you gather your supplies: your palette (the model), brushes (the processor), and canvas (the tokenizer). You then pick a photograph (the image) to inspire your artwork. Next, you brainstorm what you want to depict (constructing the prompt from the task). You mix your colors (the vision inputs and text inputs) and finally start painting (generating text), ending with a finished piece (the caption) that tells the image’s story. Each step is needed for the final caption to come out right.

Common Troubleshooting Tips

If you encounter issues during setup or execution, here are some troubleshooting ideas:

  • Installation Problems: Confirm that the npm install command above finished without errors, and check that your Node.js version is recent enough to run ES modules with top-level await.
  • Image Loading Errors: Verify that the image URL is correct and publicly accessible. Open it in a browser to confirm that it loads.
  • Model or Processor Not Found: Make sure the model ID is exactly onnx-community/Florence-2-base-ft and that the same ID is passed to the model, processor, and tokenizer.
  • Output Issues: If the generated text doesn’t make sense, check that the task variable holds a valid Florence-2 task token such as <MORE_DETAILED_CAPTION>. If the caption is cut off, try raising max_new_tokens.
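One cheap guard against the output issues above is to validate the task string before it reaches the model. This sketch assumes the three caption task tokens listed on the Florence-2 model card; the helper name assertCaptionTask is hypothetical, and Florence-2 supports other, non-caption tasks that are left out here.

```javascript
// Caption task tokens as listed on the Florence-2 model card (assumed here;
// other tasks such as object detection are intentionally left out).
const CAPTION_TASKS = ['<CAPTION>', '<DETAILED_CAPTION>', '<MORE_DETAILED_CAPTION>'];

// Hypothetical helper: throws early with a readable message instead of
// letting an empty or misspelled task string reach the model.
function assertCaptionTask(task) {
    if (!CAPTION_TASKS.includes(task)) {
        throw new Error(
            `Unknown caption task "${task}". Expected one of: ${CAPTION_TASKS.join(', ')}`
        );
    }
    return task;
}
```

Calling assertCaptionTask(task) just before processor.construct_prompts(task) turns a confusing generation result into an immediate, descriptive error.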

Conclusion

Florence-2 and Transformers.js let you generate detailed image captions entirely in JavaScript. Once the setup above works, try other images and task prompts to see how the output changes. Explore, experiment, and enjoy the creative process!

