How to Use nanoLLaVA with Transformers.js

May 19, 2024 | Educational

Are you ready to dive into the fascinating world of AI and multimodal models? Today, we will explore how to use the nanoLLaVA model in your projects with Transformers.js. Whether you’re aiming to analyze images and text or simply want to add some AI magic to your web applications, you’re in the right place!

What You Need to Get Started

First, ensure you have the following requirements covered:

  • Transformers.js: The library we’ll be using. Make sure to use version 3, which you can install directly from GitHub with:
    npm install xenova/transformers.js#v3
  • Node.js: Ensure you have Node.js installed on your machine for JavaScript development. (A minimal project setup sketch follows this list.)
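
The snippets in this guide use ES module imports and top-level await, so run them as an ES module (set "type": "module" in your package.json, or use an .mjs file). A minimal package.json might look like the sketch below; the package name is just a placeholder, and the dependency entry is roughly what the GitHub install command above records:

{
  "name": "nanollava-demo",
  "type": "module",
  "dependencies": {
    "@xenova/transformers": "github:xenova/transformers.js#v3"
  }
}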

Loading the Model

Let’s walk through the process of loading the nanoLLaVA model. Think of this like packing a toolbox before starting your fix-it project. Each tool (or component of the model) is essential for a smooth operation!

  • First, import the necessary components:
    import { AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration, RawImage } from '@xenova/transformers';
  • Initialize the tokenizer, processor, and model (a note on tracking download progress follows this list):
    const model_id = 'Xenova/nanoLLaVA';
    const tokenizer = await AutoTokenizer.from_pretrained(model_id);
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
        dtype: {
            embed_tokens: 'fp16',        // Precision of the embedding layer
            vision_encoder: 'fp16',      // Precision of the vision encoder
            decoder_model_merged: 'q4',  // Precision of the merged decoder
        },
        device: 'webgpu',                // Run on WebGPU; omit to use the default backend
    });
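
The model files are fairly large, so the first load can take a while as the weights are downloaded and cached. Transformers.js supports a progress_callback option on from_pretrained that you can use to surface loading progress; the exact fields on the progress events may vary between versions, so the sketch below simply logs them:

// Optional: report download/loading progress while the model is fetched.
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
    dtype: { embed_tokens: 'fp16', vision_encoder: 'fp16', decoder_model_merged: 'q4' },
    device: 'webgpu',
    progress_callback: (event) => console.log(event), // e.g. drive a loading indicator from these events
});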

Preparing Your Inputs

Imagine you’re prepping ingredients before cooking. You need to prepare both text and vision inputs to ensure everything blends well together:

  • Text Inputs: build the prompt as a chat conversation (note the <image> placeholder), apply the chat template, and tokenize:
    const prompt = 'What does the text say?';
    const messages = [
        { role: 'system', content: 'Answer the question.' },
        { role: 'user', content: `<image>\n${prompt}` }
    ];
    const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
    const text_inputs = tokenizer(text);
  • Vision Inputs: load the image and run it through the processor (a quick sanity check follows this list):
    const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
    const image = await RawImage.fromURL(url);
    const vision_inputs = await processor(image);
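
Before moving on, it can help to sanity-check what you just prepared. The property names below follow the usual Transformers.js conventions (pixel_values for the processed image tensor); if your version differs, just log the objects themselves:

// Optional sanity check: inspect the prepared inputs before generation.
console.log(text);                            // Chat-templated prompt containing the <image> placeholder
console.log(text_inputs.input_ids.dims);      // Shape of the tokenized prompt, e.g. [1, sequence_length]
console.log(vision_inputs.pixel_values.dims); // Shape of the preprocessed image tensor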

Generating a Response

Now, let’s generate a response! Think of it as asking a chef to comment on the meal you’ve prepared. Here’s how to do it:

const { past_key_values, sequences } = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
    return_dict_in_generate: true,
});

// Decode only the newly generated tokens (everything after the prompt)
const answer = tokenizer.decode(sequences.slice(0, [text_inputs.input_ids.dims[1], null]), { skip_special_tokens: true });

Finally, log the answer:

console.log(answer); // The text reads: Small but mighty.
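
In a real application you will likely want to reuse the loaded tokenizer, processor, and model for many questions. The helper below is just an illustration (the askImageQuestion name is made up); it simply wraps the steps shown above:

// Illustrative helper that reuses the already-loaded tokenizer, processor, and model.
async function askImageQuestion(imageUrl, question) {
    // Build the chat-formatted text inputs
    const messages = [
        { role: 'system', content: 'Answer the question.' },
        { role: 'user', content: `<image>\n${question}` },
    ];
    const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
    const text_inputs = tokenizer(text);

    // Build the vision inputs
    const image = await RawImage.fromURL(imageUrl);
    const vision_inputs = await processor(image);

    // Generate and decode, skipping the prompt portion of the output sequence
    const { sequences } = await model.generate({
        ...text_inputs,
        ...vision_inputs,
        do_sample: false,
        max_new_tokens: 64,
        return_dict_in_generate: true,
    });
    return tokenizer.decode(
        sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
        { skip_special_tokens: true },
    );
}

// Example usage:
console.log(await askImageQuestion(url, 'What does the text say?'));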

Troubleshooting

If you run into issues while using nanoLLaVA, consider the following troubleshooting tips:

  • Ensure that Transformers.js is installed properly and updated to version 3.
  • Check that the provided URLs for the model and image are correct and accessible.
  • Make sure your environment supports WebGPU if you want GPU acceleration (see the check after this list).
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
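
If you are unsure whether the browser supports WebGPU, you can probe for it before requesting the webgpu device. navigator.gpu is the standard WebGPU entry point, and omitting the device option simply lets Transformers.js fall back to its default backend:

// Only request the 'webgpu' device when the browser actually exposes WebGPU.
const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
    dtype: { embed_tokens: 'fp16', vision_encoder: 'fp16', decoder_model_merged: 'q4' },
    ...(hasWebGPU ? { device: 'webgpu' } : {}), // Omit `device` to use the default backend
});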

Conclusion

Using nanoLLaVA with Transformers.js can open doors to exciting AI applications. Whether you are processing images or generating text, understanding this powerful integration will enhance your toolkit.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
