UForm: Pocket-Sized Multimodal AI for Content Understanding and Generation

Revolutionizing Image Captioning and Visual Question Answering

In the fast-paced realm of AI, the ability to understand and generate content from images is a game-changer. Welcome to UForm-Gen, a cutting-edge generative model designed for image captioning and visual question answering. In this article, we’ll explore how to use this powerful tool, how it works, and how to troubleshoot common issues for a seamless experience.

Description

UForm-Gen is a compact generative vision-language model crafted primarily for image captioning and visual question answering. The model comprises two main components: a visual encoder and a compact language model (in the uform-gen2-qwen-500m variant, a Qwen-based decoder with roughly 0.5B parameters, as the model name suggests).

The model was pre-trained on an internal image captioning dataset and fine-tuned on public instruction datasets such as SVIT, LVIS, and various VQA datasets. Remarkably, training took only a day on a DGX-H100 node with 8x H100 GPUs, thanks to the computational muscle of Nebius.ai.
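
To get a sense of just how pocket-sized the model is, you can load it and count its parameters. The snippet below is a minimal sketch using the standard transformers and PyTorch APIs; the model ID is the one used in the usage section later in this article.

    from transformers import AutoModel

    # trust_remote_code is required because the model ships custom modeling code.
    model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-500m",
                                      trust_remote_code=True)

    # Count parameters across both the visual encoder and the language model.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Total parameters: {n_params / 1e6:.0f}M")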

How to Use UForm-Gen

Using UForm-Gen is a straightforward process. To make it concrete, think of UForm-Gen as a chef in a kitchen: you provide the ingredients (an image and a text prompt), and the chef crafts delightful dishes (captions and answers) based on your instructions. Here are the steps to guide you:

  • First, install the necessary libraries (Pillow and PyTorch are needed alongside transformers):
    pip install transformers torch pillow
  • Then, import the required modules:
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor
  • Load the model and processor:
    model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)
    processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)
  • Prepare your prompt and image:
    prompt = "Your question or instruction here"
    image = Image.open("image.jpg")
  • Create inputs using the processor, then generate the output:
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    with torch.inference_mode():
        output = model.generate(**inputs,
                                do_sample=False,
                                use_cache=True,
                                max_new_tokens=256,
                                eos_token_id=151645,
                                pad_token_id=processor.tokenizer.pad_token_id)
  • Finally, decode only the newly generated tokens by slicing off the prompt (the full script, assembled end to end, appears after this list):
    prompt_len = inputs["input_ids"].shape[1]
    decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
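
Putting the steps together, here is a minimal end-to-end sketch. It follows the usage pattern above; the file name image.jpg and the prompt text are placeholders for your own inputs.

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    MODEL_ID = "unum-cloud/uform-gen2-qwen-500m"

    # Load the model and its processor; trust_remote_code is required
    # because the model ships custom modeling code.
    model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    # "image.jpg" is a placeholder -- point it at any local image.
    image = Image.open("image.jpg")
    prompt = "Describe this image in one sentence."

    inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    with torch.inference_mode():
        output = model.generate(**inputs,
                                do_sample=False,   # greedy decoding for reproducible captions
                                use_cache=True,
                                max_new_tokens=256,
                                eos_token_id=151645,
                                pad_token_id=processor.tokenizer.pad_token_id)

    # Strip the prompt tokens so only the generated answer is decoded.
    prompt_len = inputs["input_ids"].shape[1]
    decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
    print(decoded_text)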

Evaluation

The performance of UForm-Gen can be benchmarked against other models. Here’s a quick comparison of UForm-Gen (0.5B) against some other popular models:

Model                  LLM Size   SQA    MME      MMBench   Average
UForm-Gen2-Qwen-500m   0.5B       45.5   880.1    42.0      29.3
MobileVLM v2           1.4B       52.1   1302.8   57.7      36.8
LLaVA-Phi              2.7B       68.4   1335.1   59.8      42.9

Troubleshooting

If you encounter issues while using the model, here are some troubleshooting ideas:

  • Ensure you have the correct library versions installed.
  • Verify that your image file is within acceptable dimensions and formats (e.g., JPG, PNG).
  • Check if your prompts are clear and contextually appropriate.
  • If the model fails to generate any output, consider retrying with simplified prompts.
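
If problems persist, a short pre-flight script can rule out the most common environment issues before you debug the model itself. This is a hypothetical sketch: it prints the installed library versions and confirms your image opens and converts to RGB (image.jpg is a placeholder).

    import sys

    import PIL
    import torch
    import transformers
    from PIL import Image

    # Print versions so mismatches are easy to spot.
    print(f"Python:       {sys.version.split()[0]}")
    print(f"torch:        {torch.__version__}")
    print(f"transformers: {transformers.__version__}")
    print(f"Pillow:       {PIL.__version__}")

    # Confirm the image opens and normalize it to RGB; many PNGs
    # are RGBA or palette-based, which can trip up preprocessing.
    image = Image.open("image.jpg").convert("RGB")
    print(f"Image size:   {image.size}, mode: {image.mode}")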

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
