UForm: A Multi-Modal Inference Library

Mar 29, 2024 | Educational

For Semantic Search Applications

Welcome to the world of UForm, a powerful Multi-Modal Inference library that unlocks the potential of semantic search by encoding various forms of data—texts, images, and soon audio, video, and documents—into a shared vector space. With UForm, you can effectively manage and query data across multiple modalities using state-of-the-art models.

Understanding the Model

This guide focuses on the model optimized for English, featuring:

  • A 12-layer BERT text encoder (6 layers for unimodal encoding, the remaining 6 for multimodal encoding)
  • A ViT-L/14 image encoder (224×224 input resolution)
  • Multiple embedding sizes: 64, 256, 512, and 768 (see the sketch below)

If you’re interested in a multilingual model, a multilingual checkpoint is also available in the unum-cloud collection on Hugging Face.
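The multiple embedding sizes suggest Matryoshka-style representations, where the leading dimensions of the full 768-dimensional vector already form a usable smaller embedding. Below is a minimal sketch of truncating an embedding to save memory and compute; note this is an assumption about how the sizes are meant to be used, illustrated on a random tensor:

python
import torch
import torch.nn.functional as F

# Hypothetical full-size embedding: batch of 1, 768 dimensions.
full_embedding = F.normalize(torch.randn(1, 768), dim=-1)

# Matryoshka-style truncation: keep the leading 256 dimensions
# and re-normalize to obtain a smaller, cheaper embedding.
small_embedding = F.normalize(full_embedding[..., :256], dim=-1)
print(small_embedding.shape)  # torch.Size([1, 256])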

Performance Evaluation

UForm’s effectiveness is quantified below with text-to-image retrieval recall (via multimodal re-ranking) on Flickr and MS-COCO, along with ImageNet classification accuracy:

Dataset              Recall@1   Recall@5   Recall@10
Zero-Shot Flickr     0.693      0.875      0.923
Zero-Shot MS-COCO    0.382      0.617      0.728

ImageNet classification accuracy: 0.518 (Top-1), 0.756 (Top-5).

Installation

Installing UForm is as simple as executing a single command in your terminal:

bash
pip install uform[torch]
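
To verify the installation succeeded, a quick import check is enough:

bash
python -c "import uform; print('UForm imported successfully')"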

Usage Guide

To unleash the capabilities of UForm, follow these steps.

Loading the Model

python
import uform

# Download (or load from cache) the English large model and its
# matching pre-processor from the Hugging Face Hub.
model, processor = uform.get_model('unum-cloud/uform-vl-english-large')

Encoding Data

Encoding your data is a breeze:

python
from PIL import Image

text = "a small red panda in a zoo"
image = Image.open("red_panda.jpg")

# Pre-process raw inputs into model-ready tensors.
image_data = processor.preprocess_image(image)
text_data = processor.preprocess_text(text)

# Unimodal encoding: intermediate features plus pooled embeddings.
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Multimodal encoding: a joint embedding of the image-text pair.
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
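
The embeddings come back as PyTorch tensors. If you plan to store them in a vector index, a common pattern is to L2-normalize them (so dot products equal cosine similarities) and convert them to NumPy; a minimal sketch, assuming the variables from the snippet above:

python
import torch.nn.functional as F

# Normalize, detach from the autograd graph, and move to NumPy
# so the vectors can be stored in a vector index.
image_vector = F.normalize(image_embedding, dim=-1).detach().cpu().numpy()
text_vector = F.normalize(text_embedding, dim=-1).detach().cpu().numpy()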

Getting Features

If you have already computed the unimodal features, you can reuse them for multimodal encoding instead of re-running the unimodal encoders, which saves compute on repeated queries:

python
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Reuse the cached unimodal features; only the multimodal layers run here.
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask'])

Measuring Similarity

UForm provides two options for determining semantic compatibility between text and image: Cosine Similarity and Matching Score.

Cosine Similarity

This option is computationally inexpensive:

python
import torch.nn.functional as F
similarity = F.cosine_similarity(image_embedding, text_embedding)

Keep in mind that cosine similarity yields values in the range [-1, 1], where 1 represents a perfect match.
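
Because cosine similarity is just a dot product of normalized vectors, it scales to batches: ranking one text query against many images is a single matrix multiplication. A sketch, assuming you have stacked embeddings from multiple encode_image calls into one tensor (the variable names are illustrative):

python
import torch.nn.functional as F

# image_embeddings: (N, D) tensor stacked from many encode_image calls
# text_embedding:   (1, D) tensor for the query
image_norm = F.normalize(image_embeddings, dim=-1)
text_norm = F.normalize(text_embedding, dim=-1)

scores = text_norm @ image_norm.T             # (1, N) cosine similarities
top_scores, top_idx = scores.topk(5, dim=-1)  # the 5 best-matching images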

Matching Score

Here, you’ll need the joint embedding:

python
score = model.get_matching_scores(joint_embedding)

This produces scores in the range [0, 1], where 1 indicates a perfect match. It captures finer cross-modal details than cosine similarity but requires more computational resources.
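
The two options combine naturally into a two-stage retrieval pipeline: shortlist candidates with cheap cosine similarity, then re-rank the shortlist with the more expensive matching score. A sketch, assuming the top_idx shortlist from the cosine example above and a hypothetical cache all_image_features of precomputed unimodal image features:

python
# Stage 2: re-rank the cosine-similarity shortlist with the joint encoder.
reranked = []
for idx in top_idx[0].tolist():
    joint = model.encode_multimodal(
        image_features=all_image_features[idx:idx + 1],  # hypothetical cache
        text_features=text_features,
        attention_mask=text_data['attention_mask'])
    score = model.get_matching_scores(joint)
    reranked.append((float(score), idx))

reranked.sort(reverse=True)  # best match first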

Troubleshooting

If you encounter issues at any point while using UForm, consider the following troubleshooting steps:

  • Make sure all required dependencies are installed correctly.
  • Check that the image and text inputs are properly formatted.
  • Verify internet connection if loading models from the cloud.
  • Double-check the code snippets for Python syntax errors.
  • For further assistance, consult the community forums or the Hugging Face documentation.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
