Multi-Modal Inference Library for Semantic Search Applications
UForm is a powerful multi-modal inference package designed to encode multilingual texts, images, and soon audio, video, and documents into a shared vector space. In this article, we walk through how to set up and use UForm in a practical, step-by-step guide.
Getting Started with UForm
Before diving into the code, ensure you have Python installed along with necessary libraries. Here’s how you can install UForm:
```bash
pip install "uform[torch]"
```
Loading the Model
Once you have installed UForm, you can load the multilingual model using the following code:
```python
import uform

# Downloads the checkpoint from the Hugging Face Hub on first use
model, processor = uform.get_model("unum-cloud/uform-vl-multilingual-v2")
```
Encoding Data
Imagine UForm as a translator that can not only understand different languages (texts) but also visual languages (images). Here’s how to encode data:
```python
from PIL import Image

text = "a small red panda in a zoo"
image = Image.open("red_panda.jpg")

# Convert the raw inputs into model-ready tensors
image_data = processor.preprocess_image(image)
text_data = processor.preprocess_text(text)

# return_features=True also returns intermediate features,
# which can be reused for multimodal encoding later
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Joint image-text embedding computed directly from the raw inputs
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
```
Think of the image and text as puzzle pieces: encoding projects both into the same vector space, creating connections between the two kinds of information so they can be compared and combined directly.
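To make that "shared space" concrete, here is a minimal sketch (assuming the tensors above live on the CPU) that inspects the embedding shapes and stacks them into a single matrix, the kind of structure a vector index for semantic search would consume:

```python
import numpy as np

# The exact embedding width depends on the checkpoint, so read it from the tensors
print("image embedding:", tuple(image_embedding.shape))
print("text embedding:", tuple(text_embedding.shape))

# Both embeddings live in the same space, so they can sit side by side
# as rows of one matrix feeding a vector index
vectors = np.vstack([
    image_embedding.detach().numpy(),
    text_embedding.detach().numpy(),
])
print("index matrix:", vectors.shape)
```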
Extracting Features
The intermediate features returned alongside the embeddings (when return_features=True) can be reused to build the joint embedding without running the image and text encoders a second time:
```python
# Unimodal features, computed once with return_features=True (as in the previous step)
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Reuse the cached features instead of re-encoding the raw inputs
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data["attention_mask"],
)
```
Calculating Semantic Compatibility
To find out how closely related the image is to the text, you have two methods to choose from: Cosine Similarity and Matching Score.
Cosine Similarity
This is a quick and computationally efficient method to estimate compatibility:
```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```
Here, values close to 1 indicate a strong match. This method is fast, but it compares only the pooled embeddings, so it captures coarse-grained rather than fine-grained alignment.
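As a usage example, here is a minimal retrieval sketch (the candidate captions are made up purely for illustration) that ranks several texts against the image by cosine similarity:

```python
import torch
import torch.nn.functional as F

# Hypothetical candidate captions, purely for illustration
captions = [
    "a small red panda in a zoo",
    "a bowl of ramen on a wooden table",
    "a red fox running through snow",
]

# Encode each caption and stack the embeddings into a single (N, dim) tensor
caption_embeddings = torch.cat(
    [model.encode_text(processor.preprocess_text(c)) for c in captions]
)

# Broadcast the single image embedding against every caption embedding
scores = F.cosine_similarity(image_embedding, caption_embeddings)
print(captions[scores.argmax().item()], scores.tolist())
```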
Matching Score
For a more refined assessment, you can use the matching score, which requires the joint embedding computed above:
```python
score = model.get_matching_scores(joint_embedding)
```
This method effectively captures fine-grained features and is useful for re-ranking results, although it’s more resource-intensive.
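A common pattern is to shortlist candidates with the cheap cosine similarity and then re-rank the shortlist with the matching score. A minimal sketch of that two-stage flow, reusing the hypothetical captions from the previous example, might look like this:

```python
# Keep the two best captions from the cosine-similarity pass above
top_k = scores.topk(2).indices.tolist()

reranked = []
for i in top_k:
    # Build the joint embedding for each shortlisted caption and score it against the image
    candidate_data = processor.preprocess_text(captions[i])
    candidate_joint = model.encode_multimodal(image=image_data, text=candidate_data)
    reranked.append((captions[i], model.get_matching_scores(candidate_joint).item()))

# The highest matching score wins the final ranking
reranked.sort(key=lambda pair: pair[1], reverse=True)
print(reranked)
```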
Troubleshooting
- Issue: Model not loading – Ensure you have a stable internet connection and that all dependencies installed correctly; the weights are downloaded on first use.
- Issue: Image not found – Make sure the file name and path in the code are correct.
- Issue: Memory errors – Consider downscaling large images before preprocessing, or close other applications consuming memory; see the sketch after this list.
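One way to reduce memory pressure is to shrink oversized images with Pillow before handing them to the processor; the 512-pixel cap below is an arbitrary illustrative choice, not a UForm requirement:

```python
from PIL import Image

image = Image.open("red_panda.jpg")

# Cap the longest side at 512 pixels while preserving the aspect ratio;
# the processor will still resize the image to the model's expected input size
image.thumbnail((512, 512))

image_data = processor.preprocess_image(image)
```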
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With UForm, the possibilities for multi-modal applications are vast. From understanding diverse languages to processing images, this library acts like a master interpreter, facilitating meaningful interactions between different forms of data.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

