Multi-Modal Inference Library for Semantic Search Applications
UForm is a powerful multi-modal inference package designed to encode multilingual texts, images, and soon audio, video, and documents into a shared vector space. In this article, we walk through how to set up and use UForm in a practical, step-by-step guide.
Getting Started with UForm
Before diving into the code, ensure you have Python installed along with necessary libraries. Here’s how you can install UForm:
```bash
pip install "uform[torch]"
```
Loading the Model
Once you have installed UForm, you can load the multilingual model using the following code:
```python
import uform

# Downloads the checkpoint from the Hugging Face Hub on first use
model, processor = uform.get_model("unum-cloud/uform-vl-multilingual-v2")
```
Encoding Data
Imagine UForm as a translator that can not only understand different languages (texts) but also visual languages (images). Here’s how to encode data:
```python
from PIL import Image

text = "a small red panda in a zoo"
image = Image.open("red_panda.jpg")

# Convert the raw inputs into model-ready tensors
image_data = processor.preprocess_image(image)
text_data = processor.preprocess_text(text)

# return_features=True also returns intermediate features,
# which can be reused for multimodal encoding later
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Joint image-text embedding computed directly from the raw inputs
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
```
Think of the image and text as puzzle pieces: encoding projects both into the same vector space, creating connections between the two kinds of information so they can be compared and combined directly.
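To make that "shared space" concrete, here is a minimal sketch (assuming the tensors above live on the CPU) that inspects the embedding shapes and stacks them into a single matrix, the kind of structure a vector index for semantic search would consume:

```python
import numpy as np

# The exact embedding width depends on the checkpoint, so read it from the tensors
print("image embedding:", tuple(image_embedding.shape))
print("text embedding:", tuple(text_embedding.shape))

# Both embeddings live in the same space, so they can sit side by side
# as rows of one matrix feeding a vector index
vectors = np.vstack([
    image_embedding.detach().numpy(),
    text_embedding.detach().numpy(),
])
print("index matrix:", vectors.shape)
```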
Extracting Features
The intermediate features returned alongside the embeddings (when return_features=True) can be reused to build the joint embedding without running the image and text encoders a second time:
```python
# Unimodal features, computed once with return_features=True (as in the previous step)
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Reuse the cached features instead of re-encoding the raw inputs
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data["attention_mask"],
)
```
Calculating Semantic Compatibility
To find out how closely related the image is to the text, you have two methods to choose from: Cosine Similarity and Matching Score.
Cosine Similarity
This is a quick and computationally efficient method to estimate compatibility:
```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```
Here, values close to 1 indicate a strong match. This method is fast, but it compares only the pooled embeddings, so it captures coarse-grained rather than fine-grained alignment.
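As a usage example, here is a minimal retrieval sketch (the candidate captions are made up purely for illustration) that ranks several texts against the image by cosine similarity:

```python
import torch
import torch.nn.functional as F

# Hypothetical candidate captions, purely for illustration
captions = [
    "a small red panda in a zoo",
    "a bowl of ramen on a wooden table",
    "a red fox running through snow",
]

# Encode each caption and stack the embeddings into a single (N, dim) tensor
caption_embeddings = torch.cat(
    [model.encode_text(processor.preprocess_text(c)) for c in captions]
)

# Broadcast the single image embedding against every caption embedding
scores = F.cosine_similarity(image_embedding, caption_embeddings)
print(captions[scores.argmax().item()], scores.tolist())
```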
Matching Score
For a more refined assessment, you can use the matching score, which requires the joint embedding computed above:
```python
score = model.get_matching_scores(joint_embedding)
```
This method effectively captures fine-grained features and is useful for re-ranking results, although it’s more resource-intensive.
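A common pattern is to shortlist candidates with the cheap cosine similarity and then re-rank the shortlist with the matching score. A minimal sketch of that two-stage flow, reusing the hypothetical captions from the previous example, might look like this:

```python
# Keep the two best captions from the cosine-similarity pass above
top_k = scores.topk(2).indices.tolist()

reranked = []
for i in top_k:
    # Build the joint embedding for each shortlisted caption and score it against the image
    candidate_data = processor.preprocess_text(captions[i])
    candidate_joint = model.encode_multimodal(image=image_data, text=candidate_data)
    reranked.append((captions[i], model.get_matching_scores(candidate_joint).item()))

# The highest matching score wins the final ranking
reranked.sort(key=lambda pair: pair[1], reverse=True)
print(reranked)
```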
Troubleshooting
- Issue: Model not loading – Ensure you have a stable internet connection and that all dependencies installed correctly; the weights are downloaded on first use.
- Issue: Image not found – Make sure the file name and path in the code are correct.
- Issue: Memory errors – Consider downscaling large images before preprocessing, or close other applications consuming memory; see the sketch after this list.
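One way to reduce memory pressure is to shrink oversized images with Pillow before handing them to the processor; the 512-pixel cap below is an arbitrary illustrative choice, not a UForm requirement:

```python
from PIL import Image

image = Image.open("red_panda.jpg")

# Cap the longest side at 512 pixels while preserving the aspect ratio;
# the processor will still resize the image to the model's expected input size
image.thumbnail((512, 512))

image_data = processor.preprocess_image(image)
```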
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With UForm, the possibilities for multi-modal applications are vast. From understanding diverse languages to processing images, this library acts like a master interpreter, facilitating meaningful interactions between different forms of data.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

