For Semantic Search Applications
Welcome to UForm, a multimodal inference library that unlocks semantic search by encoding different kinds of data (texts and images today, with audio, video, and documents planned) into a shared vector space. With UForm, you can index and query data across multiple modalities using state-of-the-art models.
Understanding the Model
This guide focuses on the model optimized for English, featuring:
- A 12-layer BERT text encoder (6 layers for unimodal encoding, the remaining 6 for multimodal fusion)
- A ViT-L/14 visual encoder with a 224×224 input resolution
- Multiple embedding sizes: 64, 256, 512, and 768 (a truncation sketch follows this list)
If you need multilingual coverage, a multilingual counterpart of the model is also available on the Hugging Face Hub.
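The smaller embedding sizes trade a little accuracy for much cheaper storage and faster search. Below is a minimal sketch of how a shorter embedding is typically derived, assuming the usual Matryoshka-style convention of truncating the full 768-dimensional vector and re-normalizing; the helper is illustrative, not part of the UForm API:

```python
import torch
import torch.nn.functional as F

def truncate_embedding(embedding: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Keep the first `dim` coordinates and re-normalize to unit length,
    # so cosine similarity remains meaningful on the shorter vector.
    return F.normalize(embedding[..., :dim], dim=-1)

full = F.normalize(torch.randn(1, 768), dim=-1)  # stand-in for a real embedding
small = truncate_embedding(full, dim=256)        # shape: (1, 256)
```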
Performance Evaluation
UForm's effectiveness can be quantified on standard benchmarks. The retrieval rows below report text-to-image retrieval with multimodal re-ranking (Recall@K); the ImageNet rows report Top-1 and Top-5 classification accuracy:

| Dataset | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| Zero-Shot Flickr | 0.693 | 0.875 | 0.923 |
| Zero-Shot MS-COCO | 0.382 | 0.617 | 0.728 |

| Dataset | Top-1 | Top-5 |
|---|---|---|
| ImageNet | 0.518 | 0.756 |
Installation
Installing UForm is as simple as executing a single command in your terminal:
```bash
pip install uform[torch]
```
Usage Guide
To unleash the capabilities of UForm, follow these steps.
Loading the Model
```python
import uform

model, processor = uform.get_model('unum-cloud/uform-vl-english-large')
```
Encoding Data
Encoding your data is a breeze:
```python
from PIL import Image

text = "a small red panda in a zoo"
image = Image.open("red_panda.jpg")

# Turn raw inputs into model-ready tensors
image_data = processor.preprocess_image(image)
text_data = processor.preprocess_text(text)

# Per-modality embeddings (and intermediate features for later reuse)
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Joint embedding of the image-text pair
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
```
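When indexing a whole collection, you will usually encode items in batches rather than one at a time. The sketch below assumes `preprocess_text` accepts a list of strings and that the encoder handles batched inputs; verify this against the UForm version you have installed:

```python
texts = [
    "a small red panda in a zoo",
    "a bowl of ramen with a soft-boiled egg",
    "a vintage car parked by the beach",
]

# Assumption: the processor collates a list of strings into one tensor batch.
batch = processor.preprocess_text(texts)
text_embeddings = model.encode_text(batch)  # one embedding per input text
```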
Getting Features
The intermediate features returned alongside the embeddings can be cached and reused, letting later multimodal calls skip the expensive unimodal encoders:
```python
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

# Feed the cached features straight into the multimodal encoder
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask'])
```
Measuring Similarity
UForm provides two options for determining semantic compatibility between text and image: Cosine Similarity and Matching Score.
Cosine Similarity
This option is computationally inexpensive:
```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```
Keep in mind that cosine similarity returns values in the range [-1, 1], where 1 indicates a perfect match.
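In practice you rank many pre-computed image embeddings against one text query. Here is a minimal retrieval sketch with stand-in data; `image_index` is a hypothetical name for your stacked image embeddings:

```python
import torch
import torch.nn.functional as F

# Stand-in index: one row per image; build it by stacking the outputs
# of model.encode_image over your collection.
image_index = F.normalize(torch.randn(1000, 768), dim=-1)
query = F.normalize(torch.randn(1, 768), dim=-1)  # a text embedding

scores = query @ image_index.T        # cosine similarity against every image
top_scores, top_ids = scores.topk(5)  # the five best-matching images
```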
Matching Score
This option requires the joint embedding:
```python
score = model.get_matching_scores(joint_embedding)
```
This produces scores in the range [0, 1], where 1 indicates a perfect match. It captures finer cross-modal detail than cosine similarity but requires more compute.
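The two scores combine naturally into a retrieve-then-re-rank pipeline: cosine similarity cheaply prunes the collection, and the matching head re-scores only the survivors. A sketch that continues from the snippets above; `candidate_features` is a hypothetical list of cached unimodal image features for the top cosine hits:

```python
# Reuses `model`, `text_features`, and `text_data` from earlier snippets.
reranked_scores = []
for image_features in candidate_features:  # hypothetical cached features
    joint = model.encode_multimodal(
        image_features=image_features,
        text_features=text_features,
        attention_mask=text_data['attention_mask'])
    reranked_scores.append(float(model.get_matching_scores(joint)))

best_candidate = max(range(len(reranked_scores)), key=reranked_scores.__getitem__)
```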
Troubleshooting
If you encounter issues at any point while using UForm, consider the following troubleshooting ideas:
- Make sure all required dependencies are installed correctly.
- Check that the image and text inputs are properly formatted (see the sanity check below).
- Verify your internet connection if loading models from the cloud.
- Re-check the code snippets for typos and syntax errors.
- For further assistance, consult the community forums or the Hugging Face documentation.
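When inputs are suspect, a quick shape-and-similarity check often exposes the problem. Continuing from the encoding snippets above:

```python
import torch.nn.functional as F

# A matched image-text pair should give 2-D (batch, dim) embeddings and a
# noticeably higher cosine similarity than an unrelated caption would.
print(image_embedding.shape, text_embedding.shape)
print(float(F.cosine_similarity(image_embedding, text_embedding)))
```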
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

