Welcome to this practical guide on using the multilingual version of the OpenAI CLIP-ViT-B-32 model. With this powerful tool, you can map images and text in more than 50 languages into a shared dense vector space, which opens up exciting possibilities such as multilingual image search and zero-shot image classification.
Setting Up the Environment
The first step toward making the most of the CLIP-ViT-B-32 model is installing the necessary package:
pip install -U sentence-transformers
Loading the Model
Once you’ve installed the package, you can load the models in Python. Think of loading a model as laying out a chef’s ingredients before cooking: the image model is our main chef, while the multilingual text model brings in flavor from more than 50 languages:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import requests
import torch

img_model = SentenceTransformer('clip-ViT-B-32')  # Image encoder
text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')  # Multilingual text encoder
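The first time you run this, the model weights are downloaded automatically, so expect a short wait. If a GPU is available, both constructors accept an optional device argument; this is an optional tweak, not a requirement:

img_model = SentenceTransformer('clip-ViT-B-32', device='cuda')  # optional: run the image encoder on a GPU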
Loading and Encoding Images
Next, let’s load and encode the images. Think of this step as gathering the ingredients for our recipe:
def load_image(url_or_path):
    if url_or_path.startswith('http:') or url_or_path.startswith('https:'):
        return Image.open(requests.get(url_or_path, stream=True).raw)
    else:
        return Image.open(url_or_path)
img_paths = [
    'https://unsplash.com/photos/QtxgNsmJQSs/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjM1ODQ0MjY3',
    'https://unsplash.com/photos/9UUoGaaHtNE/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8Mnx8Y2F0fHwwfHx8fDE2MzU4NDI1ODQw',
    'https://unsplash.com/photos/Siuwr3uCir0/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8NHx8YmVhY2h8fDB8fHx8MTYzNTg0MjYzMg'
]
images = [load_image(img) for img in img_paths]
img_embeddings = img_model.encode(images) # Encode the images
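Before moving on, you can sanity-check the result. The encode call returns a NumPy array, and with this model each image should map to a 512-dimensional vector (an optional check, assuming the three example images loaded successfully):

print(img_embeddings.shape)  # expected: (3, 512)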
Encoding Text
Now, let’s spice things up by encoding some text that describes these images:
texts = [
    'A dog in the snow',
    'Eine Katze',  # German: A cat
    'Una playa con palmeras.'  # Spanish: A beach with palm trees
]
text_embeddings = text_model.encode(texts) # Encode the text
Computing Similarities
Finally, it’s time to compare the flavors! Here we compute cosine similarities between the text and image embeddings so that each description can be matched with its closest image, like tasting the final dish to see whether everything comes together:
cos_sim = util.cos_sim(text_embeddings, img_embeddings)
for text, scores in zip(texts, cos_sim):
    max_img_idx = torch.argmax(scores)
    print('Text:', text)
    print('Score:', scores[max_img_idx])
    print('Path:', img_paths[max_img_idx], '\n')
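The introduction also mentioned multilingual zero-shot image classification, which works with the same two encoders. Below is a minimal sketch: the candidate labels are hypothetical examples, and the factor of 100 mirrors the logit scaling commonly used with CLIP similarities, so treat both as assumptions rather than fixed requirements.

# Hypothetical candidate labels, each in a different language
labels = ['Ein Hund im Schnee', 'Un gato', 'A beach with palm trees']
label_embeddings = text_model.encode(labels)

# Score the first image against every label and turn the scores into probabilities
logits = util.cos_sim(img_embeddings[0:1], label_embeddings) * 100  # 100 = assumed CLIP-style scaling
probs = torch.softmax(logits, dim=-1)[0]
for label, prob in zip(labels, probs):
    print(f'{label}: {prob.item():.3f}')

The label with the highest probability is the zero-shot prediction for that image.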
Multilingual Image Search – Demo
For an interactive demonstration, you can try out the multilingual image search demo on GitHub or access the Colab version.
Troubleshooting
If you run into issues during installation or execution, here are a few troubleshooting tips:
- Ensure you are running a Python version supported by `sentence-transformers`; compatibility issues can arise with older interpreters.
- Double-check that you’ve installed the necessary dependencies such as `sentence-transformers` and `torch` (a quick version check is shown after this list).
- If images don’t load, confirm that the URLs are correct and accessible.
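If you are unsure which versions are installed, a quick sanity check is to print them from Python. This minimal snippet only assumes the packages from the install step above are present:

import sys
import torch
import sentence_transformers

print('Python:', sys.version.split()[0])
print('torch:', torch.__version__)
print('sentence-transformers:', sentence_transformers.__version__)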
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
By utilizing the multilingual version of the CLIP-ViT-B-32 model, you can effectively bridge the gap between images and text across languages. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

