Welcome to this practical guide on using the multilingual version of the OpenAI CLIP-ViT-B-32 model. With this powerful tool, you can map images and text in more than 50 languages into a shared dense vector space, which opens up exciting possibilities such as multilingual image search and zero-shot image classification.
Setting Up the Environment
The first step toward making the most of the CLIP-ViT-B-32 model is installing the necessary package:
pip install -U sentence-transformers
Loading the Model
Once you’ve installed the package, you can load the models in Python. Think of loading a model as laying out a chef’s ingredients before cooking: the image model is our main chef, while the multilingual text model brings in flavor from more than 50 languages:
from sentence_transformers import SentenceTransformer, util
from PIL import Image
import requests
import torch

img_model = SentenceTransformer('clip-ViT-B-32')  # Image encoder
text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')  # Multilingual text encoder
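The first time you run this, the model weights are downloaded automatically, so expect a short wait. If a GPU is available, both constructors accept an optional device argument; this is an optional tweak, not a requirement:

img_model = SentenceTransformer('clip-ViT-B-32', device='cuda')  # optional: run the image encoder on a GPU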
Loading and Encoding Images
Next, let’s load and encode the images. Think of this step as gathering the ingredients for our recipe:
def load_image(url_or_path):
    if url_or_path.startswith('http:') or url_or_path.startswith('https:'):
        return Image.open(requests.get(url_or_path, stream=True).raw)
    else:
        return Image.open(url_or_path)
img_paths = [
    'https://unsplash.com/photos/QtxgNsmJQSs/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjM1ODQ0MjY3',
    'https://unsplash.com/photos/9UUoGaaHtNE/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8Mnx8Y2F0fHwwfHx8fDE2MzU4NDI1ODQw',
    'https://unsplash.com/photos/Siuwr3uCir0/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8NHx8YmVhY2h8fDB8fHx8MTYzNTg0MjYzMg'
]
images = [load_image(img) for img in img_paths]
img_embeddings = img_model.encode(images) # Encode the images
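Before moving on, you can sanity-check the result. The encode call returns a NumPy array, and with this model each image should map to a 512-dimensional vector (an optional check, assuming the three example images loaded successfully):

print(img_embeddings.shape)  # expected: (3, 512)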
Encoding Text
Now, let’s spice things up by encoding some text that describes these images:
texts = [
    'A dog in the snow',
    'Eine Katze',  # German: A cat
    'Una playa con palmeras.'  # Spanish: A beach with palm trees
]
text_embeddings = text_model.encode(texts) # Encode the text
Computing Similarities
Finally, it’s time to compare the flavors! Here we compute cosine similarities between the text and image embeddings so that each description can be matched with its closest image, like tasting the final dish to see whether everything comes together:
cos_sim = util.cos_sim(text_embeddings, img_embeddings)
for text, scores in zip(texts, cos_sim):
    max_img_idx = torch.argmax(scores)
    print('Text:', text)
    print('Score:', scores[max_img_idx])
    print('Path:', img_paths[max_img_idx], '\n')
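The introduction also mentioned multilingual zero-shot image classification, which works with the same two encoders. Below is a minimal sketch: the candidate labels are hypothetical examples, and the factor of 100 mirrors the logit scaling commonly used with CLIP similarities, so treat both as assumptions rather than fixed requirements.

# Hypothetical candidate labels, each in a different language
labels = ['Ein Hund im Schnee', 'Un gato', 'A beach with palm trees']
label_embeddings = text_model.encode(labels)

# Score the first image against every label and turn the scores into probabilities
logits = util.cos_sim(img_embeddings[0:1], label_embeddings) * 100  # 100 = assumed CLIP-style scaling
probs = torch.softmax(logits, dim=-1)[0]
for label, prob in zip(labels, probs):
    print(f'{label}: {prob.item():.3f}')

The label with the highest probability is the zero-shot prediction for that image.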
Multilingual Image Search – Demo
For an interactive demonstration, you can try out the multilingual image search demo on GitHub or access the Colab version.
Troubleshooting
If you run into issues during installation or execution, here are a few troubleshooting tips:
- Ensure you are running a Python version supported by `sentence-transformers`; compatibility issues can arise with older interpreters.
- Double-check that you’ve installed the necessary dependencies such as `sentence-transformers` and `torch` (a quick version check is shown after this list).
- If images don’t load, confirm that the URLs are correct and accessible.
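If you are unsure which versions are installed, a quick sanity check is to print them from Python. This minimal snippet only assumes the packages from the install step above are present:

import sys
import torch
import sentence_transformers

print('Python:', sys.version.split()[0])
print('torch:', torch.__version__)
print('sentence-transformers:', sentence_transformers.__version__)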
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
By utilizing the multilingual version of the CLIP-ViT-B-32 model, you can effectively bridge the gap between images and text across languages. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

