How to Use the Finetuned DistilBERT Model with CLIP for Text Encoding in Turkish

Sep 10, 2024 | Educational

This guide walks you through implementing a finetuned version of the dbmdz/distilbert-base-turkish-cased model as a text encoder alongside OpenAI’s CLIP image encoder, specifically the ViT-B/32 variant. This setup is particularly useful for Turkish text, letting you encode texts and images into a shared vector space and score how well they match.

Getting Started

Before you dive into the code, ensure you have the necessary libraries installed (a sample install command is sketched after the list):

  • Transformers
  • TensorFlow
  • NumPy
  • Pillow for image processing
  • PyTorch and OpenAI’s clip package for handling the CLIP model
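
If any of these are missing, a typical install (illustrative commands; pin versions as needed for your environment) looks like this, with OpenAI’s clip package installed straight from GitHub:

pip install transformers tensorflow numpy pillow torch
pip install git+https://github.com/openai/CLIP.git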

Setting Up the Code

The following Python script demonstrates how to set up the model:

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf
import numpy as np
from PIL import Image
import torch
import clip

model_name = "mys/distilbert-base-turkish-cased"
base_model = TFAutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Finetuned projection head that maps pooled DistilBERT embeddings into CLIP's vector space
head_model = tf.keras.models.load_model("clip_head.h5")

def encode_text(base_model, tokenizer, head_model, texts):
    # Tokenize the Turkish texts into IDs and attention masks as TensorFlow tensors
    tokens = tokenizer(texts, padding=True, return_tensors="tf")
    # Last hidden state from DistilBERT: (batch, seq_len, hidden)
    embs = base_model(**tokens)[0]
    # Mean-pool over tokens, using the attention mask to ignore padding positions
    attention_masks = tf.cast(tokens["attention_mask"], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    base_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)
    # Project the pooled embedding into CLIP space and L2-normalize it
    clip_embs = head_model(base_embs)
    clip_embs = clip_embs / tf.norm(clip_embs, axis=-1, keepdims=True)
    return clip_embs

# Turkish captions mapped to the local image files they describe
demo_images = {
    "bilgisayarda çalışan bir insan": "myspc.jpeg",   # "a person working on a computer"
    "sahilde bir insan ve bir heykel": "mysdk.jpeg"   # "a person and a statue on the beach"
}

clip_model, preprocess = clip.load("ViT-B/32")
images = {key: Image.open(value) for key, value in demo_images.items()}
# Apply CLIP's preprocessing to each image and stack them into a single batch tensor
img_inputs = torch.stack([preprocess(image).to("cpu") for image in images.values()])

with torch.no_grad():
    image_embs = clip_model.encode_image(img_inputs).float().to("cpu")

# L2-normalize the image embeddings and convert to NumPy for the similarity computation
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
image_embs = image_embs.detach().numpy()

text_embs = encode_text(base_model, tokenizer, head_model, list(images.keys())).numpy()

# Dot-product similarity between every image and every text, softmaxed over the texts
similarities = image_embs @ text_embs.T
logits = tf.nn.softmax(tf.convert_to_tensor(similarities), axis=-1).numpy()
idxs = np.argmax(logits, axis=-1).tolist()

for i, (key, value) in enumerate(demo_images.items()):
    print(f"path: {value}, true label: {key}, prediction: {list(demo_images.keys())[idxs[i]]}, score: {logits[i, idxs[i]]}")

Understanding the Code: An Analogy

To help you understand the above code better, let’s use an analogy. Imagine a library (the DistilBERT model) full of books (Turkish words and sentences) that need to be summarized (encoded). The librarian (the tokenizer) organizes the books into categories (tokens) so they are easier to process. Once organized, the librarian hands the categories to a highly skilled editor (base_model), who reads through the material and produces a coherent overview (embeddings). That overview is then passed to a publisher (head_model), who formats it in a specific style (the CLIP vector space). Finally, we compare these summaries with the encoded images to find out which pairs match best.
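
To tie the analogy back to concrete tensors, here is a minimal sketch that reuses the objects loaded above. The example sentence is arbitrary, and the shapes assume DistilBERT's 768-dimensional hidden states and ViT-B/32's 512-dimensional CLIP space:

texts = ["bilgisayarda çalışan bir insan"]  # "a person working on a computer"
tokens = tokenizer(texts, padding=True, return_tensors="tf")      # the librarian: token IDs + attention mask
token_embs = base_model(**tokens)[0]                              # the editor: per-token embeddings, (1, seq_len, 768)
clip_vec = encode_text(base_model, tokenizer, head_model, texts)  # the publisher: unit-length CLIP vector, (1, 512)
print(token_embs.shape, clip_vec.shape)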

How to Run the Code

Now that you understand the structure and function of the code, here’s how to execute it:

  1. Copy the code into a Python script.
  2. Download the required model files and images as specified.
  3. Run the script from your terminal/cmd using Python (an example command is sketched below).
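
For instance, if you saved the script as clip_turkish_demo.py (a hypothetical filename) in the same directory as clip_head.h5 and the two demo images, the run command would simply be:

python clip_turkish_demo.py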

Troubleshooting

If you encounter any issues during your setup or execution, consider these troubleshooting tips:

  • Double-check that all required libraries are installed and up to date (a quick version check is sketched after this list).
  • Ensure your model paths and image paths are correctly specified.
  • If you face compatibility issues, review the documentation for both the Transformers library and CLIP for any discrepancies.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
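
As a quick sanity check for the first tip above, the following snippet (a sketch that only prints versions and changes nothing) shows which releases your environment is actually using:

import numpy, torch, tensorflow, transformers
print("transformers:", transformers.__version__)
print("tensorflow:", tensorflow.__version__)
print("torch:", torch.__version__)
print("numpy:", numpy.__version__)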

Conclusion

By utilizing the finetuned DistilBERT model alongside CLIP, you can achieve seamless text-to-image matching for Turkish language inputs. This opens up new possibilities for applications in various fields such as natural language processing, image analysis, and more.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
