How to Use the Hindi Image Captioning Model

Sep 12, 2024 | Educational

Welcome to the world of AI and image captioning! In this guide, we walk through the steps for using an encoder-decoder image captioning model that pairs a Vision Transformer (ViT) encoder with a GPT2-Hindi decoder. The model is designed to generate descriptive Hindi captions for images from the Flickr8k dataset.

Understanding the Model

Imagine a talented artist who looks at a picture and is immediately inspired to write a poem about it. In our scenario, the artist is replaced by a model composed of two parts: a Vision Transformer (ViT) that analyzes the image's visual content, and a GPT2-Hindi model that writes the caption from what the ViT extracted.

  • Encoder: ViT – It examines and captures the features of the image.
  • Decoder: GPT2-Hindi – It generates a meaningful and coherent Hindi caption based on the features provided by the encoder.

Together, the two parts turn an image into a Hindi description: the decoder attends to the encoder's image features while it generates the caption one token at a time.
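To make the encoder side more concrete, here is a rough sketch of how many feature vectors the ViT hands to the decoder for one image. It assumes the `google/vit-base-patch16-224` checkpoint used later in this guide: the patch size (16) and input size (224) come from the checkpoint name, and 768 is the ViT-Base hidden size. The function name is ours, for illustration only.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens the ViT encoder outputs for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 224 // 16 = 14
    num_patches = patches_per_side ** 2          # 14 * 14 = 196
    return num_patches + 1                       # plus one [CLS] token


HIDDEN_SIZE = 768  # ViT-Base hidden size

seq_len = vit_sequence_length()
print(f"Encoder output per image: {seq_len} vectors of size {HIDDEN_SIZE}")
# prints "Encoder output per image: 197 vectors of size 768"
```

These 197 vectors are what the GPT2-Hindi decoder consults while it writes the caption.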

Setting Up Your Environment

Before diving into coding, ensure you have everything set up correctly. You will need:

  • Python environment
  • Access to the internet for model downloads
  • The required libraries: PyTorch, Pillow (PIL), requests, and HuggingFace Transformers
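As a quick sanity check before running anything heavier, something like the following could confirm that the libraries are importable. This is a minimal sketch using only the standard library; the helper name `find_missing_modules` is hypothetical, and the module names are the import names used in the code in this guide (Pillow is imported as `PIL`).

```python
import importlib.util

# Import names, not pip package names: Pillow is imported as "PIL".
REQUIRED_MODULES = ("torch", "PIL", "transformers", "requests")


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported missing, install it with your usual package manager before moving on.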

How to Use the Model

Here’s a step-by-step guide to using the model to caption an image from the Flickr8k dataset:

```python
import torch
import requests
from PIL import Image
from transformers import ViTFeatureExtractor, AutoTokenizer, VisionEncoderDecoderModel
# Note: recent Transformers releases deprecate ViTFeatureExtractor in favor of ViTImageProcessor.

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the image and make sure it has three color channels
url = "https://shorturl.at/fvxEQ"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Define model checkpoints
encoder_checkpoint = "google/vit-base-patch16-224"
decoder_checkpoint = "surajp/gpt2-hindi"
model_checkpoint = "team-indain/image-captioning"

# Load feature extractor, tokenizer, and model
feature_extractor = ViTFeatureExtractor.from_pretrained(encoder_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(decoder_checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device)

# Inference: preprocess the image, generate token IDs, then decode them to text
sample = feature_extractor(image, return_tensors="pt").pixel_values.to(device)
# Strip GPT-2's end-of-text marker and keep only the first line of the output
clean_text = lambda x: x.replace("<|endoftext|>", "").split("\n")[0]
caption_ids = model.generate(sample, max_length=50)[0]
caption_text = clean_text(tokenizer.decode(caption_ids))

print(caption_text)
```
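The cleanup step at the end of the code above is easy to try in isolation. The sketch below assumes the decoder marks the end of a caption with GPT-2's default end-of-text token, `<|endoftext|>`; the helper name `clean_caption` and the sample string are made up for illustration and are not real model output.

```python
END_OF_TEXT = "<|endoftext|>"  # assumed end-of-text marker (GPT-2's default)


def clean_caption(decoded: str, eos_token: str = END_OF_TEXT) -> str:
    """Drop the end-of-text marker and keep only the first line."""
    return decoded.replace(eos_token, "").split("\n")[0].strip()


# Made-up decoder output, only to exercise the helper
raw = "एक कुत्ता घास पर दौड़ रहा है<|endoftext|>\nअतिरिक्त पाठ"
print(clean_caption(raw))
```

Keeping only the first line guards against the decoder continuing to generate text after the caption is finished.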

Training Data

This model was trained on the Flickr8k Hindi Dataset, a translated version of the original Flickr8k dataset that is available on Kaggle. The dataset pairs images with Hindi captions, which makes it a good fit for training this model.

Training Procedure

This model was trained during the HuggingFace course community week, organized by HuggingFace. Training ran on a Kaggle GPU.

Training Parameters

  • Epochs: 8
  • Batch Size: 8
  • Mixed Precision: Enabled
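To get a feel for what these settings imply, the sketch below records them in a small config object and estimates the number of optimizer steps in a run. It assumes one optimizer step per batch (no gradient accumulation), and the example count is a placeholder rather than the authors' actual training split size. `TrainingConfig` and `total_optimizer_steps` are hypothetical names, not part of any library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    epochs: int = 8               # from the list above
    batch_size: int = 8           # from the list above
    mixed_precision: bool = True  # from the list above


def total_optimizer_steps(num_examples: int, config: TrainingConfig) -> int:
    """Steps over the whole run, assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_examples / config.batch_size)
    return steps_per_epoch * config.epochs


config = TrainingConfig()
HYPOTHETICAL_NUM_EXAMPLES = 6000  # placeholder, not the authors' real split size
print(total_optimizer_steps(HYPOTHETICAL_NUM_EXAMPLES, config))  # prints 6000
```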

Meet the Team Behind the Model

  • [Sean Benhur](https://www.linkedin.com/in/seanbenhur)
  • [Herumb Shandilya](https://www.linkedin.com/in/herumb-s-740163131)

Troubleshooting

If you encounter issues while implementing the model, consider the following troubleshooting ideas:

  • Ensure all libraries are correctly installed and compatible with your version of Python.
  • Check that your image URL is valid and accessible.
  • Verify your device configuration; make sure PyTorch recognizes CUDA if you’re using GPU.
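The URL and device checks above can be scripted. The following is a minimal sketch with hypothetical helper names: it only checks that a URL is well formed (it does not confirm the image is reachable), and it falls back to the CPU when PyTorch is not installed or does not see a GPU.

```python
from urllib.parse import urlparse


def looks_like_http_url(url: str) -> bool:
    """True if `url` has an http(s) scheme and a host; does not test reachability."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(looks_like_http_url("https://shorturl.at/fvxEQ"))  # prints True
print(pick_device())
```

If `pick_device()` reports `cpu` on a machine with a GPU, the installed PyTorch build probably lacks CUDA support.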


