Multimodal Prompting with Missing Modalities for Visual Recognition

Aug 19, 2021 | Data Science

Welcome to our deep dive into the groundbreaking work titled “Multimodal Prompting with Missing Modalities for Visual Recognition,” presented at CVPR 2023. This project addresses the challenges faced in multimodal learning, particularly when certain modalities are absent during real-world applications. In this user-friendly guide, we will walk through the setup, usage, evaluation, and troubleshooting of the official PyTorch implementation.

Introduction

The paper tackles two significant challenges: handling modalities that go missing during both training and testing, and coping with limited computational resources that rule out fully fine-tuning heavy transformer models. To address both, the authors introduce modality-missing-aware prompts that integrate seamlessly into multimodal transformers. The prompts account for less than 1% of the parameters that full fine-tuning would update, making the approach highly efficient.
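As a rough, framework-free sketch of the core idea (all names and sizes here are invented for illustration, not taken from the repo): a small pool of learnable prompts is kept, one per missing-modality case, and the prompt matching the current case is prepended to the transformer's input sequence:

```python
import random

HIDDEN = 8       # toy embedding width (the real model uses the transformer's hidden size)
PROMPT_LEN = 2   # toy prompt length

# One learnable prompt per missing-modality case:
# complete input, text missing, image missing.
prompt_pool = {
    case: [[random.random() for _ in range(HIDDEN)] for _ in range(PROMPT_LEN)]
    for case in ("complete", "missing_text", "missing_image")
}

def prepend_prompt(tokens, case):
    """Pick the prompt for the missing-modality case and prepend it."""
    return prompt_pool[case] + tokens

# A toy input sequence of 3 token embeddings.
tokens = [[0.0] * HIDDEN for _ in range(3)]
out = prepend_prompt(tokens, "missing_text")
print(len(out))  # PROMPT_LEN + 3 = 5
```

The key point is that only the small `prompt_pool` tensors would be trained; the backbone producing `tokens` stays frozen.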

Model Illustration

(Figure omitted here: an overview of the modality-missing-aware prompt framework; see the paper or the official repository for the illustration.)

Usage

Environment

To get started, ensure you have the following prerequisites:

  • Python = 3.7.13
  • PyTorch = 1.10.0
  • CUDA = 11.3

For other requirements, run:

pip install -r requirements.txt
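After installing, a quick interpreter check (a minimal sanity check of our own, not something shipped with the repo) can confirm the stated Python requirement before anything heavier is imported:

```python
import sys

# The repo pins Python 3.7.13; at minimum, confirm the interpreter is 3.7+.
print(sys.version_info[:2] >= (3, 7))  # True on a compatible interpreter
```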

Prepare Dataset

We utilize three distinct vision-and-language datasets:

  • MM-IMDb
  • UPMC Food-101
  • Hateful Memes

Download the datasets manually and employ pyarrow for serialization. The conversion scripts can be found in vilt/utils/write_*.py. Organize the datasets as described in DATA.md; otherwise you may need to modify the write_* scripts to match your dataset paths. To create the pyarrow binary files, execute:

python make_arrow.py --dataset [DATASET] --root [YOUR_DATASET_ROOT]
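If you are converting all three datasets, a small loop saves typing. The /data/<dataset> paths below are hypothetical placeholders, and the echo makes this a dry run that only prints each command (drop the echo to actually execute them):

```shell
# Dry-run loop over the three dataset names used by the repo's task flags.
# Replace /data/$ds with your actual dataset roots before running for real.
for ds in mmimdb food101 hatememes; do
  echo python make_arrow.py --dataset "$ds" --root "/data/$ds"
done
```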

Evaluation

To evaluate your model, run the following script:

python run.py --data_root=ARROW_ROOT \
    --num_gpus=NUM_GPUS --num_nodes=NUM_NODES \
    --per_gpu_batchsize=BS_FITS_YOUR_GPU \
    --task=[task_finetune_mmimdb | task_finetune_food101 | task_finetune_hatememes] \
    --load_path=MODEL_PATH --exp_name=EXP_NAME \
    --prompt_type=PROMPT_TYPE \
    --test_ratio=TEST_RATIO --test_type=TEST_TYPE \
    --test_only=True
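The --test_ratio and --test_type flags configure the missing-modality setting at evaluation time: roughly, what fraction of test samples are affected and which modality is dropped. A toy, repo-independent illustration of that masking (the function below is our own sketch, not the repo's code):

```python
import random

def mask_modalities(samples, test_ratio, test_type, seed=0):
    """Toy sketch: drop one modality from a test_ratio fraction of samples."""
    rng = random.Random(seed)
    masked = []
    for text, image in samples:
        if rng.random() < test_ratio:
            if test_type == "text":
                text = None
            elif test_type == "image":
                image = None
        masked.append((text, image))
    return masked

samples = [("a caption", "pixels")] * 4
out = mask_modalities(samples, test_ratio=1.0, test_type="text")
print(sum(t is None for t, _ in out))  # 4: every text dropped at ratio 1.0
```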

Training

To begin training, follow these steps:

  1. Download the pre-trained ViLT model weights linked in the original repository.
  2. Start training with the command:

python run.py --data_root=ARROW_ROOT \
    --num_gpus=NUM_GPUS --num_nodes=NUM_NODES \
    --per_gpu_batchsize=BS_FITS_YOUR_GPU \
    --task=[task_finetune_mmimdb | task_finetune_food101 | task_finetune_hatememes] \
    --load_path=PRETRAINED_MODEL_PATH --exp_name=EXP_NAME
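During fine-tuning, only the prompts (plus a small task head) are updated while the pre-trained backbone stays frozen, which is where the under-1% figure comes from. A back-of-the-envelope check using illustrative sizes (these are rough, not the actual ViLT configuration):

```python
# Illustrative sizes only, not the actual ViLT configuration.
backbone_params = 110_000_000              # frozen transformer backbone (rough base-model scale)
prompt_len, hidden, n_cases = 16, 768, 3   # prompt tokens per missing-modality case
prompt_params = prompt_len * hidden * n_cases  # the only trained tensors in this sketch

ratio = prompt_params / (backbone_params + prompt_params)
print(ratio < 0.01)  # True: well under 1% of parameters are trainable
```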

Troubleshooting

If you encounter issues during setup or execution, consider the following troubleshooting tips:

  • Make sure all dependencies are correctly installed. Rerun pip install -r requirements.txt to ensure this.
  • Double-check that your dataset paths are correctly specified in the scripts.
  • If you face runtime errors related to GPU resources, verify that your batch size is appropriate for your GPU’s capacity.
  • Should you have further questions or wish to collaborate on AI projects, feel free to reach out to fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
