The Neural Visual Question Answering (Neural VQA) model combines computer vision with natural language processing to answer questions about images. It is based on the VIS+LSTM architecture described in the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros, and Richard Zemel. This guide walks you through setting up and using the model.
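At a high level, VIS+LSTM treats the image as the first token of the question: the 4096-dimensional fc7 feature vector from VGG-19 is projected into the word-embedding space, prepended to the embedded question words, and the LSTM's final state is classified over the answer vocabulary. The sketch below is a minimal illustration of that flow, not the repo's actual code; it assumes the Element-Research rnn package, and all dimensions and indices are made up.

```lua
-- Minimal sketch of the VIS+LSTM forward pass (illustrative only).
require 'nn'
require 'rnn'  -- assumes the Element-Research rnn package is installed

local fc7_dim, embed_dim, rnn_size = 4096, 512, 512
local vocab_size, num_answers = 10000, 1000

local image_proj = nn.Linear(fc7_dim, embed_dim)         -- fc7 -> embedding space
local word_embed = nn.LookupTable(vocab_size, embed_dim)
local lstm = nn.SeqLSTM(embed_dim, rnn_size)             -- expects seqLen x batch x dim
local classify = nn.Sequential()
  :add(nn.Linear(rnn_size, num_answers))
  :add(nn.LogSoftMax())

-- one image plus a three-word question
local fc7 = torch.randn(1, fc7_dim)
local question = torch.LongTensor{{5}, {42}, {7}}        -- seqLen x batch

local img_tok = image_proj:forward(fc7):view(1, 1, embed_dim)
local word_toks = word_embed:forward(question)           -- 3 x 1 x embed_dim
local seq = torch.cat(img_tok, word_toks, 1)             -- image token goes first
local states = lstm:forward(seq)                         -- 4 x 1 x rnn_size
local scores = classify:forward(states[states:size(1)]) -- log-probs over answers
```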
Setup Requirements
Start by downloading the MSCOCO training and validation images, as well as the VQA data:
sh data/download_data.sh
Then extract all downloaded zip files into the data folder:
unzip Annotations_Train_mscoco.zip
unzip Questions_Train_mscoco.zip
unzip train2014.zip
unzip Annotations_Val_mscoco.zip
unzip Questions_Val_mscoco.zip
unzip val2014.zip
If you already have these files downloaded, copy the train2014 and val2014 image folders and the VQA JSON files to the data folder. Next, download the VGG-19 Caffe model and prototxt:
sh models/download_models.sh
Known Issues
While setting this up, you might encounter some issues. To avoid memory problems with LuaJIT, make sure to install Torch with Lua 5.1:
TORCH_LUA_VERSION=LUA51 ./install.sh
For additional guidance, refer to the Torch installation documentation. If you are using plain Lua, luaffifb may be needed for loadcaffe unless you are working with pre-extracted fc7 features.
Usage Instructions
Extracting Image Features
To extract image features, run:
th extract_fc7.lua -split train
th extract_fc7.lua -split val
Both commands accept options such as -batch_size (default 10) and -split (train or val), and you can also specify whether to use the GPU.
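Under the hood, extraction loads the VGG-19 Caffe model with loadcaffe and reads off the 4096-dimensional fc7 activations. Below is a rough sketch of that step for a single image; the file paths and the number of layers stripped from the classifier head are assumptions based on the standard VGG-19 deploy prototxt, and extract_fc7.lua handles batching, preprocessing, and saving for you.

```lua
-- Illustrative fc7 extraction for a single image (not the repo's code).
require 'loadcaffe'
require 'image'

local net = loadcaffe.load('models/VGG_ILSVRC_19_layers_deploy.prototxt',
                           'models/VGG_ILSVRC_19_layers.caffemodel', 'nn')
net:evaluate()

-- Drop the classifier head (prob, fc8, dropout) so :forward() returns fc7;
-- the exact number of removals depends on the deploy prototxt.
for _ = 1, 3 do net:remove() end

local img = image.scale(image.load('data/train2014/example.jpg', 3, 'float'), 224, 224)
-- Real preprocessing also mean-subtracts and reorders RGB -> BGR.
local fc7 = net:forward(img:view(1, 3, 224, 224) * 255)  -- 1 x 4096
```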
Training the Model
To train the model, execute:
th train.lua
Parameter options include the following; an example invocation is shown after the list:
- rnn_size: Size of LSTM internal state (default is 512)
- num_layers: Number of layers in the LSTM
- embedding_size: Size of word embeddings (default is 512)
- learning_rate: Learning rate (default is 4e-4)
- batch_size: Batch size (default is 64)
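For instance, to train a smaller model you can override the defaults on the command line (using only the flags listed above):
th train.lua -rnn_size 256 -embedding_size 256 -batch_size 32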
Testing the Model
To test the model and get predictions, run:
th predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question "What is the cat on?"
This command prints the model's predicted answer for the given image and question.
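Conceptually, prediction loads the checkpoint, runs the network on the image and question, and maps the highest-scoring index back to an answer string through the answer vocabulary. The snippet below is a hedged sketch of that last step; the vocabulary format is an assumption, not the repo's actual schema.

```lua
-- Illustrative only: mapping a predicted index back to an answer string.
require 'torch'

local checkpoint = torch.load('checkpoints/vqa_epoch23.26_0.4610.t7')
local a_vocab = torch.load('answers_vocab.t7')  -- assumed: index -> answer string

-- ... run the model from `checkpoint` on the image + question to get
-- `scores`, a 1D tensor of log-probabilities over the answers ...
local scores = torch.randn(1000)  -- placeholder for the model output
local _, idx = scores:max(1)
print('Predicted answer: ' .. tostring(a_vocab[idx[1]]))
```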
Sample Predictions
Here are some sample questions along with the answers predicted by the VQA model for their corresponding images:
- Q: What animals are those? A: Sheep
- Q: What color is the frisbee that's upside down? A: Red
- Q: What is flying in the sky? A: Kite
Implementation Details
- Uses image features from the last hidden layer (fc7) of VGG-19.
- Question sequences are zero-padded for the batched implementation (see the padding sketch after this list).
- Training questions are filtered to those whose answers fall among the top-n most frequent answers, with top_n = 1000 by default.
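To make the padding concrete, here is a minimal illustration with made-up word indices; the repo's data loader does this internally, and padding index 0 is then masked during lookup (for example with LookupTableMaskZero from the rnn package).

```lua
-- Pad a batch of variable-length questions to a common length with zeros.
require 'torch'

local questions = { {12, 7, 95}, {4, 18}, {33, 2, 51, 9} }  -- made-up word indices

local max_len = 0
for _, q in ipairs(questions) do max_len = math.max(max_len, #q) end

local batch = torch.LongTensor(#questions, max_len):zero()
for i, q in ipairs(questions) do
  for j, w in ipairs(q) do batch[i][j] = w end
end
print(batch)  -- rows are padded with 0 where questions are shorter
```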
Pretrained Models and Data Files
To reproduce the results or experiment with your own image-question pairs, download the following:
- Model: vqa_epoch23.26_0.4610.t7
- Vocab Files: answers_vocab.t7 and questions_vocab.t7
- Data File: data.t7
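Once downloaded, a quick way to confirm the files deserialize is to load them from the Torch REPL; the paths below assume the files sit in your working directory.

```lua
-- Sanity check: verify the pretrained files load correctly.
require 'torch'

local model_ckpt = torch.load('vqa_epoch23.26_0.4610.t7')
local q_vocab    = torch.load('questions_vocab.t7')
local a_vocab    = torch.load('answers_vocab.t7')
local data       = torch.load('data.t7')
print('All files loaded OK')
```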
Troubleshooting
If you run into issues during installation or execution, here are some tips:
- Ensure that all required files are properly downloaded and extracted.
- If you hit compatibility issues, verify that Torch was installed with Lua 5.1 (see Known Issues above).
- If you run out of memory, reduce the batch size or run on the CPU instead of the GPU.