
Visual Question Answering in Keras and Tensorflow Backend (by Fenil Doshi)

Implementation of the VQA paper. Website for Visual Question Answering: http://visualqa.org/

Problem Description

Given an image and a natural language question, the task is to produce a natural language answer. The image is encoded as a 4096-dimensional vector by passing it through a VGG model. We remove the last 2 layers of the network so that its output is the 4096-dimensional activation of a fully connected layer rather than class scores. The question can be encoded in 2 ways:

Bag of Words
Using Recurrent Neural Networks

Once the question is encoded, the encoded image and the encoded question are merged and passed through a feed-forward deep network. Finally, the answer is predicted as one of 1000 classes, since only the 1000 most frequently occurring answers are taken into account.
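Restricting the answers to the most frequent ones turns the task into classification over a fixed set of classes. The sketch below shows one way to build that answer vocabulary. The function names `build_answer_vocab` and `filter_examples` and the toy question/answer pairs are illustrative assumptions, not code from this repository, and `top_k` is set to 2 here instead of 1000 to keep the example small.

```python
from collections import Counter


def build_answer_vocab(answers, top_k=1000):
    """Map the top_k most frequent answers to class indices 0..top_k-1."""
    counts = Counter(answers)
    # most_common() sorts answers by descending frequency.
    most_frequent = [answer for answer, _ in counts.most_common(top_k)]
    return {answer: index for index, answer in enumerate(most_frequent)}


def filter_examples(examples, answer_to_index):
    """Keep only (question, answer) pairs whose answer is in the vocabulary,
    replacing each answer by its class index."""
    return [
        (question, answer_to_index[answer])
        for question, answer in examples
        if answer in answer_to_index
    ]


# Toy data, made up for illustration only.
examples = [
    ("is it raining?", "no"),
    ("what color is the cat?", "black"),
    ("is the man smiling?", "yes"),
    ("is this a kitchen?", "yes"),
    ("how many dogs?", "2"),
    ("is the door open?", "no"),
    ("is it daytime?", "yes"),
]

answer_to_index = build_answer_vocab([a for _, a in examples], top_k=2)
training_pairs = filter_examples(examples, answer_to_index)

print(answer_to_index)      # {'yes': 0, 'no': 1}
print(len(training_pairs))  # 5
```

Examples whose answer falls outside the vocabulary are dropped, so the classifier never sees a target it cannot represent.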

Requirements

Tensorflow
Keras
scipy
spacy
sklearn
numpy
nltk
NVIDIA CUDA

Download the English GloVe vectors used with spaCy from https://nlp.stanford.edu/projects/glove/
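As a rough illustration of how word vectors feed the bag-of-words question encoding, the sketch below parses the plain-text GloVe format (one word per line, followed by its space-separated vector components) and sums the vectors of the words in a question. The helper names, the 3-dimensional sample vectors, and the choice of summing rather than averaging are assumptions made for this example. The repository itself accesses the vectors through spaCy, which may tokenize and combine them differently.

```python
import io


def load_glove(lines):
    """Parse GloVe text lines of the form 'word v1 v2 ... vd' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def encode_bag_of_words(question, vectors, dim):
    """Sum the vectors of all known words; unknown words contribute nothing."""
    encoding = [0.0] * dim
    for token in question.lower().rstrip("?").split():
        for i, value in enumerate(vectors.get(token, [])):
            encoding[i] += value
    return encoding


# Hypothetical 3-dimensional vectors standing in for a real GloVe file.
sample_file = io.StringIO(
    "what 0.5 -0.25 1.0\n"
    "color 0.25 0.5 0.0\n"
    "is 0.0 0.25 -0.5\n"
)

glove = load_glove(sample_file)
question_vector = encode_bag_of_words("What color is the cat?", glove, dim=3)
print(question_vector)  # [0.75, 0.5, 0.5]
```

Because word order is discarded, "is the cat black" and "the black cat is" produce the same encoding, which is the main limitation the recurrent encoder addresses.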

Dataset

Dataset download link: http://visualqa.org/download.html

For more information on data preprocessing, check out the data folder in this directory.

To get started

Instructions are given in the README file of each folder.

Results

The 2-layer stacked GRU + CNN model converged faster than the corresponding LSTM model on the same training set. I also found that the SGD optimizer worked better than RMSProp with plain LSTMs/GRUs, whereas RMSProp worked better with a Time Distributed layer. The GRU with the Time Distributed layer gave very low accuracy.

The models can be improved much further by training them on the entire dataset for 100 or more epochs on a better GPU (Tesla or GTX 1080). Overfitting can be reduced further by using dropout and regularization.

Validation Accuracy of LSTM + CNN = 33.77 %
Validation Accuracy of GRU + CNN = 34.4 %
Validation Accuracy of LSTM + Time Distributed Layer + CNN = 34.3 %

These accuracies were obtained by training on just 10,000 examples. Accuracy can be improved by training on a larger set every epoch and on a better GPU. The models were trained on an NVIDIA GTX 960M, which took around 3 hours to train on 10,000 images for 100 epochs.
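The reported timing gives a rough per-example cost, which can be used for a back-of-the-envelope estimate of larger runs. The sketch below does that arithmetic. The function names are illustrative, the training-set size of 200,000 is a hypothetical value rather than the actual dataset size, and the extrapolation assumes training time scales linearly with the number of examples on the same hardware.

```python
def seconds_per_example(total_hours, num_examples, num_epochs):
    """Average wall-clock seconds spent per example per epoch."""
    return total_hours * 3600 / (num_examples * num_epochs)


def estimated_hours(num_examples, num_epochs, per_example_seconds):
    """Linear extrapolation of total training time, in hours."""
    return num_examples * num_epochs * per_example_seconds / 3600


# Figures reported above: about 3 hours, 10,000 images, 100 epochs.
per_example = seconds_per_example(total_hours=3, num_examples=10_000, num_epochs=100)
per_epoch = per_example * 10_000

print(f"{per_example * 1000:.1f} ms per example")  # 10.8 ms per example
print(f"{per_epoch:.0f} s per epoch")              # 108 s per epoch

# Hypothetical larger training set; NOT the real dataset size.
hypothetical_examples = 200_000
hours = estimated_hours(hypothetical_examples, 100, per_example)
print(f"{hours:.0f} hours for {hypothetical_examples} examples")  # 60 hours for 200000 examples
```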

Some Improvements that can be made

Better hyperparameter tuning
Dropout
Regularization
Using an RNN decoder to generate answers with temporal semantics
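To make the dropout and regularization items concrete, here is a minimal framework-free sketch of inverted dropout and an L2 weight penalty. It is meant only to show what the two techniques compute. In the actual models one would use the equivalent Keras layers and regularizers, and the function names, rate, and strength values below are assumptions chosen for illustration.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout: while training, zero each unit with probability
    `rate` and scale the survivors by 1 / (1 - rate); otherwise pass the
    activations through unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]


def l2_penalty(weights, strength):
    """L2 regularization term added to the loss: strength * sum(w^2)."""
    return strength * sum(w * w for w in weights)


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]

dropped = dropout(hidden, rate=0.5, training=True, rng=rng)
unchanged = dropout(hidden, rate=0.5, training=False, rng=rng)
penalty = l2_penalty([0.5, -1.0, 2.0], strength=0.01)

print(dropped)    # each entry is either 0.0 or twice the original value
print(unchanged)  # [1.0, 2.0, 3.0, 4.0]
print(penalty)
```

Scaling the surviving units during training keeps the expected activation the same as at inference time, so no rescaling is needed when dropout is switched off.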
