The image features are extracted using the bottom-up-attention strategy, with each image represented as a dynamic number (k=[10,100]) of 2048-D features. The features for each image are stored in a .npz file. You can prepare the visual features yourself or download the pre-extracted features from OneDrive or BaiduYun. The download contains three files (train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz), corresponding to the features of the train/val/test images of VQA-v2, respectively. Run the following commands to unzip the features:
$ mkdir data/vqa/bua-r101-max100
$ tar -xzvf train2014.tar.gz -C data/vqa/bua-r101-max100/
$ tar -xzvf val2014.tar.gz -C data/vqa/bua-r101-max100/
$ tar -xzvf test2015.tar.gz -C data/vqa/bua-r101-max100/

Then download the QA files for VQA-v2. In addition, we use the VQA samples from the Visual Genome (VG) dataset to expand the training set. Following existing strategies, we preprocess the VG samples with two rules:
- Select the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
- Select the QA pairs whose answers appear in the processed answer list (i.e., answers that occur more than 8 times among all VQA-v2 answers).
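The two rules above can be sketched as a small filter. This is an illustrative sketch only: the field names (`coco_id`, `answer`) are assumptions, not the repo's actual VG schema.

```python
# Hypothetical sketch of the two VG filtering rules; the field names
# ("coco_id", "answer") are assumptions, not the repo's real schema.
from collections import Counter

def filter_vg_pairs(vg_pairs, coco_image_ids, vqa_answers, min_count=8):
    """Keep VG QA pairs whose image is in the MSCOCO train/val splits and
    whose answer occurs more than `min_count` times among VQA-v2 answers."""
    answer_counts = Counter(vqa_answers)
    kept = []
    for pair in vg_pairs:
        # Rule 1: the image must appear in the MSCOCO train/val splits.
        if pair["coco_id"] not in coco_image_ids:
            continue
        # Rule 2: the answer must occur more than min_count times in VQA-v2.
        if answer_counts[pair["answer"]] <= min_count:
            continue
        kept.append(pair)
    return kept
```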
For convenience, we provide our processed VG question and annotation files; you can download them from OneDrive or BaiduYun. Place all annotation files in the data/vqa/annotations folder.
Finally, the data folder will have the following structure:
|-- data
|-- vqa
|-- bua-r101-max100
| |-- train2014
| | |-- COCO_train2014_...jpg.npz
| | |-- ...
| |-- val2014
| | |-- COCO_val2014_...jpg.npz
| | |-- ...
| |-- test2015
| | |-- COCO_test2015_...jpg.npz
| | |-- ...
|-- annotations
| |-- v2_OpenEnded_mscoco_train2014_questions.json
| |-- v2_OpenEnded_mscoco_val2014_questions.json
| |-- v2_OpenEnded_mscoco_test2015_questions.json
| |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
| |-- v2_mscoco_train2014_annotations.json
| |-- v2_mscoco_val2014_annotations.json
| |-- VG_questions.json
| |-- VG_annotations.json
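Each .npz file above holds the features for one image. A quick way to sanity-check a downloaded file is to list the arrays it contains; the key name inside the archive is not specified here, so the sketch below simply enumerates whatever keys are present.

```python
# Sketch: inspect one extracted feature .npz. The key name inside the
# archive is unknown here, so we just enumerate arr.files.
import numpy as np

def inspect_features(npz_path):
    """Return {key: shape} for every array stored in a feature .npz."""
    with np.load(npz_path) as arr:
        # For a VQA feature file, expect a (k, 2048) matrix with k in [10, 100].
        return {key: arr[key].shape for key in arr.files}
```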
The image features for the VGD task are also extracted using the bottom-up-attention strategy, with each image represented as a fixed number (k=100) of 2048-D features from a pretrained Faster R-CNN model. When training the Faster R-CNN model, we exclude the images that overlap with RefCOCO/RefCOCO+/RefCOCOg to avoid contaminating the visual grounding datasets. As with VQA, the features for each image are stored in a .npz file. We provide the extracted features on OneDrive. After downloading the zipped files, run the following commands to place the features in the right location:
$ cat vgd-bua-fix100.tar.gz* | tar xz
$ mv vgd-bua-fix100 data/vgd/bua-r101-fix100

The annotation files for RefCOCO, RefCOCO+, and RefCOCOg can be downloaded from their original repository here. We provide the following scripts to preprocess them into our desired format:
$ python tools/ref_process.py
$ python tools/ref_process_plus.py
$ python tools/ref_process_g.py

Finally, the data folder will have the following structure:
|-- data
|-- vgd
|-- bua-r101-fix100
|-- refcoco
| |-- train.json
| |-- val.json
| |-- testA.json
| |-- testB.json
|-- refcoco+
| |-- train.json
| |-- val.json
| |-- testA.json
| |-- testB.json
|-- refcocog
| |-- train.json
| |-- val.json
| |-- test.json
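The `cat vgd-bua-fix100.tar.gz* | tar xz` step above works because the download is a split archive: concatenating the parts in name order reproduces the original gzip'd tar. A minimal Python equivalent, for environments without `cat`/`tar`:

```python
# Python equivalent of `cat <prefix>* | tar xz`: concatenate the split
# archive parts in name order, then extract the joined gzip'd tar.
import glob
import io
import tarfile

def extract_split_archive(prefix, dest="."):
    """Join every `prefix`* part file in sorted order and extract it."""
    parts = sorted(glob.glob(prefix + "*"))
    blob = b"".join(open(p, "rb").read() for p in parts)
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        tar.extractall(dest)
```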
Additionally, you need to build the extension in mmnas/utils as follows:
$ cd mmnas/utils
$ python3 setup.py build
$ cp build/lib.*/*.so .
$ cd ../..
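The `cp build/lib.*/*.so .` step moves the compiled extensions out of the setuptools build directory so they sit next to the Python sources. A sketch of the same copy in Python (the `build/lib.*` layout is standard setuptools behavior; the destination is assumed to be the current directory, as in the commands above):

```python
# Sketch of what `cp build/lib.*/*.so .` does: copy every compiled
# extension out of the setuptools build directory into `dest`.
import glob
import shutil

def copy_built_extensions(build_dir="build", dest="."):
    """Copy all *.so files from build/lib.*/ into `dest`; return the copies."""
    copied = []
    for so in glob.glob(f"{build_dir}/lib.*/*.so"):
        copied.append(shutil.copy(so, dest))
    return copied
```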
Following the strategy in SCAN, the image features for ITM are also extracted using the bottom-up-attention strategy, with each image represented as a fixed number (k=36) of 2048-D features. The features for each image are stored in a .npz file. We provide the extracted features on OneDrive. After downloading the zipped files, run the following commands to place the features in the right location:
$ cat itm-bua-fix36.tar.gz* | tar xz
$ mv itm-bua-fix36 data/itm/flickr_bua-r101-fix36

The annotation files of the Flickr30K dataset can be downloaded here and here to obtain the f30k_precomp folder and the dataset_flickr30k.json file, respectively.
Finally, the data folder will have the following structure:
|-- data
|-- itm
|-- flickr_bua-r101-fix36
|-- dataset_flickr30k.json
|-- f30k_precomp
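Because k is fixed at 36 for ITM (unlike the variable-length VQA features), per-image feature matrices can be stacked into a single batch tensor with no padding. A minimal sketch, assuming each matrix has already been loaded from its .npz file:

```python
# Because k is fixed at 36 for ITM, per-image (36, 2048) feature matrices
# stack into one (batch, 36, 2048) tensor without any padding.
import numpy as np

def batch_features(feature_arrays):
    """Stack equally-sized (36, 2048) feature matrices into one batch."""
    return np.stack(feature_arrays, axis=0)
```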