Code for the ICCE-Asia 2023 paper "Dual Encoding++: Optimization of Text-Video Retrieval via Fine-tuning and Pruning".
- Ubuntu: 20.04 LTS
- Python: 3.8
- PyTorch: 2.0
- CUDA: 11.7.1
- cuDNN: 8.6.0
MSR-VTT official split
- Train: 6,513 clips, 130,260 captions
- Validation: 497 clips, 9,940 captions
- Evaluation: 2,990 clips, 59,800 captions
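Each MSR-VTT clip comes with 20 captions, so the caption counts above follow directly from the clip counts. A quick sanity check:

```python
# Each MSR-VTT clip has 20 captions, so captions = clips * 20 for every split.
splits = {
    "train": (6513, 130260),
    "val": (497, 9940),
    "test": (2990, 59800),
}

for name, (clips, captions) in splits.items():
    assert captions == clips * 20, f"{name}: unexpected caption count"
    print(f"{name}: {clips} clips * 20 = {captions} captions")
```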
We used Miniconda to set up our deep learning workspace. After installing Miniconda, you can create the conda environment using conda_env_cuda.yaml.
conda env create --file conda_env_cuda.yaml
We used the same MSR-VTT video features as the original Dual Encoding: the concatenation of ResNeXt-101 and ResNet-152 features. Please download the pre-trained video features msrvtt10k.tar.gz (4.24 GB) from this Google Drive URL. After downloading it, extract the archive with the following command and place the resulting directory under data/. For more information, you can refer here.
tar -xzvf msrvtt10k.tar.gz
We also used pretrained word2vec embeddings trained on the English tags of 30M Flickr images, provided by this paper. Please download word2vec.tar.gz (3 GB) from this Google Drive URL (same URL as above). After downloading it, extract the archive with the following command and place the resulting directory under data/. For more information, you can refer here.
tar -xzvf word2vec.tar.gz
After these steps, the data/ directory structure should be as follows.
data
├── msrvtt10k
│   ├── FeatureData
│   │   └── resnext101-resnet152
│   │       ├── feature.bin
│   │       ├── id.txt
│   │       ├── shape.txt
│   │       └── video2frames.txt
│   └── TextData
│       ├── msrvtt10ktest.caption.txt
│       ├── msrvtt10ktrain.caption.txt
│       └── msrvtt10kval.caption.txt
└── word2vec
    ├── feature.bin
    ├── id.txt
    └── shape.txt
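To catch missing files before training, the layout above can be verified with a small script (a hypothetical helper, not part of the repo; run it from the repository root):

```python
from pathlib import Path

# Hypothetical helper (not part of the repo): check the expected data/ layout.
EXPECTED = [
    "msrvtt10k/FeatureData/resnext101-resnet152/feature.bin",
    "msrvtt10k/FeatureData/resnext101-resnet152/id.txt",
    "msrvtt10k/FeatureData/resnext101-resnet152/shape.txt",
    "msrvtt10k/FeatureData/resnext101-resnet152/video2frames.txt",
    "msrvtt10k/TextData/msrvtt10ktest.caption.txt",
    "msrvtt10k/TextData/msrvtt10ktrain.caption.txt",
    "msrvtt10k/TextData/msrvtt10kval.caption.txt",
    "word2vec/feature.bin",
    "word2vec/id.txt",
    "word2vec/shape.txt",
]

def missing_files(root="data"):
    """Return the expected files that are absent under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).is_file()]

missing = missing_files("data")
print("data/ layout OK" if not missing else f"missing {len(missing)} file(s): {missing}")
```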
- To create the vocabulary from the training dataset's captions, please run bash run.sh vocab. This creates vocab.json.
- To create tags (concept features), please run bash run.sh tags. This creates tag_vocab.json and video_tag.txt.
- To create word2vec embeddings, please run bash run.sh word2vec. This creates pretrained_weight.npy used for training.
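The generated pretrained_weight.npy is a plain NumPy array of shape (vocab_size, embedding_dim), one row per vocabulary word. As a sketch of how such a file round-trips (the array below is a fabricated stand-in with illustrative sizes; the real file comes from bash run.sh word2vec):

```python
import os
import tempfile

import numpy as np

# Fabricated stand-in for pretrained_weight.npy: a (vocab_size, embedding_dim)
# float32 matrix, one row per vocabulary word. Sizes here are illustrative only.
vocab_size, embedding_dim = 7, 500
weights = np.random.rand(vocab_size, embedding_dim).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "pretrained_weight_demo.npy")
np.save(path, weights)

loaded = np.load(path)
print(loaded.shape, loaded.dtype)  # (7, 500) float32
```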
- To train the model, please run bash run.sh train_hybrid_cuda. (Training automatically includes evaluation at the end.)
- To evaluate the model, please run bash run.sh test_hybrid_cuda $MODEL_PATH.
- For debug mode, first create captions for debugging with bash run.sh tiny, then run bash run.sh train_hybrid_cpu_debug.
You can check out all available commands in run.sh.
- Higher R@K and Sum R, and lower mean rank (mean r), indicate better performance.
- [B] shows considerable performance improvement compared to Dual Encoding.
- Final optimized models [E] and [F] have smaller sizes and better overall performance than Dual Encoding.
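For reference, these metrics can be derived from a caption-to-video similarity matrix as in the sketch below (a minimal illustration, not the repo's evaluation code; it assumes caption i's ground-truth match is video i):

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j] = similarity of caption i to video j; ground truth for caption i is video i."""
    order = np.argsort(-sim, axis=1)  # per caption, video indices ranked best-first
    # rank of the ground-truth video for each caption (1 = best)
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(sim.shape[0])])
    metrics = {f"R@{k}": float(np.mean(ranks <= k) * 100) for k in (1, 5, 10)}
    metrics["Sum R"] = sum(metrics[f"R@{k}"] for k in (1, 5, 10))
    metrics["mean r"] = float(ranks.mean())
    return metrics

# Toy 4x4 example
sim = np.array([
    [0.9, 0.1, 0.2, 0.1],  # correct video ranked 1st
    [0.8, 0.7, 0.1, 0.0],  # correct video ranked 2nd
    [0.1, 0.2, 0.9, 0.3],  # correct video ranked 1st
    [0.5, 0.4, 0.3, 0.2],  # correct video ranked 4th
])
print(retrieval_metrics(sim))
```

R@K is the percentage of captions whose correct video appears in the top K results, and Sum R adds R@1, R@5, and R@10.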

