Zero Shot Natural Language Temporal Video Grounding.

📑 Paper · 🌎 Project Page · 💻 Training Code

Official PyTorch implementation of ResidualViT for Efficient Temporally Dense Video Encoding, accepted at ICCV 2025 (highlight paper).
This repository provides the evaluation code for the Natural Language Temporal Video Grounding (NLTVG) task.

🚀 Installation

This repository uses PyTorch Lightning and manages run configurations with Hydra. We provide a conda environment for quick setup. Assuming conda is installed, run:

conda env create -f environment.yml
conda activate sm
pip install --no-dependencies git+https://github.com/Soldelli/residualvit
export PYTHONPATH="$PYTHONPATH:$PWD"
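
After running the commands above, a quick check can confirm that the main packages are importable. The snippet below is a minimal sketch, not part of the repository. The module names in `REQUIRED_MODULES` are assumptions about what `environment.yml` installs, and `check_modules` is a hypothetical helper; adjust the names to match your environment.

```python
import importlib.util

# Assumed top-level module names; they are NOT taken from environment.yml.
REQUIRED_MODULES = ("torch", "pytorch_lightning", "hydra", "aligner")


def check_modules(names=REQUIRED_MODULES):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one status line per module without importing any of them.
    for name, found in check_modules().items():
        print(f"{name:20s} {'ok' if found else 'MISSING'}")
```

Run it from the repository root with the `sm` environment active, so that `aligner` is resolved through `PYTHONPATH`.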

🎯 Zero Shot Evaluation

To evaluate a model on a particular dataset, follow the template below:

python -m aligner command=evaluate encoder=$MODEL data=$DATASET output_dir=$OUTPUT_DIR
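
To evaluate several encoders in a row, the template above can be filled in programmatically. The following is a minimal sketch under stated assumptions: `build_eval_command` and `build_sweep` are hypothetical helpers, the `charades_sta` dataset key and the `outputs/<dataset>/<encoder>` layout are illustrative guesses (check the config directory for the real dataset names), and only the OpenCLIP encoder names come from this README.

```python
import shlex

# Encoder names as listed under "Supported encoders".
ENCODERS = ("openclip_vit_b_32", "openclip_vit_b_16", "openclip_vit_l_14")


def build_eval_command(encoder, dataset, output_dir):
    """Fill in the evaluation template for one encoder/dataset pair."""
    return [
        "python", "-m", "aligner",
        "command=evaluate",
        f"encoder={encoder}",
        f"data={dataset}",
        f"output_dir={output_dir}",
    ]


def build_sweep(dataset, output_root, encoders=ENCODERS):
    """Build one command per encoder, each with its own output directory."""
    return [
        build_eval_command(enc, dataset, f"{output_root}/{dataset}/{enc}")
        for enc in encoders
    ]


if __name__ == "__main__":
    # "charades_sta" is a hypothetical dataset key used for illustration only.
    for cmd in build_sweep("charades_sta", "outputs"):
        print(shlex.join(cmd))  # dry run: print the commands instead of running them
```

The script only prints the commands. To execute them, pass each list to `subprocess.run(cmd, check=True)` from the repository root.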

📊 Supported datasets

Dataset	Annotations	Videos
Charades-STA	Download	Website
ActivityNet-Captions	Download	Website

🤖 Supported encoders

OpenCLIP. This repository supports these available OpenCLIP models.
Available encoders: openclip_vit_b_32, openclip_vit_b_16, openclip_vit_l_14.
ResidualViT. See scripts for examples of how to use this encoder, and visit the official ResidualViT codebase for the training code.

📂 Repository Structure

zs-video-eval/
├── aligner/          # Core source code
├── config/           # Config files for encoders and datasets
├── scripts/          # Training scripts
├── environment.yml   # Python dependencies
├── LICENSE.md        # Project license
├── README.md         # Project documentation
└── ...

💡 Citation

If you use this code or find it helpful in your research, please cite our papers:

@inproceedings{soldan2025residualvit,
  title={ResidualViT for Efficient Temporally Dense Video Encoding},
  author={Soldan, Mattia and Caba Heilbron, Fabian and Ghanem, Bernard and Sivic, Josef and Russell, Bryan},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}

@article{castro2022fitclip,
  title={{FitCLIP}: Refining large-scale pretrained image-text models for zero-shot video understanding tasks},
  author={Castro, Santiago and Caba Heilbron, Fabian},
  journal={arXiv preprint arXiv:2203.13371},
  year={2022}
}

🙏 Acknowledgements

This repository is built on top of FitCLIP. We thank our collaborators and the open-source community.

📜 License

This project is licensed under the ADOBE RESEARCH LICENSE.
