- [2025.03.25] Evaluation Codes have been released.
- [2025.02.27] Our paper has been accepted by CVPR 2025! 🎉
- [2025.01.15] We are excited to share that our evaluation datasets, Charades-CON and ActivityNet-CON, are now available on Hugging Face! 🎉 Additionally, the training annotations for VTune have also been released.
- [2025.01.14] We have released our four checkpoints trained with VTune: VideoLLaMA-7B-Charades-VTune, VideoLLaMA-7B-ActivityNet-VTune, TimeChat-7B-Charades-VTune, and TimeChat-7B-ActivityNet-VTune. Additionally, checkpoints with naive fine-tuning have been released: VideoLLaMA-7B-Charades-FT, VideoLLaMA-7B-ActivityNet-FT, and TimeChat-7B-ActivityNet-FT.
- [2024.11.20] Our paper has been released on arXiv.
- We study the model’s consistency in temporal comprehension by assessing whether its responses align with the initial grounding, using dedicated probes and datasets. We specifically focus on video temporal grounding, where the task involves identifying timestamps in a video that correspond to language queries.
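To make the setup concrete, below is a minimal sketch of the quantities involved; the field names and the IoU-based consistency check are illustrative assumptions, not the exact evaluation protocol or annotation schema.

```python
# Illustrative only: field names and the IoU threshold below are assumptions,
# not the exact schema of Charades-CON / ActivityNet-CON.
original = {"query": "the person opens the door", "gt_span": [4.2, 9.8]}  # seconds
prediction_original = [4.0, 10.1]  # model's grounding for the original query
prediction_probe = [4.5, 9.5]      # model's answer to a probed (e.g., rephrased) query

def iou(a, b):
    """Temporal IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

# One simple notion of consistency: the probed answer still overlaps the
# initial grounding above a threshold (0.5 is a common choice).
consistent = iou(prediction_original, prediction_probe) >= 0.5
print(iou(prediction_original, prediction_probe), consistent)
```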
You can download the complete annotations for consistency evaluation from Hugging Face. The source videos are available via the following links:
Before starting the evaluation, make sure you have prepared the annotations and videos. You should also check the configuration of the Video-LLMs and install the necessary dependencies using conda and pip for your model. Additionally, you may run `utils/shift_video.py` with the appropriate paths to prepare the shifted videos.
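For reference, a shifted-video preparation run might look like the sketch below; the argument names are assumptions, so check the script's `--help` for its actual interface.

```python
# Hypothetical invocation of utils/shift_video.py; the flags --video_dir and
# --output_dir are assumptions, not the script's documented CLI.
import subprocess

subprocess.run(
    [
        "python", "utils/shift_video.py",
        "--video_dir", "data/activitynet/videos",          # original videos (assumed layout)
        "--output_dir", "data/activitynet/videos_shifted",  # where shifted videos are written
    ],
    check=True,
)
```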
Here, we provide an example with the model TimeChat. We will include additional baseline models in the future.
To run the evaluation, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --task consistency
```
`dset_name` refers to the test dataset, which can be either `charades` or `activitynet`. `task` refers to the evaluation task, either `consistency` or `grounding`; if set to `grounding`, the evaluation is performed on the original test set. You can also use the `--debug` flag before performing the actual evaluation to verify your configuration settings.

Once the evaluation is complete, the performance is reported in `consistency_eval_results.json`, and you can check the model's outputs in `consistency_predictions.jsonl`.
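If you want to inspect the outputs programmatically, a minimal sketch is shown below; it assumes only the file names above, not any particular internal structure of the results.

```python
import json

# Aggregate metrics (the exact structure depends on the evaluation script's output).
with open("consistency_eval_results.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))

# Per-sample predictions, one JSON object per line.
with open("consistency_predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f if line.strip()]
print(f"{len(predictions)} predictions loaded")
```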
For VTune, please download the training annotations for each dataset from Hugging Face. The hyperparameters should align with those specified in Appendix Table 11.
Important: For VTune on Charades-STA, the hyperparameters `iters_per_epochs` and `Warmup_steps` should be `24811` and `14916`, respectively. Please note that these values differ from those listed in Table 11. We apologize for any inconvenience caused.
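As a convenience, the snippet below shows one way to patch those two values into a training config with PyYAML; the config path, the `run` nesting, and the key casing are assumptions, so adapt them to the actual TimeChat training config.

```python
# A minimal sketch: overwrite the two Charades-STA-specific hyperparameters.
# The config path and the nesting under "run" are assumptions; only the key
# names and values come from the note above.
import yaml

cfg_path = "timechat/train_configs/charades_vtune.yaml"  # hypothetical path

with open(cfg_path) as f:
    cfg = yaml.safe_load(f) or {}

run_cfg = cfg.setdefault("run", {})
run_cfg["iters_per_epochs"] = 24811
run_cfg["Warmup_steps"] = 14916

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```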
For evaluation, please download the checkpoints for each dataset from the links below:
Then, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --fine_tuned --task consistency
```
Including the `--fine_tuned` option will automatically switch the checkpoint path from `ckpt` to `activitynet_ckpt` in `timechat/eval_configs/timechat.yaml`.
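Conceptually, the switch amounts to something like the sketch below; the `model` nesting shown here is an assumption based on the key names above, not the exact contents of `timechat.yaml`.

```python
# A sketch of what --fine_tuned does conceptually: point evaluation at the
# fine-tuned checkpoint entry instead of the default one. Only the key names
# ckpt and activitynet_ckpt come from the instructions above.
import yaml

FINE_TUNED = True  # corresponds to passing --fine_tuned on the command line

with open("timechat/eval_configs/timechat.yaml") as f:
    cfg = yaml.safe_load(f)

model_cfg = cfg.get("model", cfg)  # fall back to top level if there is no "model" section
if FINE_TUNED:
    model_cfg["ckpt"] = model_cfg["activitynet_ckpt"]
```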
If you find our work useful, please consider citing our paper:
```bibtex
@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}
```
We appreciate the following awesome Video-LLMs: