- [2025.03.25] Evaluation Codes have been released.
- [2025.02.27] Our paper has been accepted by CVPR 2025! 🎉
- [2025.01.15] We are excited to share that our evaluation datasets, Charades-CON and ActivityNet-CON, are now available on Hugging Face! 🎉 Additionally, the training annotations for VTune have also been released.
- [2025.01.14] We have released our four checkpoints trained with VTune: VideoLLaMA-7B-Charades-VTune, VideoLLaMA-7B-ActivityNet-VTune, TimeChat-7B-Charades-VTune, and TimeChat-7B-ActivityNet-VTune. Additionally, checkpoints with naive fine-tuning have been released: VideoLLaMA-7B-Charades-FT, VideoLLaMA-7B-ActivityNet-FT, and TimeChat-7B-ActivityNet-FT.
- [2024.11.20] Our paper has been released on arXiv.
- We study the model’s consistency in temporal comprehension by assessing whether its responses align with the initial grounding, using dedicated probes and datasets. We specifically focus on video temporal grounding, where the task involves identifying timestamps in a video that correspond to language queries.
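To make the setup concrete, below is a minimal sketch of the quantities involved; the field names and the IoU-based consistency check are illustrative assumptions, not the exact evaluation protocol or annotation schema.

```python
# Illustrative only: field names and the IoU threshold below are assumptions,
# not the exact schema of Charades-CON / ActivityNet-CON.
original = {"query": "the person opens the door", "gt_span": [4.2, 9.8]}  # seconds
prediction_original = [4.0, 10.1]  # model's grounding for the original query
prediction_probe = [4.5, 9.5]      # model's answer to a probed (e.g., rephrased) query

def iou(a, b):
    """Temporal IoU between two [start, end] spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

# One simple notion of consistency: the probed answer still overlaps the
# initial grounding above a threshold (0.5 is a common choice).
consistent = iou(prediction_original, prediction_probe) >= 0.5
print(iou(prediction_original, prediction_probe), consistent)
```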
You can download the complete annotations for consistency evaluation from Hugging Face. The source videos are available via the following links:
Before starting the evaluation, make sure you have prepared the annotations and videos. You should also check the configuration of the Video-LLMs and install the necessary dependencies using conda and pip for your model. Additionally, you may run `utils/shift_video.py` with the appropriate paths to prepare the shifted videos.
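For reference, a shifted-video preparation run might look like the sketch below; the argument names are assumptions, so check the script's `--help` for its actual interface.

```python
# Hypothetical invocation of utils/shift_video.py; the flags --video_dir and
# --output_dir are assumptions, not the script's documented CLI.
import subprocess

subprocess.run(
    [
        "python", "utils/shift_video.py",
        "--video_dir", "data/activitynet/videos",          # original videos (assumed layout)
        "--output_dir", "data/activitynet/videos_shifted",  # where shifted videos are written
    ],
    check=True,
)
```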
Here, we provide an example with the model TimeChat. We will include additional baseline models in the future.
To run the evaluation, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --task consistency
```
`dset_name` refers to the test dataset, which can be either `charades` or `activitynet`. `task` refers to the evaluation task, either `consistency` or `grounding`; if set to `grounding`, the evaluation is performed on the original test set. You can also use the `--debug` flag before performing the actual evaluation to verify your configuration settings.

Once the evaluation is complete, the performance is reported in `consistency_eval_results.json`, and you can check the model's outputs in `consistency_predictions.jsonl`.
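If you want to inspect the outputs programmatically, a minimal sketch is shown below; it assumes only the file names above, not any particular internal structure of the results.

```python
import json

# Aggregate metrics (the exact structure depends on the evaluation script's output).
with open("consistency_eval_results.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))

# Per-sample predictions, one JSON object per line.
with open("consistency_predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f if line.strip()]
print(f"{len(predictions)} predictions loaded")
```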
For VTune, please download the training annotations for each dataset from Hugging Face. The hyperparameters should align with those specified in Appendix Table 11.
Important: For VTune on Charades-STA, the hyperparameters `iters_per_epochs` and `Warmup_steps` should be `24811` and `14916`, respectively. Please note that these values differ from those listed in Table 11. We apologize for any inconvenience caused.
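As a convenience, the snippet below shows one way to patch those two values into a training config with PyYAML; the config path, the `run` nesting, and the key casing are assumptions, so adapt them to the actual TimeChat training config.

```python
# A minimal sketch: overwrite the two Charades-STA-specific hyperparameters.
# The config path and the nesting under "run" are assumptions; only the key
# names and values come from the note above.
import yaml

cfg_path = "timechat/train_configs/charades_vtune.yaml"  # hypothetical path

with open(cfg_path) as f:
    cfg = yaml.safe_load(f) or {}

run_cfg = cfg.setdefault("run", {})
run_cfg["iters_per_epochs"] = 24811
run_cfg["Warmup_steps"] = 14916

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```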
For evaluation, please download the checkpoints for each dataset from the links below:
Then, use the following command:
```bash
python run.py --model_type TimeChat --dset_name activitynet --fine_tuned --task consistency
```
Including the `--fine_tuned` option will automatically switch the checkpoint path from `ckpt` to `activitynet_ckpt` in `timechat/eval_configs/timechat.yaml`.
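Conceptually, the switch amounts to something like the sketch below; the `model` nesting shown here is an assumption based on the key names above, not the exact contents of `timechat.yaml`.

```python
# A sketch of what --fine_tuned does conceptually: point evaluation at the
# fine-tuned checkpoint entry instead of the default one. Only the key names
# ckpt and activitynet_ckpt come from the instructions above.
import yaml

FINE_TUNED = True  # corresponds to passing --fine_tuned on the command line

with open("timechat/eval_configs/timechat.yaml") as f:
    cfg = yaml.safe_load(f)

model_cfg = cfg.get("model", cfg)  # fall back to top level if there is no "model" section
if FINE_TUNED:
    model_cfg["ckpt"] = model_cfg["activitynet_ckpt"]
```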
If you find our work useful, please consider citing our paper:
```bibtex
@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}
```
We appreciate the following awesome Video-LLMs: