On the Consistency of Video Large Language Models in Temporal Comprehension

[CVPR 2025] Official repository of the paper (arXiv:2411.12951).

News

Introduction


  • We study the model’s consistency in temporal comprehension by assessing whether its responses align with the initial grounding, using dedicated probes and datasets. We specifically focus on video temporal grounding, where the task involves identifying timestamps in a video that correspond to language queries.

Download

You can download the complete annotations for consistency evaluation from Hugging Face. The source videos are available via the following links:
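As for the annotations, here is a minimal download sketch using huggingface_hub; it assumes the annotations are hosted as a Hugging Face dataset repository, and the repo_id below is only a placeholder for the actual dataset ID linked above:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-username>/<consistency-annotations>",  # placeholder; use the dataset ID from the link above
    repo_type="dataset",
    local_dir="data/annotations",  # any local path that matches your evaluation config
)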

Evaluation

Before starting the evaluation, make sure you have prepared the annotations and videos, and check the configuration of the Video-LLMs. Install the necessary dependencies for your model using conda and pip. You may also run utils/shift_video.py with the correct paths to prepare the shifted videos. Here we provide an example with TimeChat; additional baseline models will be included in the future.

To run the evaluation, use the following command:

python run.py --model_type TimeChat --dset_name activitynet --task consistency

--dset_name specifies the test dataset, either charades or activitynet. --task specifies the evaluation task, either consistency or grounding; if set to grounding, the evaluation is performed on the original test set. You can also add the --debug flag to verify your configuration settings before running the full evaluation.
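For example, using only the flags described above, a quick configuration check on the grounding task for Charades-STA might look like:

python run.py --model_type TimeChat --dset_name charades --task grounding --debug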

Once the evaluation is complete, the performance will be reported in consistency_eval_results.json, and you can check the model's output in consistency_predictions.jsonl.
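As a minimal sketch (assuming the results file is standard JSON and the predictions file is JSON Lines), the outputs can be inspected with a few lines of Python:

import json

with open("consistency_eval_results.json") as f:
    results = json.load(f)  # aggregated consistency metrics
print(results)

with open("consistency_predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f]  # one prediction record per line
print(len(predictions), "prediction records")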

Training

For VTune, please download the training annotations for each dataset from Hugging Face. The hyperparameters should align with those specified in Appendix Table 11.

Important: For VTune on Charades-STA, the hyperparameters iters_per_epochs and warmup_steps should be 24811 and 14916, respectively. Note that these values differ from those listed in Table 11; we apologize for any inconvenience.

For evaluation, please download the checkpoints for each dataset from the links below:

Then, use the following command:

python run.py --model_type TimeChat --dset_name activitynet --fine_tuned --task consistency

Including the --fine_tuned option automatically switches the checkpoint path ckpt to activitynet_ckpt in timechat/eval_configs/timechat.yaml.
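The repository's actual implementation may differ; the sketch below only illustrates the idea, assuming the YAML keeps the ckpt and activitynet_ckpt keys under a model section:

import yaml  # pyyaml

def resolve_ckpt(cfg_path="timechat/eval_configs/timechat.yaml",
                 dset_name="activitynet", fine_tuned=False):
    """Illustrative only: pick the checkpoint path based on --fine_tuned."""
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    model_cfg = cfg["model"]  # assumed layout; the real config may nest keys differently
    if fine_tuned:
        model_cfg["ckpt"] = model_cfg[f"{dset_name}_ckpt"]  # e.g. activitynet_ckpt
    return model_cfg["ckpt"]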

Citation

If you find our work useful, please consider citing our paper:

@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}

Acknowledgement

We appreciate the following awesome Video-LLMs:
