TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

Project Page | arXiv Preprint | VER Dataset

TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of untrimmed videos.

[TEMPURA teaser figure]

Installation

bash scripts/install/install.sh

VER Dataset Preparation

Please refer to VER Dataset on Hugging Face for dataset downloading.

Place the downloaded JSON files in data/VER/jsons/, and store the processed video frames (extracted from YT-Temporal-1B) in data/yt1b/processed_frames. Our experiments sampled frames at 1 FPS, but you may adjust the frame rate to fit your needs.
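For reference, a minimal expected layout (only the two directories named above come from this README; how frames are organized inside processed_frames, e.g. one folder per video, is an assumption):

data/
├── VER/
│   └── jsons/               # annotation JSON files from the VER dataset
└── yt1b/
    └── processed_frames/    # frames extracted at 1 FPS from YT-Temporal-1B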

Model Weights

TEMPURA-Qwen2.5-VL-3B-s1 (stage 1: masked event prediction)

TEMPURA-Qwen2.5-VL-3B-s2 (stage 2: video event segmentation and temporal dense captioning)
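To fetch a checkpoint from the command line, here is a sketch using the Hugging Face CLI (the repository ID below is an assumption based on the link name above; substitute the actual ID from the link):

huggingface-cli download Andy-Cheng/TEMPURA-Qwen2.5-VL-3B-s2 --local-dir checkpoints/TEMPURA-Qwen2.5-VL-3B-s2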

Inference

An example of generating timestamp-aligned captions:

python src/inference/dense_video_captioning.py

Demo App

python src/serve/app.py

Training

By default, we use DeepSpeed ZeRO-2. You may switch to ZeRO-3 to further reduce GPU memory usage.
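For example, assuming a ZeRO-3 config file sits next to the default one (the scripts/zero3.json path is an assumption; point the flag at your own config), pass it via the training script's --deepspeed flag:

--deepspeed scripts/zero3.json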

Masked Event Prediction

bash scripts/train/masked_event_prediction_3B_8H100.sh

Video Event Segmentation and Temporal Dense Captioning

You can set MODEL_PATH to the model weights produced by the masked event prediction training stage; see the sketch after the command below.

bash scripts/train/dense_event_caption_3B_8H100.sh
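If MODEL_PATH is defined inside the script, edit it there before launching. Assuming the script also honors an environment override (an assumption about how the script is written; the output path below is likewise hypothetical), an inline invocation would look like:

MODEL_PATH=output/masked_event_prediction_3b bash scripts/train/dense_event_caption_3B_8H100.sh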
Training arguments
  • --deepspeed (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
  • --data_path (str): Path to the LLaVA formatted training data (a JSON file). (Required)
  • --image_folder (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
  • --model_id (str): Path to the Qwen2-VL model. (Required)
  • --output_dir (str): Output directory for model checkpoints.
  • --num_train_epochs (int): Number of training epochs (default: 1).
  • --per_device_train_batch_size (int): Training batch size per GPU per forward pass.
  • --gradient_accumulation_steps (int): Gradient accumulation steps (default: 4).
  • --freeze_vision_tower (bool): Option to freeze vision_model (default: False).
  • --freeze_llm (bool): Option to freeze LLM (default: False).
  • --tune_merger (bool): Option to tune projector (default: True).
  • --num_lora_modules (int): Number of target modules to add LoRA (-1 means all layers).
  • --vision_lr (float): Learning rate for vision_model.
  • --merger_lr (float): Learning rate for merger (projector).
  • --learning_rate (float): Learning rate for language module.
  • --bf16 (bool): Option for using bfloat16.
  • --fp16 (bool): Option for using fp16.
  • --image_min_pixels (int): Option for minimum input pixels for image.
  • --image_max_pixles (int): Option for maximum input pixels for image.
  • --video_min_pixels (int): Option for minimum input pixels for video.
  • --video_max_pixles (int): Option for maximum input pixels for video.
  • --lora_enable (bool): Option for using LoRA.
  • --vision_lora (bool): Option for including vision_tower in LoRA module. lora_enable should be True to use this option.
  • --use_dora (bool): Option for using DoRA instead of LoRA. lora_enable should be True to use this option.
  • --lora_namespan_exclude (str): Exclude modules whose names match these namespans from LoRA.
  • --max_seq_length (int): Maximum sequence length (default: 32K).
  • --bits (int): Quantization bits (default: 16).
  • --disable_flash_attn2 (bool): Disable Flash Attention 2.
  • --report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
  • --logging_dir (str): Logging directory (default: "./tf-logs").
  • --lora_rank (int): LoRA rank (default: 128).
  • --lora_alpha (int): LoRA alpha (default: 256).
  • --lora_dropout (float): LoRA dropout (default: 0.05).
  • --logging_steps (int): Logging steps (default: 1).
  • --dataloader_num_workers (int): Number of data loader workers (default: 4).
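To make the flags concrete, here is a hedged sketch of how they combine into a single launch (the entry-point path src/train/train.py, the JSON file name, and the model ID are assumptions; in practice these values are set inside the provided scripts/train/*.sh files). The learning rates follow the note below, with --vision_lr set 5x smaller than --learning_rate:

deepspeed src/train/train.py \
    --deepspeed scripts/zero2.json \
    --model_id Qwen/Qwen2.5-VL-3B-Instruct \
    --data_path data/VER/jsons/train.json \
    --image_folder data/yt1b/processed_frames \
    --output_dir output/tempura_3b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-5 \
    --vision_lr 2e-6 \
    --merger_lr 1e-5 \
    --bf16 True \
    --report_to tensorboard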

Note: The learning rate of the vision_model should be 5x ~ 10x smaller than that of the language_model.

Codebase Supported Features

  • DeepSpeed
  • LoRA/QLoRA
  • Full fine-tuning
  • Fine-tuning the vision_model while using LoRA
  • Disabling/enabling Flash Attention 2
  • Multi-image and video training
  • Training optimized with the Liger Kernel

Citing TEMPURA

If you find our paper or dataset useful, please consider citing our work!

@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}

Acknowledgement

We build upon the following repositories:

  • Qwen2-VL-Finetune: An open-source project for fine-tuning Qwen2-VL and Qwen2.5-VL.
  • LLaVA-NeXT: An open-source large multimodal model (LMM) project.
  • Qwen2.5-VL: The pretrained MLLM that TEMPURA builds on.
  • Liger-Kernel: A collection of Triton kernels designed specifically for LLM training.
