Description
Hello, could you add our paper, which trains a Video-LLM based on trajectory tokens and was just accepted to ICCV 2025? We introduce grounded video tokenization, a paradigm that organizes tokens around panoptic sub-object trajectories rather than fixed patches. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. We also show that TrajViT is a stronger video encoder than ViT3D for modern Video-LLMs, achieving an average 5.2% performance improvement across 6 VideoQA benchmarks while training 4x faster and using 18x fewer inference FLOPs.
paper link: https://arxiv.org/abs/2505.23617
project page: https://raivnlab.github.io/trajvit/
code: https://github.com/RAIVNLab/trajvit
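
For anyone skimming this issue, here is a minimal, hypothetical sketch of the "one trajectory, one token" idea: pool per-frame features over panoptic sub-object trajectory IDs so that each trajectory contributes exactly one token, instead of one token per fixed patch. The shapes, the `trajectory_tokens` helper, and the mean-pooling choice are illustrative assumptions, not the paper's actual method; see the code repo above for the real implementation.

```python
import torch

def trajectory_tokens(features, traj_ids, num_trajs):
    # features: (T, H, W, D) per-frame feature map
    # traj_ids: (T, H, W) integer trajectory ID per location, in [0, num_trajs)
    # NOTE: mean-pooling over trajectories is an illustrative assumption,
    # not TrajViT's exact aggregation.
    T, H, W, D = features.shape
    flat_feats = features.reshape(-1, D)        # (T*H*W, D)
    flat_ids = traj_ids.reshape(-1)             # (T*H*W,)
    tokens = torch.zeros(num_trajs, D)
    tokens.index_add_(0, flat_ids, flat_feats)  # sum features per trajectory
    counts = torch.bincount(flat_ids, minlength=num_trajs).clamp(min=1)
    return tokens / counts.unsqueeze(1)         # one token per trajectory

# 8 frames of 16x16 features with 32 trajectories -> 32 tokens,
# regardless of video length, versus 8*16*16 = 2048 fixed-patch tokens.
feats = torch.randn(8, 16, 16, 256)
ids = torch.randint(0, 32, (8, 16, 16))
print(trajectory_tokens(feats, ids, 32).shape)  # torch.Size([32, 256])
```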
bibtex:
@article{zheng2025one,
  title={One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory},
  author={Zheng, Chenhao and Zhang, Jieyu and Salehi, Mohammadreza and Gao, Ziqi and Iyengar, Vishnu and Kobori, Norimasa and Kong, Quan and Krishna, Ranjay},
  journal={arXiv preprint arXiv:2505.23617},
  year={2025}
}