
Could you add an ICCV 2025 paper that trains a Video-LLM based on trajectory tokens? #255

@hellomuffin

Description

Hello, could you add our paper, which trains a Video-LLM based on trajectory tokens and was just accepted to ICCV 2025? We introduce grounded video tokenization, a paradigm that organizes tokens around panoptic sub-object trajectories rather than fixed patches. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. We also show that TrajViT is a stronger video encoder than ViT3D for modern Video-LLMs, obtaining an average 5.2% performance improvement across 6 VideoQA benchmarks while training 4x faster and using 18x fewer inference FLOPs.
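
To make the token-budget difference concrete, here is a minimal illustrative sketch (not our released code): a ViT3D-style tokenizer emits one token per fixed space-time patch, while grounded tokenization emits one token per sub-object trajectory. The `(N, T, H, W)` trajectory-mask format and the mask-average pooling below are simplifying assumptions, standing in for the actual tracker and learned trajectory encoder.

```python
# Illustrative sketch only: contrasts fixed-patch token counts with
# one-token-per-trajectory tokenization. Mask format and pooling are
# placeholder assumptions, not the paper's pipeline.
import torch

def num_fixed_patch_tokens(video, patch=16, tubelet=2):
    """ViT3D-style count: one token per (tubelet x patch x patch) volume."""
    t, c, h, w = video.shape
    return (t // tubelet) * (h // patch) * (w // patch)

def trajectory_tokens(video, trajectory_masks):
    """One token per sub-object trajectory.

    video: (T, C, H, W) float tensor.
    trajectory_masks: (N, T, H, W) bool tensor, one mask sequence per tracked
    sub-object. Each trajectory is pooled into a single token by averaging
    the pixels it covers (a stand-in for a learned trajectory encoder).
    """
    pixels_first = video.permute(1, 0, 2, 3)        # (C, T, H, W)
    tokens = [pixels_first[:, m].mean(dim=1)        # (C,) per trajectory
              for m in trajectory_masks]
    return torch.stack(tokens)                      # (N, C)

# Example: a 16-frame 224x224 clip.
video = torch.randn(16, 3, 224, 224)
print(num_fixed_patch_tokens(video))                      # 8 * 14 * 14 = 1568 tokens

masks = torch.zeros(50, 16, 224, 224, dtype=torch.bool)   # 50 dummy trajectories
masks[:, :, :8, :8] = True
print(trajectory_tokens(video, masks).shape)              # torch.Size([50, 3]) -> 50 tokens
```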

paper link: https://arxiv.org/abs/2505.23617
project page: https://raivnlab.github.io/trajvit/
code: https://github.com/RAIVNLab/trajvit
bibtex:
@Article{zheng2025one,
title={One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory},
author={Zheng, Chenhao and Zhang, Jieyu and Salehi, Mohammadreza and Gao, Ziqi and Iyengar, Vishnu and Kobori, Norimasa and Kong, Quan and Krishna, Ranjay},
journal={arXiv preprint arXiv:2505.23617},
year={2025}
}
