SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy
Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik, Lorenzo Arboit, Kun Yuan, Pietro Mascagni, Nicolas Padoy
University of Strasbourg / CNRS / INSERM, ICube UMR7357 · IHU Strasbourg
Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. We propose SurgTEMP, a multimodal LLM framework for surgical video question answering, featuring:
- A Text-guided Memory Pyramid (TEMP) constructor that builds hierarchical spatial and temporal visual memory banks guided by the text query
- A Surgical Competency Progression (SCP) training scheme that progressively builds perception, assessment, and reasoning capabilities
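As a rough intuition for what "text-guided" memory selection could look like, the sketch below ranks per-frame features by cosine similarity to a query embedding and keeps progressively smaller banks. It is a toy illustration only, not the TEMP constructor from the paper: the function name `build_memory_pyramid`, the `budgets` parameter, and the use of plain cosine top-k are all assumptions made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_memory_pyramid(frame_feats, query_feat, budgets=(4, 2, 1)):
    """Hypothetical helper: return one memory bank per budget.

    Each bank holds the indices of the k frames most similar to the
    text query, re-sorted into temporal order.
    """
    scores = [cosine(f, query_feat) for f in frame_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [sorted(ranked[:k]) for k in budgets]

# Toy 2-D "frame features" and a query embedding (made-up values).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [0.1, 0.9]]
query = [0.0, 1.0]
pyramid = build_memory_pyramid(frames, query, budgets=(3, 2, 1))
print(pyramid)  # [[2, 3, 4], [2, 4], [2]]
```

In this toy setup, larger banks retain more temporal context while smaller banks keep only the frames most relevant to the question. The actual model may build and combine its memory banks quite differently.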
We also introduce CholeVidQA-32K, a surgical video QA dataset comprising 32K open-ended QA pairs from 3,855 laparoscopic cholecystectomy segments (~128 h total), organized across 11 tasks spanning perception, assessment, and reasoning.
The following will be released soon:
- Training and inference code
- Pre-trained model weights
- CholeVidQA-32K dataset
@misc{li2026surgtemptemporalawaresurgicalvideo,
title={SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy},
author={Shi Li and Vinkle Srivastav and Nicolas Chanel and Saurav Sharma and Nabani Banik and Lorenzo Arboit and Kun Yuan and Pietro Mascagni and Nicolas Padoy},
year={2026},
eprint={2603.29962},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.29962},
}

This work was funded by the European Union (ERC, CompSURG, 101088553) and French state funds managed by the ANR under Grants ANR-10-IAHU-02, ANR-23-IACL-0004, ANR-10-IDEX-0002, and ANR-20-SFRI-0012, with HPC resources provided by CAMMA, IHU Strasbourg, and Unistra Mesocentre.