    Repositories list

    • GenExam

      Public
      GenExam: A Multidisciplinary Text-to-Image Exam
      Python
      34600 Updated Nov 24, 2025
    • VideoChat-Flash

      Public
      VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
      Python
      14485100 Updated Nov 18, 2025
    • SDLM

      Public
      Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-cache compatibility, achieving high efficiency and throughput.
      Python
      17400 Updated Nov 17, 2025
    • Vlaser

      Public
      Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
      Python
      03220 Updated Nov 7, 2025
    • MetaCaptioner

      Public
      Python
      33910 Updated Oct 31, 2025
    • SID-VLN

      Public
      Official implementation of: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale
      Python
      2900 Updated Oct 29, 2025
    • ExpVid

      Public
      0600 Updated Oct 28, 2025
    • VideoChat-R1

      Public
      [NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
      Python
      9229200 Updated Oct 18, 2025
    • NaViL

      Public
      Python
      78500 Updated Oct 10, 2025
    • ScaleCUA

      Public
      ScaleCUA provides open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
      Python
      5190070 Updated Oct 3, 2025
    • PonderV2

      Public
      [T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
      Python
      936300 Updated Sep 30, 2025
    • InternVL

      Public
      [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
      Python
      7379.5k2795 Updated Sep 22, 2025
    • EgoExoLearn

      Public
      [CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
      Python
      27330 Updated Aug 26, 2025
    • VRBench

      Public
      [ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
      Python
      02100 Updated Aug 8, 2025
    • InternVideo

      Public
      [ECCV2024] Video Foundation Models & Data for Multimodal Understanding
      Python
      1322.1k1323 Updated Aug 7, 2025
    • PIIP

      Public
      [NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)
      Python
      510520 Updated Aug 5, 2025
    • GUI-Odyssey

      Public
      [ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 212 apps, and 1.4K app combos.
      Python
      813490 Updated Aug 4, 2025
    • LORIS

      Public
      [ICML2023] Long-Term Rhythmic Video Soundtracker
      Python
      16110 Updated Jul 28, 2025
    • TPO

      Public
      Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
      Jupyter Notebook
      26210 Updated Jul 22, 2025
    • Docopilot

      Public
      [CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
      Python
      13520 Updated Jul 22, 2025
    • Mono-InternVL

      Public
      [CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
      Python
      09460 Updated Jul 18, 2025
    • ZeroGUI

      Public
      ZeroGUI: Automating Online GUI Learning at Zero Human Cost
      Python
      710100 Updated Jul 17, 2025
    • MUTR

      Public
      [AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
      Python
      78230 Updated Jun 13, 2025
    • PVC

      Public
      [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
      Python
      15040 Updated Jun 12, 2025
    • FluxViT

      Public
      Make Your Training Flexible: Towards Deployment-Efficient Video Models
      Python
      03410 Updated Jun 11, 2025
    • VeBrain

      Public
      Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
      78640 Updated Jun 6, 2025
    • EfficientQAT

      Public
      [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
      Python
      23313100 Updated May 22, 2025
    • OmniQuant

      Public
      [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
      Python
      72875261 Updated May 22, 2025
    • EgoVideo

      Public
      [CVPR 2024 Champions][ICLR 2025] Solutions for EgoVis Challenges in CVPR 2024
      Jupyter Notebook
      413290 Updated May 11, 2025
    • OmniCorpus

      Public
      [ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
      Python
      740400 Updated May 5, 2025