CVPR 2025 decisions are now available on OpenReview! Acceptance rate: 22.1% = 2878 / 13008
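Note 1: Everyone is welcome to open an issue to share CVPR 2025 papers and open-source projects!
Note 2: For CV top-conference papers from previous years, other high-quality CV papers, and roundups, see: https://github.com/amusi/daily-paper-computer-vision
Scan the QR code to join the [CVer Academic Exchange Group] for the latest work from CVPR 2025 and beyond! This is the largest computer vision AI Knowledge Planet (知识星球)! Updated daily, it shares the newest learning materials on computer vision, AIGC, diffusion models, multimodal learning, deep learning, autonomous driving, medical imaging, remote sensing, and more. Join now and start learning!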
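
The acceptance rate quoted in the announcement can be checked with a few lines of Python. This is only a sanity-check sketch: the two counts are copied from the announcement line, and the helper name `acceptance_rate` is a hypothetical one introduced here for illustration.

```python
# Figures copied from the announcement line of this README.
accepted = 2878
submitted = 13008


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage (hypothetical helper)."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return 100.0 * accepted / submitted


rate = acceptance_rate(accepted, submitted)
print(f"{rate:.1f}% = {accepted} / {submitted}")  # prints "22.1% = 2878 / 13008"
```
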
- 3DGS(Gaussian Splatting)
 - Agent
 - Avatars
 - Backbone
 - CLIP
 - Mamba
 - Embodied AI
 - GAN
 - GNN
 - Multimodal Large Language Models (MLLM)
 - Large Language Models (LLM)
 - NAS
 - OCR
 - NeRF
 - DETR
 - Diffusion Models
 - ReID (Re-Identification)
 - Long-Tail Distribution
 - Vision Transformer
 - Vision-Language
 - Self-supervised Learning
 - Data Augmentation
 - Object Detection
 - Anomaly Detection
 - Visual Tracking
 - Semantic Segmentation
 - Instance Segmentation
 - Panoptic Segmentation
 - Medical Image
 - Medical Image Segmentation
 - Video Object Segmentation
 - Video Instance Segmentation
 - Referring Image Segmentation
 - Image Matting
 - Image Editing
 - Low-level Vision
 - Super-Resolution
 - Denoising
 - Deblur
 - Autonomous Driving
 - 3D Point Cloud
 - 3D Object Detection
 - 3D Semantic Segmentation
 - 3D Object Tracking
 - 3D Semantic Scene Completion
 - 3D Registration
 - 3D Human Pose Estimation
 - 3D Human Mesh Estimation
 - 3D Visual Grounding
 - Image Generation
 - Video Generation
 - 3D Generation
 - Video Understanding
 - Action Detection
 - Embodied AI
 - Text Detection
 - Knowledge Distillation
 - Model Pruning
 - Image Compression
 - 3D Reconstruction
 - Depth Estimation
 - Trajectory Prediction
 - Lane Detection
 - Image Captioning
 - Visual Question Answering
 - Sign Language Recognition
 - Video Prediction
 - Novel View Synthesis
 - Zero-Shot Learning
 - Stereo Matching
 - Feature Matching
 - Low-light Image Enhancement
 - Scene Graph Generation
 - Style Transfer
 - Implicit Neural Representations
 - Image Quality Assessment
 - Video Quality Assessment
 - Compressive Sensing
 - Datasets
 - New Tasks
 - Others
 
SpiritSight Agent: Advanced GUI Agent with One Look
Building Vision Models upon Heat Conduction
LSNet: See Large, Focus Small
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
MambaIC: State Space Models for High-Performance Learned Image Compression
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
- Project: https://ai4ce.github.io/CityWalker/
 - Paper: https://arxiv.org/abs/2411.17820
 - Code: https://github.com/ai4ce/CityWalker
 
Mr. DETR: Instructive Multi-Route Training for Detection Transformers
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
Retrieval-Augmented Personalization for Multimodal Large Language Models
- Project Page: https://hoar012.github.io/RAP-Project/
 - Paper: https://arxiv.org/abs/2410.13360
 - Code: https://github.com/Hoar012/RAP-MLLM
 
BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
MMRL: Multi-Modal Representation Learning for Vision-Language Models
PAVE: Patching and Adapting Video Large Language Models
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization
AirRoom: Objects Matter in Room Reidentification
- Project: https://sairlab.org/airroom/
 - Paper: https://arxiv.org/abs/2503.01130
 
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
TinyFusion: Diffusion Transformers Learned Shallow
DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
Tiled Diffusion
- Homepage: https://madaror.github.io/tiled-diffusion.github.io/
 - Paper: https://arxiv.org/abs/2412.15185
 - Code: https://github.com/madaror/tiled-diffusion
 
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
MMRL: Multi-Modal Representation Learning for Vision-Language Models
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Mr. DETR: Instructive Multi-Route Training for Detection Transformers
Multiple Object Tracking as ID Prediction
Omnidirectional Multi-Object Tracking
BrainMVP: Multi-modal Vision Pre-training for Medical Image Analysis
Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
- Project: https://ldkong.com/LiMoE
 - Paper: https://arxiv.org/abs/2501.04004
 - Code: https://github.com/Xiangxu-0103/LiMoE
 
Unlocking Generalization Power in LiDAR Point Cloud Registration
AESOP: Auto-Encoded Supervision for Perceptual Image Super-Resolution
- Paper: https://arxiv.org/abs/2412.00124
 - Code: https://github.com/2minkyulee/AESOP-Auto-Encoded-Supervision-for-Perceptual-Image-Super-Resolution
 
Reconstructing Humans with a Biomechanically Accurate Skeleton
# 3D Visual Grounding
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- Homepage: https://byteflow-ai.github.io/TokenFlow/
 - Code: https://github.com/ByteFlow-AI/TokenFlow
 - Paper: https://arxiv.org/abs/2412.03069
 
PAR: Parallelized Autoregressive Visual Generation
- Project: https://epiphqny.github.io/PAR-project/
 - Paper: https://arxiv.org/abs/2412.15119
 - Code: https://github.com/Epiphqny/PAR
 
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
- Project: https://generative-photography.github.io/project/
 - Paper: https://arxiv.org/abs/2412.02168
 - Code: https://github.com/pandayuanyu/generative-photography
 
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
- Project Page: https://opening-benchmark.github.io/
 - Paper: https://arxiv.org/abs/2411.18499
 - Code: https://github.com/LanceZPF/OpenING
 
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
X-Dyna: Expressive Dynamic Human Image Animation
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- Project: https://liewfeng.github.io/TeaCache/
 - Paper: https://arxiv.org/abs/2411.19108
 - Code: https://github.com/ali-vilab/TeaCache
 
AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
- Project: https://iva-mzsun.github.io/AR-Diffusion
 - Paper: https://arxiv.org/abs/2503.07418
 - Code: https://github.com/iva-mzsun/AR-Diffusion
 
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
h-Edit: Effective and Flexible Diffusion-Based Editing via Doob’s h-Transform
Generative Gaussian Splatting for Unbounded 3D City Generation
- Project: https://haozhexie.com/project/gaussian-city
 - Paper: https://arxiv.org/abs/2406.06526
 - Code: https://github.com/hzxie/GaussianCity
 
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
- Project: https://stdgen.github.io/
 - Paper: https://arxiv.org/abs/2411.05738
 - Code: https://github.com/hyz317/StdGEN
 
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
- Project: https://fast3r-3d.github.io/
 - Paper: https://arxiv.org/abs/2501.13928
 
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
- Project: https://4dvlab.github.io/project_page/semgeomo/
 - Paper: https://arxiv.org/abs/2503.01291
 - Code: https://github.com/4DVLab/SemGeoMo
 
Temporal Grounding Videos like Flipping Manga
Universal Actions for Enhanced Embodied Foundation Models
- Project: https://2toinf.github.io/UniAct/
 - Paper: https://arxiv.org/abs/2501.10105
 - Code: https://github.com/2toinf/UniAct
 
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
- Project: https://depthcrafter.github.io
 - Paper: https://arxiv.org/abs/2409.02095
 - Code: https://github.com/Tencent/DepthCrafter
 
MonSter: Marry Monodepth to Stereo Unleashes Power
DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
- Project: https://insta360-research-team.github.io/DEFOM-Stereo/
 - Paper: https://arxiv.org/abs/2501.09466
 - Code: https://github.com/Insta360-Research-Team/DEFOM-Stereo
 
MonSter: Marry Monodepth to Stereo Unleashes Power
HVI: A New Color Space for Low-light Image Enhancement
- Paper: https://arxiv.org/abs/2502.20272
 - Code: https://github.com/Fediory/HVI-CIDNet
 - Demo: https://huggingface.co/spaces/Fediory/HVI-CIDNet_Low-light-Image-Enhancement_
 
ReDDiT: Efficient Diffusion as Low Light Enhancer
MambaIC: State Space Models for High-Performance Learned Image Compression
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
- Project: https://stylestudio-official.github.io/
 - Paper: https://arxiv.org/abs/2412.08503
 - Code: https://github.com/Westlake-AGI-Lab/StyleStudio
 
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
- Homepage: https://yichengchen24.github.io/projects/autocherrypicker
 - Paper: https://arxiv.org/pdf/2406.20085
 - Code: https://github.com/yichengchen24/ACP
 
Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing
Objaverse++: Curated 3D Object Dataset with Quality Annotations
DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry
Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector
