# ECCV-2024-Oral

## 2D Scene Understanding

- Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
- Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
- Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
- WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
- OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
- Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation
- ESA: Annotation-Efficient Active Learning for Semantic Segmentation
- Towards Scene Graph Anticipation
- Dataset Enhancement with Instance-Level Augmentations
- An Adaptive Correspondence Scoring Framework for Unsupervised Image Registration of Medical Images
- HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
- Towards Open-ended Visual Quality Comparison
- CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model
- A Fair Ranking and New Model for Panoptic Scene Graph Generation
- Parrot Captions Teach CLIP to Spot Text
- On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
- From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
- SINDER: Repairing the Singular Defects of DINOv2
- Emergent Visual-Semantic Hierarchies in Image-Text Representations
- AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation

## 3D Scene Understanding

- OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
- PointLLM: Empowering Large Language Models to Understand Point Clouds
- Bi-directional Contextual Attention for 3D Dense Captioning
- Watch Your Steps: Local Image and Scene Editing by Text Instructions
- Scene Coordinate Reconstruction
- HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
- RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
- RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
- Grounding Image Matching in 3D with MASt3R
- Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration
- SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

## NeRF / Gaussian

- Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
- MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
- Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
- RaFE: Generative Radiance Fields Restoration
- Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
- FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information

## 2D Generation

- Adversarial Diffusion Distillation
- Adversarial Robustification via Text-to-Image Diffusion Models
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Accelerating Image Generation with Sub-path Linear Approximation Model
- LLMGA: Multimodal Large Language Model based Generation Assistant
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
- ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
- Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
- R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
- SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

## 3D Generation

- LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
- FlashTex: Fast Relightable Mesh Texturing with LightControlNet
- Pyramid Diffusion for Fine 3D Large Scene Generation
- COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
- A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures

## Human

- TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
- Controllable Human-Object Interaction Synthesis
- Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation
- Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
- A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
- ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
- Sapiens: Foundation for Human Vision Models
- Arc2Face: A Foundation Model for ID-Consistent Human Faces

## Video

- PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
- Audio-Synchronized Visual Animation
- LongVLM: Efficient Long Video Understanding via Large Language Models
- ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
- Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
- E3M: Zero-Shot Spatio-Temporal Video Grounding
- Classification Matters: Improving Video Action Detection with Class-Specific Attention
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
- ActionVOS: Actions as Prompts for Video Object Segmentation
- DEVIAS: Learning Disentangled Video Representations of Action and Scene
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models
- Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
- SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
- Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
- Video Editing via Factorized Diffusion Distillation
- Towards Neuro-Symbolic Video Understanding

## LLM / MLLM / VLM

- MMBench: Is Your Multi-modal Model an All-around Player?
- BRAVE: Broadening the visual encoding of vision-language models
- Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
- Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models
- Towards Goal-oriented Large Language Model Prompting: A Survey

## Transformer

- Denoising Vision Transformers

## Diffusion

- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models