From left to right: RGB input and predictions of depth, segmentation, edge, and normal.
- Nov 2025: Paper released on arXiv. Code is coming soon.
Current approaches in multi-task learning (MTL) mainly capture cross-task relations in the 2D image space, often yielding unstructured features that lack 3D-awareness and predictions that are inconsistent across views (see (a) and (c), top). We argue that 3D-awareness is vital for modeling cross-task correlations, and propose to address this problem by integrating cross-view correlations as geometric consistency into the MTL network (see (b) and (c), bottom).
As shown in the figure above, given an image, we feed it and its neighboring view into the MTL encoder to extract MTL features. In parallel, our lightweight cross-view module (CvM) takes both views as input: a spatial-aware encoder produces geometry-biased features, and a multi-view transformer exchanges information across views and outputs cross-view features. A cost volume module then converts the cross-view features into a cost volume by warping the features from one view to the other, given their relative camera parameters, and matching features across views. Finally, the cost volume and cross-view features are concatenated with the MTL features, forming geometric-aware MTL features for predicting multiple dense vision tasks. The module is architecture-agnostic and can be applied to both single-view and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods and improves their performance.
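To make the pipeline concrete, below is a minimal PyTorch sketch of the flow described above. It is not the released implementation: the class name, layer sizes, number of depth hypotheses, and the dot-product cost volume are illustrative assumptions, and `warp_fn` is a placeholder for the actual warping driven by the relative camera parameters.

```python
import torch
import torch.nn as nn


class CrossViewModule(nn.Module):
    """Illustrative cross-view module (CvM): a spatial-aware encoder, a
    multi-view transformer for cross-view exchange, and a dot-product
    cost volume built over a set of depth hypotheses."""

    def __init__(self, in_ch=3, feat_ch=64, num_depths=32, num_heads=4):
        super().__init__()
        # Spatial-aware encoder (downsamples by 4, keeps the spatial layout).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Multi-view transformer: tokens of both views attend to each other.
        layer = nn.TransformerEncoderLayer(feat_ch, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.num_depths = num_depths

    def forward(self, ref_img, src_img, warp_fn):
        """warp_fn(src_feat, d) should warp source-view features into the
        reference view at depth hypothesis d using the relative camera
        parameters (placeholder for the actual warping)."""
        b = ref_img.shape[0]
        f_ref, f_src = self.encoder(ref_img), self.encoder(src_img)  # (B, C, h, w)
        c, fh, fw = f_ref.shape[1:]

        # Concatenate tokens from both views so attention runs across views.
        tokens = torch.cat(
            [f_ref.flatten(2).transpose(1, 2), f_src.flatten(2).transpose(1, 2)], dim=1
        )
        tokens = self.transformer(tokens)                  # (B, 2*h*w, C)
        f_ref, f_src = tokens.split(fh * fw, dim=1)
        f_ref = f_ref.transpose(1, 2).reshape(b, c, fh, fw)
        f_src = f_src.transpose(1, 2).reshape(b, c, fh, fw)

        # Cost volume: correlate reference features with source features
        # warped to each depth hypothesis.
        costs = [(f_ref * warp_fn(f_src, d)).mean(dim=1) for d in range(self.num_depths)]
        cost_volume = torch.stack(costs, dim=1)            # (B, D, h, w)

        # Cross-view features + cost volume, to be fused with MTL features.
        return torch.cat([f_ref, cost_volume], dim=1)


# Fusion with the MTL backbone features (identity warp used as a stand-in).
cvm = CrossViewModule()
ref, src = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
geo_feat = cvm(ref, src, warp_fn=lambda f, d: f)           # (1, 96, 16, 16)
mtl_feat = torch.randn(1, 256, 16, 16)                     # from the MTL encoder
fused = torch.cat([mtl_feat, geo_feat], dim=1)             # geometric-aware MTL feature
```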
For any questions, please contact Wei-Hong Li or Xiaoye Wang.
We would like to thank the authors of DepthSplat, MVSplat, UniMatch, and SAK for releasing their source code.
@article{wang3dawaremtl,
  title={3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding},
  author={Xiaoye Wang and Chen Tang and Xiangyu Yue and Wei-Hong Li},
  year={2025},
  eprint={2511.20646},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2511.20646},
}
