
[RFC] U-Net framework #6610

Open
@TeodorPoncu

Description


🚀 The feature

A module-based approach to building U-Nets inside torchvision, similar to torchmultimodal's sub-network approach. This is mostly a food-for-thought experiment, given the similar nature of other popular vision architectural frameworks (namely DETR, Deformable-DETR, Mask2Former, or Mask DINO, which are just compositions of sub-networks that can be adapted to most vision tasks). See the sketch below for what this could look like.
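To make the idea concrete, here is a minimal sketch of a block-based U-Net. Everything in it (the `UNet` and `ConvBlock` names, the constructor arguments) is hypothetical, not an existing torchvision API; the point is that the scaffold owns the U shape and the skip plumbing, while users only supply the block:

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """The classic two-convolution U-Net stage; any nn.Module with the same
    (in_channels, out_channels) constructor could be swapped in."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.layers(x)


class UNet(nn.Module):
    """Hypothetical scaffold that composes user-supplied blocks into the U shape."""

    def __init__(self, channels=(64, 128, 256), block=ConvBlock):
        super().__init__()
        self.stem = block(3, channels[0])
        self.pool = nn.MaxPool2d(2)
        self.downs = nn.ModuleList(
            block(c_in, c_out) for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(c_out, c_in, 2, stride=2)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        # After the "cat" merge, each decoder block sees skip + upsampled features.
        self.dec_blocks = nn.ModuleList(block(2 * c, c) for c in channels[:-1])

    def forward(self, x):
        x = self.stem(x)
        skips = [x]
        for down in self.downs:
            x = down(self.pool(x))
            skips.append(x)
        skips.pop()  # the deepest features are the current x, not a skip
        # Walk back up, merging with the matching encoder level.
        for up, dec, skip in zip(
            reversed(self.ups), reversed(self.dec_blocks), reversed(skips)
        ):
            x = dec(torch.cat([up(x), skip], dim=1))
        return x


unet = UNet()
print(unet(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 64, 64, 64])
```

Swapping `block` for a residual or attention block is then a one-argument change rather than a re-implementation.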

Motivation, pitch

U-Nets are a good example given the rising popularity of Diffusion Models, in which the U-Net paradigm is widely used (the layers or merge strategies being the main difference between most implementations).

Unlike DETR or Mask2Former, which can be broken down quite simply into a module with 2-4 sub-modules followed by a task head, the U-Net framework presents some more intricate challenges at the configuration-specification level, given that we have to sync encoder and decoder configurations across levels (illustrated below).
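A rough sketch of how a configuration object could keep the two sides in sync by deriving the decoder widths from the encoder ones, instead of asking users to specify both and get them wrong (the `UNetConfig` name and fields are again hypothetical):

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class UNetConfig:
    # Encoder widths, shallow to deep; the decoder side is derived rather
    # than specified, so the two can never drift out of sync.
    encoder_channels: Sequence[int] = (64, 128, 256, 512)
    merge: str = "cat"  # "cat" concatenates skips, "add" sums them

    def decoder_in_channels(self) -> Tuple[int, ...]:
        # Decoder level i consumes the upsampled features plus the skip from
        # the matching encoder level, so its input width depends on the merge.
        factor = 2 if self.merge == "cat" else 1
        return tuple(factor * c for c in reversed(self.encoder_channels[:-1]))


print(UNetConfig(merge="cat").decoder_in_channels())  # (512, 256, 128)
print(UNetConfig(merge="add").decoder_in_channels())  # (256, 128, 64)
```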

Shifting towards a more framework/block-based approach for larger architectures (think of an nn.Module-like experience, but for present-day vision architectures) would benefit users when it comes to sharing code, contributing improvements, or simply swapping out backbones and other components. For instance, if someone wanted to grab a Mask2Former today, they would have to go and integrate with Detectron2 themselves.

Similarly, if someone wanted to jump into doing diffusion, they would first have to find or write their own U-Net implementation, even though what they most likely want to do is add a bottleneck with attention, a residual connection somewhere in the network, or simply a different normalization layer compared to the original paper.

These classic paradigms should be easy to configure or specify (the same way torch.nn.Transformer handles the transformer; see the example below), and if more invasive changes are wanted, a user should have access to a code-base which they can copy-paste and apply minimal modifications to (similar to how DETR handles positional embeddings in the decoder).
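For reference, this is the kind of constructor-level configurability torch.nn.Transformer already offers today, and which a U-Net entry point could mirror:

```python
import torch
from torch import nn

# The whole transformer is specified via constructor arguments, including
# swappable pieces such as the activation; no subclassing needed.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    activation="gelu",
)
src = torch.rand(10, 32, 512)  # (seq, batch, d_model)
tgt = torch.rand(20, 32, 512)
print(model(src, tgt).shape)  # torch.Size([20, 32, 512])
```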

Even if they opt for modifying the code-base and later share their work, there is the bonus of familiarity when others want to build on top of their code, since it is not entirely different from the base version they already know.

Supporting some of these architectures or frameworks might attract users working on tasks that torchvision currently does not support (for instance, the Monocular Depth Estimation SotA makes use of Mask2Former), which could provide us with valuable insight into the needs of the larger vision community.

Alternatives

Currently, Lucidrains has been leading these kinds of efforts for attention operations and Transformers, and more recently for diffusion models.

Additional context

No response

cc @datumbox
