Description
Describe the feature
Add the model described in "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", a new vision transformer backbone design for semantic segmentation. It uses a multi-branch high-resolution (HR) architecture with enhanced multi-scale representability. According to the paper, it surpasses the state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction on ADE20K and Cityscapes.
Motivation
This is a recent model that combines the features of HRNet and ViT, achieving good performance while reducing parameters and FLOPs.
Related resources
Official code can be found here.
Additional context
Their implementation already uses mmseg and mmcv, so it should be quite straightforward to add support for it.
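
To illustrate what support could look like from the user's side, here is a rough sketch of an mmseg-style config fragment. The backbone name `HRViT` and the `arch` option are assumptions made for illustration, not names confirmed by the official code. `EncoderDecoder` is the existing mmseg segmentor. The decode head is left out because its settings depend on the backbone's output channels.

```python
# Hypothetical mmseg-style config fragment (plain Python dicts, no imports needed).
# 'HRViT' and 'arch' are assumed names for illustration only; the real registry
# name and constructor arguments would come from the official implementation.
model = dict(
    type='EncoderDecoder',      # existing mmseg segmentor wrapper
    backbone=dict(
        type='HRViT',           # assumed registry name for the new backbone
        arch='b1',              # assumed option selecting a model variant
    ),
    # decode_head omitted: its in_channels must match the backbone outputs
)
```

If the backbone were registered under such a name, existing training and testing scripts should be able to build it from the config like any other backbone.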