We currently have two implementations of the Swin Transformer:

- SwinViTProcessor, inspired by the Aurora implementation (Advanced ViTs #294)
- SwinTVProcessor, based on torchvision (add SwinTransformer #296)

At the moment, SwinTVProcessor is underperforming, so we are prioritising experiments with SwinViTProcessor. However, it would be worth revisiting to better understand whether the gap can be reduced.

Potential causes and directions for improvement
No skip connections
Likely the main issue. Dense prediction through a 32× compressed bottleneck without skip connections struggles to preserve spatial detail, which may explain the flat or increasing validation loss.
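To illustrate why skips matter, here is a minimal U-Net-style decoder stage (hypothetical names, not the actual swin_tv_vit.py code): the encoder feature map is concatenated with the upsampled bottleneck, so fine spatial detail bypasses the compression instead of being squeezed through it.

```python
import torch
from torch import nn


class SkipDecoderStage(nn.Module):
    """One U-Net-style decoder stage: upsample, then fuse an encoder skip.

    Hypothetical sketch -- not the actual swin_tv_vit.py decoder.
    """

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        # After concatenation the channel count is out_ch + skip_ch.
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                   # 2x spatial upsampling
        x = torch.cat([x, skip], dim=1)  # reinject high-resolution detail
        return self.fuse(x)


stage = SkipDecoderStage(in_ch=64, skip_ch=32, out_ch=32)
bottleneck = torch.randn(1, 64, 8, 8)  # coarse, compressed features
skip = torch.randn(1, 32, 16, 16)      # matching encoder feature map
out = stage(bottleneck, skip)          # shape (1, 32, 16, 16)
```

Without the `skip` input, the `fuse` convolution can only work from whatever detail survived the bottleneck, which is the failure mode suspected here.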
Conv-only decoder vs transformer decoder
This likely compounds the first issue. Even with skip connections, a simple convolutional decoder is weaker than the Swin-based decoder used in swin_vit.py.

No gating (zero-init identity)
In swin_vit.py, the model starts close to identity and gradually increases capacity. In contrast, swin_tv_vit.py uses fully active transformer blocks from the start, which may lead to noisier early training.

No noise conditioning in the decoder
The decoder does not incorporate any noise-conditioning signal.
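One common way to add this, sketched below under the assumption that a noise-level embedding is available, is a FiLM-style scale-and-shift: the embedding is projected to per-channel modulation parameters for each decoder block. Zero-initialising the projection also gives the gated, start-near-identity behaviour noted above. All names here are hypothetical, not existing swin_tv_vit.py code.

```python
import torch
from torch import nn


class ConditionedDecoderBlock(nn.Module):
    """Decoder block with FiLM-style noise conditioning (hypothetical sketch).

    A noise-level embedding is projected to a per-channel scale and shift,
    letting the decoder adapt its features to the current noise level.
    """

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Zero-init so the block starts as a plain convolution
        # (scale = 0, shift = 0), echoing the gating in swin_vit.py.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.conv(x)
        # Broadcast (batch, channels) over the spatial dimensions.
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]


block = ConditionedDecoderBlock(channels=32, cond_dim=16)
x = torch.randn(2, 32, 8, 8)
cond = torch.randn(2, 16)  # e.g. an embedding of the noise level
y = block(x, cond)         # shape (2, 32, 8, 8)
```

Because of the zero initialisation, the block's output at step zero is identical to an unconditioned convolution, so adding the conditioning path should not destabilise early training.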