
Revisit SwinTransformer #302

@ContiPaolo

Description

We currently have two implementations of the Swin Transformer:

  1. SwinViTProcessor, inspired by the Aurora implementation (Advanced ViTs #294)
  2. SwinTVProcessor, based on torchvision (add SwinTransformer #296)

At the moment, SwinTVProcessor is underperforming, so we are prioritising experiments with SwinViTProcessor. However, it would be worth revisiting SwinTVProcessor to understand whether the performance gap can be closed.

Potential causes and directions for improvement

  • No skip connections
    Likely the main issue. Dense prediction through a 32× compressed bottleneck without skip connections struggles to preserve spatial detail, which may explain the flat or increasing validation loss. A skip-connection decoder sketch follows this list.

  • Conv-only decoder vs transformer decoder
    This likely compounds the first issue. Even with skip connections, a simple convolutional decoder is weaker than the Swin-based decoder used in swin_vit.py.

  • No gating (zero-init identity)
    In swin_vit.py, the model starts close to identity and gradually increases capacity. In contrast, swin_tv_vit.py uses fully active transformer blocks from the start, which may lead to noisier early training. See the gating sketch below.

  • No noise conditioning in the decoder
    The decoder does not incorporate any conditioning signal; see the conditioning sketch below.
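
For the first two points, here is a minimal sketch of what a skip-connection decoder could look like. It assumes the torchvision Swin encoder is modified to return features from every stage (roughly 1/4 to 1/32 resolution, NCHW) rather than only the bottleneck; the class name SkipConvDecoder, the channel sizes and the fusion layout are illustrative, not the ones used in swin_tv_vit.py:

```python
# Sketch only: hypothetical module, placeholder channel sizes.
# Assumes the encoder returns one feature map per stage, shallow -> deep.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipConvDecoder(nn.Module):
    """Conv decoder that fuses encoder skip connections at each upsampling step."""

    def __init__(self, enc_channels=(96, 192, 384, 768), out_channels=3):
        super().__init__()
        chs = list(reversed(enc_channels))  # deepest stage first
        self.blocks = nn.ModuleList()
        for skip_ch, up_ch in zip(chs[1:], chs[:-1]):
            # After 2x upsampling, concatenate the matching encoder feature
            # and reduce back to that stage's channel count.
            self.blocks.append(nn.Sequential(
                nn.Conv2d(up_ch + skip_ch, skip_ch, 3, padding=1),
                nn.GELU(),
                nn.Conv2d(skip_ch, skip_ch, 3, padding=1),
                nn.GELU(),
            ))
        self.head = nn.Conv2d(chs[-1], out_channels, 1)

    def forward(self, feats):
        # feats: list of encoder outputs, shallow -> deep, in NCHW layout.
        feats = list(reversed(feats))
        x = feats[0]
        for block, skip in zip(self.blocks, feats[1:]):
            x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
            x = block(torch.cat([x, skip], dim=1))
        return self.head(x)


if __name__ == "__main__":
    # Fake multi-scale features for a 224x224 input (1/4 ... 1/32 resolution).
    feats = [torch.randn(2, c, 224 // s, 224 // s)
             for c, s in zip((96, 192, 384, 768), (4, 8, 16, 32))]
    print(SkipConvDecoder()(feats).shape)  # torch.Size([2, 3, 56, 56])
```

Even this conv-only variant gives the decoder access to high-resolution features; replacing the conv fusion blocks with Swin blocks, closer to the decoder in swin_vit.py, would be the natural follow-up for the second point.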

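For the gating point, one common way to get the zero-init identity behaviour is a learnable gate on the residual branch, initialised to zero (ReZero-style). This is a generic sketch of the idea, not the exact mechanism used in swin_vit.py, and GatedResidual is a hypothetical wrapper:

```python
# Sketch only: wraps an arbitrary block so it contributes nothing at step 0.
import torch
import torch.nn as nn


class GatedResidual(nn.Module):
    """Residual branch scaled by a learnable gate that starts at zero."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        # Zero-initialised gate: the wrapped block is inactive at initialisation
        # and its influence grows as the gate is learned.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.gate * self.block(x)


if __name__ == "__main__":
    block = nn.TransformerEncoderLayer(d_model=96, nhead=4, batch_first=True)
    gated = GatedResidual(block)
    x = torch.randn(2, 49, 96)           # (batch, tokens, dim)
    assert torch.allclose(gated(x), x)   # exact identity at initialisation
```
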
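For the conditioning point, one option is FiLM-style modulation of decoder features by a noise-level (or timestep) embedding. Again only a sketch; ConditionedDecoderBlock, the channel counts and the embedding dimension are illustrative:

```python
# Sketch only: decoder block modulated by a conditioning vector (scale + shift).
import torch
import torch.nn as nn


class ConditionedDecoderBlock(nn.Module):
    """Conv decoder block whose normalised features are modulated by a conditioning vector."""

    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Projects the conditioning embedding to a per-channel scale and shift.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return x + self.conv(torch.relu(h))


if __name__ == "__main__":
    block = ConditionedDecoderBlock(channels=64, cond_dim=128)
    x = torch.randn(2, 64, 28, 28)    # decoder feature map
    cond = torch.randn(2, 128)        # e.g. noise-level / timestep embedding
    print(block(x, cond).shape)       # torch.Size([2, 64, 28, 28])
```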