
Revisit SwinTransformer #302

@ContiPaolo

Description

We currently have two implementations of the Swin Transformer:

  1. SwinViTProcessor, inspired by the Aurora implementation (Advanced ViTs #294)
  2. SwinTVProcessor, based on torchvision (add SwinTransformer #296)

At the moment, SwinTVProcessor is underperforming, so we are prioritising experiments with SwinViTProcessor. However, it would be worth revisiting SwinTVProcessor to understand whether the performance gap can be closed.

Potential causes and directions for improvement

  • No skip connections
    Likely the main issue. Dense prediction through a 32× compressed bottleneck without skip connections struggles to preserve spatial detail, which may explain the flat or increasing validation loss. A skip-connection decoder sketch follows this list.

  • Conv-only decoder vs transformer decoder
    This likely compounds the first issue. Even with skip connections, a simple convolutional decoder is weaker than the Swin-based decoder used in swin_vit.py.

  • No gating (zero-init identity)
    In swin_vit.py, the model starts close to identity and gradually increases capacity. In contrast, swin_tv_vit.py uses fully active transformer blocks from the start, which may lead to noisier early training. See the gating sketch below.

  • No noise conditioning in the decoder
    The decoder does not incorporate any conditioning signal; see the conditioning sketch below.
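
For the first two points, here is a minimal sketch of what a skip-connection decoder could look like. It assumes the torchvision Swin encoder is modified to return features from every stage (roughly 1/4 to 1/32 resolution, NCHW) rather than only the bottleneck; the class name SkipConvDecoder, the channel sizes and the fusion layout are illustrative, not the ones used in swin_tv_vit.py:

```python
# Sketch only: hypothetical module, placeholder channel sizes.
# Assumes the encoder returns one feature map per stage, shallow -> deep.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipConvDecoder(nn.Module):
    """Conv decoder that fuses encoder skip connections at each upsampling step."""

    def __init__(self, enc_channels=(96, 192, 384, 768), out_channels=3):
        super().__init__()
        chs = list(reversed(enc_channels))  # deepest stage first
        self.blocks = nn.ModuleList()
        for skip_ch, up_ch in zip(chs[1:], chs[:-1]):
            # After 2x upsampling, concatenate the matching encoder feature
            # and reduce back to that stage's channel count.
            self.blocks.append(nn.Sequential(
                nn.Conv2d(up_ch + skip_ch, skip_ch, 3, padding=1),
                nn.GELU(),
                nn.Conv2d(skip_ch, skip_ch, 3, padding=1),
                nn.GELU(),
            ))
        self.head = nn.Conv2d(chs[-1], out_channels, 1)

    def forward(self, feats):
        # feats: list of encoder outputs, shallow -> deep, in NCHW layout.
        feats = list(reversed(feats))
        x = feats[0]
        for block, skip in zip(self.blocks, feats[1:]):
            x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
            x = block(torch.cat([x, skip], dim=1))
        return self.head(x)


if __name__ == "__main__":
    # Fake multi-scale features for a 224x224 input (1/4 ... 1/32 resolution).
    feats = [torch.randn(2, c, 224 // s, 224 // s)
             for c, s in zip((96, 192, 384, 768), (4, 8, 16, 32))]
    print(SkipConvDecoder()(feats).shape)  # torch.Size([2, 3, 56, 56])
```

Even this conv-only variant gives the decoder access to high-resolution features; replacing the conv fusion blocks with Swin blocks, closer to the decoder in swin_vit.py, would be the natural follow-up for the second point.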

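For the gating point, one common way to get the zero-init identity behaviour is a learnable gate on the residual branch, initialised to zero (ReZero-style). This is a generic sketch of the idea, not the exact mechanism used in swin_vit.py, and GatedResidual is a hypothetical wrapper:

```python
# Sketch only: wraps an arbitrary block so it contributes nothing at step 0.
import torch
import torch.nn as nn


class GatedResidual(nn.Module):
    """Residual branch scaled by a learnable gate that starts at zero."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        # Zero-initialised gate: the wrapped block is inactive at initialisation
        # and its influence grows as the gate is learned.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.gate * self.block(x)


if __name__ == "__main__":
    block = nn.TransformerEncoderLayer(d_model=96, nhead=4, batch_first=True)
    gated = GatedResidual(block)
    x = torch.randn(2, 49, 96)           # (batch, tokens, dim)
    assert torch.allclose(gated(x), x)   # exact identity at initialisation
```
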
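For the conditioning point, one option is FiLM-style modulation of decoder features by a noise-level (or timestep) embedding. Again only a sketch; ConditionedDecoderBlock, the channel counts and the embedding dimension are illustrative:

```python
# Sketch only: decoder block modulated by a conditioning vector (scale + shift).
import torch
import torch.nn as nn


class ConditionedDecoderBlock(nn.Module):
    """Conv decoder block whose normalised features are modulated by a conditioning vector."""

    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Projects the conditioning embedding to a per-channel scale and shift.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return x + self.conv(torch.relu(h))


if __name__ == "__main__":
    block = ConditionedDecoderBlock(channels=64, cond_dim=128)
    x = torch.randn(2, 64, 28, 28)    # decoder feature map
    cond = torch.randn(2, 128)        # e.g. noise-level / timestep embedding
    print(block(x, cond).shape)       # torch.Size([2, 64, 28, 28])
```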