
@danielclough
Contributor

Swin v1

This PR adds support for Swin Transformer, a hierarchical vision transformer that uses shifted windows for efficient self-attention computation.

Implementation Details

Model (candle-transformers/src/models/swin.rs):

  • Full Swin Transformer v1 architecture with hierarchical feature maps
  • Window-based multi-head self-attention with relative position bias
  • Shifted window mechanism for cross-window connections
  • Patch merging for downsampling between stages
  • Supports HuggingFace pretrained weights format (separate Q/K/V projections)
  • Linear computational complexity with respect to image size
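The window partitioning and shifted-window mechanism listed above can be sketched with plain index arithmetic. This is a standalone illustration with hypothetical function names, not the PR's actual code, which operates on candle tensors:

```rust
/// Split an H x W grid of row-major token indices into non-overlapping
/// `win` x `win` windows (H and W are assumed divisible by `win`).
/// Attention is computed within each window, which is what gives Swin
/// its linear complexity in the number of tokens.
fn window_partition(h: usize, w: usize, win: usize) -> Vec<Vec<usize>> {
    let mut windows = Vec::new();
    for wy in (0..h).step_by(win) {
        for wx in (0..w).step_by(win) {
            let mut window = Vec::with_capacity(win * win);
            for dy in 0..win {
                for dx in 0..win {
                    window.push((wy + dy) * w + (wx + dx));
                }
            }
            windows.push(window);
        }
    }
    windows
}

/// Cyclic shift by `s` in both dimensions (the "shifted window" trick):
/// the token at (y, x) moves to ((y + h - s) % h, (x + w - s) % w), so
/// the next round of window partitioning mixes tokens across the
/// previous windows' boundaries.
fn cyclic_shift(h: usize, w: usize, s: usize) -> Vec<usize> {
    let mut out = vec![0; h * w];
    for y in 0..h {
        for x in 0..w {
            let ny = (y + h - s) % h;
            let nx = (x + w - s) % w;
            out[ny * w + nx] = y * w + x;
        }
    }
    out
}

fn main() {
    // A 4x4 grid with 2x2 windows yields four windows.
    let wins = window_partition(4, 4, 2);
    assert_eq!(wins.len(), 4);
    assert_eq!(wins[0], vec![0, 1, 4, 5]);
    // After a shift of 1, the top-left token wraps to the bottom-right.
    let shifted = cyclic_shift(4, 4, 1);
    assert_eq!(shifted[3 * 4 + 3], 0);
    println!("ok");
}
```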

Example (candle-examples/examples/swin/):

  • Command-line interface for ImageNet classification
  • Support for all official model variants:
    • Swin-Tiny (28M params, 224×224)
    • Swin-Small (50M params, 224×224)
    • Swin-Base (88M params, 224×224 and 384×384)
    • Swin-Large (197M params, 224×224 and 384×384)
  • Automatic weight downloading from HuggingFace Hub
  • CPU and GPU support
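The variant list above corresponds to Microsoft's published checkpoint naming scheme on the Hub. A sketch of how a `--which` flag might resolve a variant to a Hub repo id follows; the enum and method names are assumptions for illustration, not the PR's actual CLI code:

```rust
/// Hypothetical model-variant selector; mirrors the variants listed in
/// this PR but is not the example's actual `--which` implementation.
#[derive(Debug, Clone, Copy)]
enum Which {
    Tiny,     // 28M params, 224x224
    Small,    // 50M params, 224x224
    Base,     // 88M params, 224x224
    Base384,  // 88M params, 384x384
    Large,    // 197M params, 224x224
    Large384, // 197M params, 384x384
}

impl Which {
    /// Map a variant to its Hub repo, following Microsoft's naming
    /// scheme (patch size 4; window 7 at 224px, window 12 at 384px).
    fn repo(self) -> &'static str {
        match self {
            Which::Tiny => "microsoft/swin-tiny-patch4-window7-224",
            Which::Small => "microsoft/swin-small-patch4-window7-224",
            Which::Base => "microsoft/swin-base-patch4-window7-224",
            Which::Base384 => "microsoft/swin-base-patch4-window12-384",
            Which::Large => "microsoft/swin-large-patch4-window7-224",
            Which::Large384 => "microsoft/swin-large-patch4-window12-384",
        }
    }
}

fn main() {
    assert_eq!(Which::Tiny.repo(), "microsoft/swin-tiny-patch4-window7-224");
    println!("{}", Which::Large384.repo());
}
```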

Usage

# Run with default Swin-Tiny model
cargo run --example swin --release -- --image strawberry.jpg

# Run with larger model variant
cargo run --example swin --release -- --image photo.jpg --which large-large

Testing

Tested with all model variants using weights from microsoft/swin-*-patch4-window* on the HuggingFace Hub. The implementation matches HuggingFace's weight format, so official pretrained checkpoints load without conversion.

Notes

  • Swin v2 support marked as TODO (not yet implemented)
  • Includes comprehensive documentation and README with usage examples

