GitHub - kshitijrajsharma/swin-mask2former-drone: Building segmentation on high resolution images using SWIN VIT 2 and mask2former

Overview

This document outlines an experimental architecture exploration for high-resolution imagery, mainly for building segmentation. I started from the RAMP (Replicable AI for Microplanning) 2020 Efficient-U-Net model. The original model represents excellent work by the RAMP team and has proven effective across diverse contexts. However, I've identified opportunities to enhance performance specifically in dense urban settlements by adopting recent advances in instance segmentation.

See Model Architecture Detail here

Motivation

The current U-Net architecture encounters three specific technical challenges in dense informal settlements:

The "Blob" Effect: Touching buildings merge into single semantic blobs due to pixel-connectivity rather than instance-level reasoning.
Overfitting on Limited Data: Fine-tuning on small datasets (500–1000 chips) leads to poor generalization.
Irregular Geometries: Predictions lack sharp corners, producing shapes unsuitable for GIS vectorization.
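
The blob effect can be reproduced on a toy example. The sketch below is pure Python and uses a made-up 3×6 grid, not project data. It assumes instances are recovered from a semantic mask by 4-connected component labelling, which is a common post-processing step but not necessarily the one RAMP uses. Two buildings that share a wall come out as one component.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground components in a binary grid (list of lists)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                # Breadth-first flood fill over the four direct neighbours.
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Toy semantic mask: two buildings that share a wall (no background between them).
touching = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
# The same scene with a one-pixel gap between the buildings.
separated = [
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

print(count_components(touching))   # 1
print(count_components(separated))  # 2
```

A query-based head avoids this step because each query predicts its own mask, so two adjacent buildings can be separate outputs even when their pixels touch.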

This experiment proposes a composable architecture combining modern feature extraction with instance-based prediction.

graph TD
    A["Input: RGB Aerial Image 256x256 (TorchGeo)"] --> B["Backbone: Swin Transformer (Base)"]
    B -->|"Features at 1/4, 1/8, 1/16, 1/32 scale"| C["Pixel Decoder: Multi-Scale Deformable Attention"]
    C --> D["Transformer Decoder: Mask2Former"]
    D -->|"Query 1"| E["Mask 1: Building A"]
    D -->|"Query 2"| F["Mask 2: Building B"]
    D -->|"Query N"| G["Mask N: Building C"]

    style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style D fill:#fff9c4,stroke:#fbc02d,stroke-width:2px
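
As a quick check on the scales in the diagram, the sketch below computes the side length of each backbone feature map for a 256×256 chip. The helper name `feature_map_sizes` is hypothetical, and the calculation assumes square inputs whose side is divisible by 32.

```python
def feature_map_sizes(image_size=256, strides=(4, 8, 16, 32)):
    """Return {stride: side length} of each backbone feature map for a square input."""
    return {stride: image_size // stride for stride in strides}

sizes = feature_map_sizes()
for stride, side in sizes.items():
    print(f"1/{stride} scale: {side}x{side}")
```

For 256×256 chips the coarsest map is only 8×8 cells, which is one reason the finer 1/4 and 1/8 maps matter for small buildings.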

Architecture Design

Component Selection

Component	Selection	Justification
Backbone	Swin Transformer (Base)	Hierarchical vision transformer with shifted-window self-attention, applied here to 256×256 input chips. Intended to capture texture and geometry better than the CNN encoder it replaces. Pretrained on ImageNet-22K.
Head	Mask2Former	Uses 100 learnable queries to predict instance masks. Treats each building as a separate object, which directly targets the blob problem.
Adapter	LoRA (Rank = 16)	Planned for Stage 2 fine-tuning. Enables geographic adaptation without catastrophic forgetting.
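
To give a sense of what a LoRA rank means in trainable parameters, here is a small worked calculation. It assumes the standard LoRA formulation, where a frozen d_out × d_in weight gets two trainable factors B (d_out × r) and A (r × d_in). The layer width of 1024 is an illustrative assumption, not a value read from this repository.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

d = 1024  # assumed layer width, for illustration only
frozen = d * d
for rank in (8, 16, 32):
    added = lora_param_count(d, d, rank)
    print(f"rank={rank}: {added} trainable vs {frozen} frozen ({added / frozen:.2%})")
```

At rank 16 the adapter adds 32,768 parameters to a layer with 1,048,576 frozen ones, so only a few percent of that layer is updated during adaptation.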

Head reference: Masked-attention Mask Transformer for Universal Image Segmentation (Cheng et al., 2022).

Training Strategy: Two-Stage Curriculum

We use curriculum learning to leverage large-scale RAMP data while preventing overfitting on target sites.

Stage 1: Foundation Training (Current Implementation)

Goal: Teach semantic distinction between roof and ground in informal settlements.

Data: Currently using the Banepa dataset (train/val/test splits)
Method: Full fine-tuning (Backbone + Head)
Initialization:
- Backbone: ImageNet-22K pretrained weights (Swin Transformer Base)
- Head: COCO pretrained weights (facebook/mask2former-swin-base-IN21k-coco-instance)
- Queries: 100 learnable queries (the count is fixed; initial values come from COCO pretraining)
Hyperparameters: Fixed (learning_rate=1e-5, weight_decay=1e-4, batch_size=8, epochs=10)
Outcome: Binary segmentation model with instance-aware predictions
Training: 256x256 pixel chips with boundary-weighted Dice loss

Stage 2: Site-Specific Adaptation (Planned)

Goal: Adapt to local soil colors, lighting, and building styles for production deployment.

Data: 1,000 project-specific chips (800 train / 200 validation) - target for production
Method: Frozen backbone + LoRA adapters + trainable head (not yet implemented)
Outcome: Site-specific model with strong generalization
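
Since Stage 2 is not implemented yet, the following is only a sketch of the LoRA mechanism it plans to use, written in NumPy so it runs without a deep learning framework. The class name `LoRALinear`, the `alpha` scaling value, and the initialization scales are assumptions taken from the usual LoRA recipe, not from this project's code.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 64))      # stand-in for one pretrained projection
layer = LoRALinear(W, rank=16)
x = rng.normal(size=(4, 64))
out = layer(x)

# With B initialised to zero, the adapted layer reproduces the frozen layer exactly.
print(np.allclose(out, x @ W.T))
```

Because B starts at zero, adaptation begins from the Stage 1 model's behaviour, and only A and B (plus the head) would receive gradient updates.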

Loss Function

To address irregular building shapes, we extend standard Mask2Former loss with boundary constraints.

Total Loss

$$ \mathcal{L}_{total} = \mathcal{L}_{base} + \alpha \cdot \mathcal{L}_{BoundaryDice} $$

Where $\mathcal{L}_{base}$ is the Mask2Former base loss (class loss + dice loss + mask loss) computed internally, and $\alpha$ is the boundary loss weight.

Loss functions

Cross-Entropy (L_CE): Binary classification (building vs background)
Dice Loss (L_Dice): Optimizes spatial overlap
Mask Loss: In the Mask2Former paper, the combination of binary cross-entropy and Dice loss on the predicted masks. Mask2Former uses binary cross-entropy where MaskFormer used focal loss. Source: https://davidhuangal.bearblog.dev/mask2former/
Boundary Dice Loss (L_Boundary): Boundary-weighted Dice loss that penalizes edge mismatch using 10x weight for boundary pixels, enforcing sharp corners for irregular geometries (implemented in training_step)
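
One way to read the boundary-weighted Dice loss is sketched below in NumPy. This is not the code in training_step. It assumes the boundary is the set of ground-truth building pixels with at least one 4-neighbour in the background, that those pixels get weight 10 while all others get weight 1, and that `probs` holds per-pixel foreground probabilities. The function names are hypothetical.

```python
import numpy as np

def boundary_map(target):
    """Mark foreground pixels that have at least one 4-neighbour in the background."""
    t = target.astype(bool)
    # Pad with background so foreground pixels on the image edge count as boundary.
    padded = np.pad(t, 1, mode="constant", constant_values=False)
    interior = (
        padded[1:-1, 1:-1]
        & padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return t & ~interior

def boundary_weighted_dice_loss(probs, target, boundary_weight=10.0, eps=1e-6):
    """1 - weighted Dice, with boundary pixels weighted `boundary_weight` and others 1."""
    target = target.astype(np.float64)
    weights = np.where(boundary_map(target), boundary_weight, 1.0)
    intersection = np.sum(weights * probs * target)
    denominator = np.sum(weights * probs) + np.sum(weights * target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

# Toy 8x8 chip with one 4x4 building.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0

# A prediction that finds the building core but misses its entire outline.
eroded = np.zeros((8, 8))
eroded[3:5, 3:5] = 1.0

weighted = boundary_weighted_dice_loss(eroded, target)
plain = boundary_weighted_dice_loss(eroded, target, boundary_weight=1.0)
print(f"boundary-weighted: {weighted:.4f}, plain Dice: {plain:.4f}")
```

On this toy case the weighted loss is higher than plain Dice for the same prediction, which is the intended effect: a mask that gets the core right but rounds off the outline is penalized more.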

Implementation Details

Hardware & Setup

GPU: Not yet determined
Data Loading: TorchGeo streaming GeoTIFFs with RandomGeoSampler (256 pixel size)
Bridge: Custom collate_fn converting TorchGeo tensors to Mask2Former format
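
The bridge step could look roughly like the sketch below. It is not the repository's collate_fn. It assumes each sample is a TorchGeo-style dict with "image" (C, H, W) and "mask" (H, W) entries, that the mask is an instance-ID raster with 0 as background, and that building is class 0. It uses NumPy arrays so it runs standalone; the same logic carries over to torch tensors. The output keys follow the `pixel_values`, `mask_labels`, and `class_labels` inputs of the Hugging Face Mask2Former model.

```python
import numpy as np

def split_instances(instance_ids, building_class=0):
    """Turn an (H, W) instance-ID raster into (N, H, W) binary masks and (N,) class labels."""
    ids = np.unique(instance_ids)
    ids = ids[ids != 0]  # 0 is assumed to be background
    masks = np.zeros((len(ids),) + instance_ids.shape, dtype=np.float32)
    for k, instance_id in enumerate(ids):
        masks[k] = instance_ids == instance_id
    classes = np.full(len(ids), building_class, dtype=np.int64)
    return masks, classes

def collate_fn(samples):
    """Batch TorchGeo-style sample dicts into the inputs Mask2Former expects."""
    pixel_values = np.stack([s["image"] for s in samples]).astype(np.float32)
    converted = [split_instances(s["mask"]) for s in samples]
    return {
        "pixel_values": pixel_values,                # (B, C, H, W)
        "mask_labels": [m for m, _ in converted],    # list of (N_i, H, W), one per image
        "class_labels": [c for _, c in converted],   # list of (N_i,), one per image
    }

# Two toy samples: one chip with two buildings, one empty chip.
image = np.zeros((3, 8, 8))
raster = np.zeros((8, 8), dtype=np.int64)
raster[1:3, 1:3] = 1
raster[4:7, 4:7] = 2
empty = np.zeros((8, 8), dtype=np.int64)

batch = collate_fn([{"image": image, "mask": raster}, {"image": image, "mask": empty}])
print(batch["pixel_values"].shape, [m.shape for m in batch["mask_labels"]])
```

The masks stay in a list rather than a stacked array because the number of buildings differs per chip. If the training labels are binary semantic masks rather than instance rasters, an extra step to derive instance IDs would be needed first.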
Batch Size: 8 (default in config)

Estimated Timeline

Activity	Duration	Notes
Code Implementation	~x hours	collate_fn and model definition

Current Hyperparameters (Stage 1)

Currently implementing Stage 1 with fixed hyperparameters (Stage 2 with LoRA and Optuna tuning is planned).

Optimizer Configuration

Using AdamW (Adam with decoupled weight decay): https://arxiv.org/pdf/1711.05101

Learning Rate: 1e-5 (0.00001) - conservative for fine-tuning pretrained model
Weight Decay: 1e-4 (0.0001) - decoupled weight decay applied directly to the weights (AdamW), rather than an L2 penalty added to the loss
Scheduler: CosineAnnealingLR over 10 epochs
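
For reference, the learning rate this schedule produces can be computed with the closed-form cosine annealing formula. The sketch assumes the scheduler is stepped once per epoch with T_max = 10 and a minimum learning rate of 0, which is PyTorch's default for CosineAnnealingLR; neither detail is confirmed by the configuration listed here.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-5, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr at epoch 0 to eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_annealing_lr(epoch) for epoch in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

The rate starts at 1e-5, passes through half that value at epoch 5, and reaches the minimum at epoch 10.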

Loss Weights

Class Weight: 5.0
Dice Weight: 5.0
Mask Weight: 5.0
Boundary Loss Weight: 2.0
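
The sketch below shows how these four weights might combine into the total loss. It assumes the base loss is a plain weighted sum of the class, Dice, and mask terms, with the boundary term scaled by the boundary loss weight (α in the total-loss formula). The names `LossWeights` and `total_loss` are hypothetical, and the component values in the example are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    class_weight: float = 5.0
    dice_weight: float = 5.0
    mask_weight: float = 5.0
    boundary_weight: float = 2.0  # alpha in the total-loss formula

def total_loss(l_class, l_dice, l_mask, l_boundary, weights=LossWeights()):
    """Weighted base loss plus alpha times the boundary Dice loss."""
    base = (
        weights.class_weight * l_class
        + weights.dice_weight * l_dice
        + weights.mask_weight * l_mask
    )
    return base + weights.boundary_weight * l_boundary

# Made-up component values, for illustration only.
loss = total_loss(l_class=0.2, l_dice=0.3, l_mask=0.1, l_boundary=0.4)
print(f"total loss: {loss:.3f}")
```

Note that the boundary loss weight of 2.0 scales the whole boundary term, while the 10x factor mentioned for boundary pixels acts per pixel inside that term.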

Future Tuning Plans (Stage 2)

Stage 2 will involve hyperparameter tuning with Optuna, evaluated on the validation set:

LoRA adapter configuration (rank 8/16/32)
Learning rate adjustment
Boundary loss weight optimization

References

Liu et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
https://arxiv.org/abs/2103.14030
Cheng et al. (2022). Masked-attention Mask Transformer for Universal Image Segmentation. CVPR 2022.
https://arxiv.org/abs/2112.01527
Kervadec et al. (2019). Boundary Loss for Highly Unbalanced Segmentation. MIDL 2019.
https://arxiv.org/abs/1812.07032
RAMP data download: https://source.coop/ramp/ramp
RAMP docs: https://rampml.global/training-data/
