Auto-Labeler for Aerial Imagery
Matthew Grigsby
Senior Data Scientist, IBM
[email protected]
Background:
Semantic segmentation of buildings belonging to distinct classes in a high-resolution aerial image (~6,500 x ~12,500 pixels at 0.6 m per pixel). This is an important task for many GIS applications, such as urban planning, disaster response, and environmental monitoring. The objective is to create a model that can label new images with minimal human intervention.
Goal of this repository:
Build an initial model to demonstrate that a single neural network using a modern segmentation architecture can complete this task. The resulting model should segment the image with reasonable accuracy and speed, and should generalize to new images.
Future goals:
Incorporate the model into a semi-supervised labeling system with minimal human-in-the-loop interaction as needed.
My approach to the challenge of labeling aerial imagery was to use state-of-the-art models developed for semantic segmentation tasks. The first step was cleaning up and finishing the annotations we were given (found here). Once I was satisfied with the quality of the resulting annotated image (the "ground truth"), I chopped it into overlapping tiles small enough for the model to ingest, since the original image was too large to process all at once given software/hardware/time constraints. Tiling the main image into overlapping pieces also produced a set of images large enough to ease sample-size concerns when training a first-draft model to prove out the efficacy of this approach.
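The tiling step can be sketched as follows. This is a minimal illustration, not the repository's actual tiling code; the tile size and stride values are examples (the final model used 1024 x 1024 tiles).

```python
def tile_origins(length, tile, stride):
    """Top-left coordinates of overlapping tiles along one image axis.

    The last tile is shifted back so it ends exactly at the image edge,
    guaranteeing full coverage without stepping out of bounds.
    """
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins

# Example: a 12500-pixel axis cut into 1024-px tiles with 50% overlap
ys = tile_origins(12500, tile=1024, stride=512)
```

Pairing the origins from both axes yields the full grid of overlapping tile windows to crop from the image and its mask.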
With the tiles created from the original image and mask, I removed any masks, and their corresponding images, that contained only the background class (class 0 in grayscale). Note: a valid alternate structure would be to omit the background class entirely (e.g. 0 would be the first building segment type). Once this was done, the tiles were ready for training and were moved into directories appropriate for my data loaders (TensorFlow's ImageDataGenerator.flow_from_directory method is particular about directory structure).
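The background-only filter amounts to dropping any tile whose mask never leaves class 0. A minimal sketch (the filenames and pixel lists below are hypothetical stand-ins for real tile files):

```python
def is_background_only(mask_pixels):
    """True if every pixel in the grayscale mask is the background class (0)."""
    return all(p == 0 for p in mask_pixels)

# Keep only (image, mask) tile pairs containing at least one building pixel
tiles = [("t0.png", [0, 0, 0]), ("t1.png", [0, 2, 0]), ("t2.png", [1, 1, 3])]
kept = [(img, mask) for img, mask in tiles if not is_background_only(mask)]
```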
I then built out the necessary classes for data loaders, the model, and the specifications for running experiments. Initial runs used a U-Net architecture with a ResNet backbone and weights pretrained on ImageNet. I ultimately settled on a multi-scale attention network (MA-Net) architecture with a SegFormer backbone, again with ImageNet-pretrained weights. Throughout the experimentation process, I tested varying combinations of architectures (U-Net, MA-Net, and FPN), backbones (ResNet-34/50/101/152 and SegFormer mit_b1-b3), weights (pretrained or randomly initialized), batch sizes (mostly dependent on model size and tile height/width), loss functions (Dice or Jaccard), and image augmentations (crop, flip, rotate, etc.). A record with most of my progress is saved here.
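The combinations described above form a sweep that is easy to enumerate programmatically. This is an illustrative sketch, not the repo's actual experiment runner; the name strings follow segmentation_models.pytorch conventions but are assumptions here.

```python
from itertools import product

# Hypothetical experiment grid over the options tested in this project
architectures = ["unet", "manet", "fpn"]
backbones = ["resnet34", "resnet50", "resnet101", "resnet152",
             "mit_b1", "mit_b2", "mit_b3"]
losses = ["dice", "jaccard"]

experiments = [
    {"arch": a, "backbone": b, "loss": l}
    for a, b, l in product(architectures, backbones, losses)
]
```

In practice not every cell of the grid is worth running (e.g. batch size must shrink for the larger backbones), but enumerating it makes record-keeping straightforward.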
The full-image intersection over union (IoU) score for the final model was 0.81. This is a strong score considering the complexity of the scene and the fact that the model was trained on a single, albeit very high resolution, aerial image. The model was able to generalize to new images, as shown here. It could also label the full image (patch into 1024 x 1024 tiles, augment, predict, and rebuild the full image) in about 24 seconds, much faster than the original, poorer-performing (in both speed and segmentation quality) solution, which took approximately 36 minutes.
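The "rebuild" step of that pipeline stitches per-tile predictions back into one full-size label map. A minimal NumPy sketch, assuming each tile prediction is a class-probability map and overlapping regions are averaged before taking the argmax (the actual repo code may resolve overlaps differently):

```python
import numpy as np

def rebuild(pred_tiles, origins, full_shape, tile=4):
    """Stitch per-tile class-probability maps into a full label image.

    pred_tiles: list of (tile, tile, n_classes) arrays, one per (y, x) origin.
    Overlapping regions are averaged, then argmax picks the final class.
    """
    h, w = full_shape
    n_classes = pred_tiles[0].shape[-1]
    acc = np.zeros((h, w, n_classes))
    counts = np.zeros((h, w, 1))
    for p, (y, x) in zip(pred_tiles, origins):
        acc[y:y + tile, x:x + tile] += p
        counts[y:y + tile, x:x + tile] += 1
    return np.argmax(acc / counts, axis=-1)
```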
Architectures:
U-Net: https://arxiv.org/pdf/1505.04597.pdf
MA-Net: https://ieeexplore.ieee.org/abstract/document/9201310
FPN: https://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w4/Seferbekov_Feature_Pyramid_Network_CVPR_2018_paper.pdf
Encoders/Backbones:
ResNet: https://arxiv.org/pdf/1512.03385.pdf
SegFormer: https://arxiv.org/pdf/2105.15203.pdf
Segmentation models libraries
PyTorch: https://github.com/qubvel/segmentation_models.pytorch
TensorFlow: https://github.com/qubvel/segmentation_models
- Annotate the image, painting each building with its corresponding segment color (files found here)
- Convert annotated image into grayscale: notebook
- Cut the main image into smaller tiles and setup file directories: notebook
- Specify model parameters such as epochs, batch size, loss function, etc.
- Create conda environment and install necessary packages
- Open a terminal, activate your environment, navigate to either the PyTorch or TensorFlow folder, and run the corresponding main function. Training on GPU(s) is highly recommended.
- View results once the model finishes training.
I completed this analysis on a computer with the specs listed below. Training each model took approximately 2-4 hours with this hardware, depending on the model architecture, backbone, and batch size. I used a batch size of 6 for the U-Net and FPN models, but only 3 for the final MA-Net model, which was more memory intensive than the other two. This, combined with the larger 1024 x 1024 tile size, limited the batch size even on a capable GPU such as the RTX 3090 with 24GB of VRAM.
- CPU: Ryzen 9 5950X
- GPU: NVIDIA RTX 3090 24GB
- RAM: 64GB DDR4 3600MHz
- SSD: 2TB Sabrent NVMe M.2
- OS: Windows 11 Pro
- IDEs: PyCharm and VS Code
An interesting note about U-Net-style architectures is that they constrain tile input sizes: each spatial dimension must be divisible by 2 for every downsampling step in the encoder, which in practice pushes you toward power-of-2 sizes. In my case, I experimented with 512 x 512 and 1024 x 1024 tiles. I went with the larger tile size because it is ideal to preserve multiple classes in each tile, as well as to minimize the number of buildings split across tile boundaries. For example, a 512 x 512 tile may cut a building in half, whereas a 1024 x 1024 tile may capture the entire building. This is important because the model will have a more difficult time learning the building class if it is only ever presented with partial buildings.
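The size constraint can be checked up front before tiling. A small sketch; the assumption of 5 downsampling steps is typical of ResNet-backed U-Nets but depends on the encoder depth actually used.

```python
def valid_unet_size(size, downsamples=5):
    """A U-Net encoder with `downsamples` pooling steps halves the feature
    map that many times, so each input dimension must be divisible by
    2 ** downsamples (32 here) for the decoder to upsample back cleanly.
    """
    return size % (2 ** downsamples) == 0

# 512 and 1024 both pass with a 5-level encoder; 1000 does not.
```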