Auto-Labeler for Aerial Imagery
Matthew Grigsby
Senior Data Scientist, IBM
[email protected]
Background:
Semantic segmentation of buildings belonging to distinct classes in a high-resolution aerial image (~6,500 x ~12,500 pixels at 0.6 m per pixel). This is an important task for many GIS applications, such as urban planning, disaster response, and environmental monitoring. The objective is to create a model that can label new images with minimal human intervention.
Goal of this repository:
Build an initial model to demonstrate that a single neural network using a modern segmentation architecture can complete this task. The resulting model should segment the image with reasonable accuracy and speed, and should generalize to new images.
Future goals:
Incorporate the model into a semi-supervised labeling system with minimal human-in-the-loop interaction as needed.
My approach to the challenge of labeling aerial imagery was to use state-of-the-art models developed for semantic segmentation tasks. The first step was cleaning up and finishing the annotations we were given (found here). Once I was satisfied with the quality of the resulting annotated image (the "ground truth"), I chopped it into overlapping tiles small enough for the model to ingest, since the original image was too large to process all at once given software/hardware/time constraints. Tiling the main image into overlapping pieces also produced a set of images large enough to ease sample-size concerns when training a first-draft model to prove out the efficacy of this approach.
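The tiling step can be sketched as follows. This is a minimal illustration, not the repository's actual tiling code; the tile size and stride values are examples (the final model used 1024 x 1024 tiles).

```python
def tile_origins(length, tile, stride):
    """Top-left coordinates of overlapping tiles along one image axis.

    The last tile is shifted back so it ends exactly at the image edge,
    guaranteeing full coverage without stepping out of bounds.
    """
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins

# Example: a 12500-pixel axis cut into 1024-px tiles with 50% overlap
ys = tile_origins(12500, tile=1024, stride=512)
```

Pairing the origins from both axes yields the full grid of overlapping tile windows to crop from the image and its mask.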
With the tiles created from the original image and mask, I removed any masks, and their corresponding images, that contained only the background class (class 0 in grayscale). Note: a valid alternate structure would be to omit the background class entirely (e.g. 0 would be the first building segment type). Once this was done, the tiles were ready for training and were moved into directories appropriate for my data loaders (TensorFlow's ImageDataGenerator.flow_from_directory method is particular about directory structure).
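The background-only filter amounts to dropping any tile whose mask never leaves class 0. A minimal sketch (the filenames and pixel lists below are hypothetical stand-ins for real tile files):

```python
def is_background_only(mask_pixels):
    """True if every pixel in the grayscale mask is the background class (0)."""
    return all(p == 0 for p in mask_pixels)

# Keep only (image, mask) tile pairs containing at least one building pixel
tiles = [("t0.png", [0, 0, 0]), ("t1.png", [0, 2, 0]), ("t2.png", [1, 1, 3])]
kept = [(img, mask) for img, mask in tiles if not is_background_only(mask)]
```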
I then built out the necessary classes for data loaders, the model, and the specifications for running experiments. Initial runs used a U-Net architecture with a ResNet backbone and weights pretrained on ImageNet. I ultimately settled on a multi-scale attention network (MA-Net) architecture with a SegFormer backbone, again with ImageNet-pretrained weights. Throughout the experimentation process, I tested varying combinations of architectures (U-Net, MA-Net, and FPN), backbones (ResNet-34/50/101/152 and SegFormer mit_b1-b3), weights (pretrained or randomly initialized), batch sizes (mostly dependent on model size and tile height/width), loss functions (Dice or Jaccard), and image augmentations (crop, flip, rotate, etc.). A record with most of my progress is saved here.
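The combinations described above form a sweep that is easy to enumerate programmatically. This is an illustrative sketch, not the repo's actual experiment runner; the name strings follow segmentation_models.pytorch conventions but are assumptions here.

```python
from itertools import product

# Hypothetical experiment grid over the options tested in this project
architectures = ["unet", "manet", "fpn"]
backbones = ["resnet34", "resnet50", "resnet101", "resnet152",
             "mit_b1", "mit_b2", "mit_b3"]
losses = ["dice", "jaccard"]

experiments = [
    {"arch": a, "backbone": b, "loss": l}
    for a, b, l in product(architectures, backbones, losses)
]
```

In practice not every cell of the grid is worth running (e.g. batch size must shrink for the larger backbones), but enumerating it makes record-keeping straightforward.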
The full-image intersection over union (IoU) score for the final model was 0.81. This is a strong score considering the complexity of the scene and the fact that the model was trained on a single, albeit very high resolution, aerial image. The model was able to generalize to new images, as shown here. It could also label the full image (patch into 1024 x 1024 tiles, augment, predict, and rebuild the full image) in about 24 seconds, much faster than the original, poorer-performing (in both speed and segmentation quality) solution, which took approximately 36 minutes.
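The "rebuild" step of that pipeline stitches per-tile predictions back into one full-size label map. A minimal NumPy sketch, assuming each tile prediction is a class-probability map and overlapping regions are averaged before taking the argmax (the actual repo code may resolve overlaps differently):

```python
import numpy as np

def rebuild(pred_tiles, origins, full_shape, tile=4):
    """Stitch per-tile class-probability maps into a full label image.

    pred_tiles: list of (tile, tile, n_classes) arrays, one per (y, x) origin.
    Overlapping regions are averaged, then argmax picks the final class.
    """
    h, w = full_shape
    n_classes = pred_tiles[0].shape[-1]
    acc = np.zeros((h, w, n_classes))
    counts = np.zeros((h, w, 1))
    for p, (y, x) in zip(pred_tiles, origins):
        acc[y:y + tile, x:x + tile] += p
        counts[y:y + tile, x:x + tile] += 1
    return np.argmax(acc / counts, axis=-1)
```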
Architectures:
U-Net: https://arxiv.org/pdf/1505.04597.pdf
MA-Net: https://ieeexplore.ieee.org/abstract/document/9201310
FPN: https://openaccess.thecvf.com/content_cvpr_2018_workshops/papers/w4/Seferbekov_Feature_Pyramid_Network_CVPR_2018_paper.pdf
Encoders/Backbones:
ResNet: https://arxiv.org/pdf/1512.03385.pdf
SegFormer: https://arxiv.org/pdf/2105.15203.pdf
Segmentation models libraries
PyTorch: https://github.com/qubvel/segmentation_models.pytorch
TensorFlow: https://github.com/qubvel/segmentation_models
- Annotate the image, painting each building with its corresponding segment color (files found here)
- Convert annotated image into grayscale: notebook
- Cut the main image into smaller tiles and setup file directories: notebook
- Specify model parameters such as epochs, batch size, loss function, etc.
- Create conda environment and install necessary packages
- Open a terminal, activate your environment, navigate to either the PyTorch or TensorFlow folder, and run the corresponding main function. Training on GPU(s) is highly recommended.
- View results once the model finishes training.
I completed this analysis on a computer with the specs listed below. Training each model took approximately 2-4 hours with this hardware, depending on the model architecture, backbone, and batch size. I used a batch size of 6 for the U-Net and FPN models, but only 3 for the final MA-Net model, which was more memory intensive than the other two. This, combined with the larger 1024 x 1024 tile size, limited the batch size even on a capable GPU such as the RTX 3090 with 24GB of VRAM.
- CPU: Ryzen 9 5950X
- GPU: NVIDIA RTX 3090 24GB
- RAM: 64GB DDR4 3600MHz
- SSD: 2TB Sabrent NVMe M.2
- OS: Windows 11 Pro
- IDEs: PyCharm and VS Code
An interesting note about U-Net-style architectures is that they constrain tile input sizes: each spatial dimension must be divisible by 2 for every downsampling step in the encoder, which in practice pushes you toward power-of-2 sizes. In my case, I experimented with 512 x 512 and 1024 x 1024 tiles. I went with the larger tile size because it is ideal to preserve multiple classes in each tile, as well as to minimize the number of buildings split across tile boundaries. For example, a 512 x 512 tile may cut a building in half, whereas a 1024 x 1024 tile may capture the entire building. This is important because the model will have a more difficult time learning the building class if it is only ever presented with partial buildings.
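The size constraint can be checked up front before tiling. A small sketch; the assumption of 5 downsampling steps is typical of ResNet-backed U-Nets but depends on the encoder depth actually used.

```python
def valid_unet_size(size, downsamples=5):
    """A U-Net encoder with `downsamples` pooling steps halves the feature
    map that many times, so each input dimension must be divisible by
    2 ** downsamples (32 here) for the decoder to upsample back cleanly.
    """
    return size % (2 ** downsamples) == 0

# 512 and 1024 both pass with a 5-level encoder; 1000 does not.
```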