How to avoid nodata-only patches #1330

Open
@adamjstewart

Description

Summary

Large scenes often appear rotated within their bounding box because of the CRS, leaving nodata regions in the corners, like so:

[Image: LANDSAT_1million_20170531]

When using our GeoSamplers, we often sample from the nodata regions around the edges.

Rationale

Sampling these nodata-only patches results in slower I/O and slower GPU training, despite these patches contributing nothing to the model.

Implementation

There are two places where we could potentially improve performance.

I/O

The best-case scenario would be to avoid sampling these regions entirely. In #449, someone tried adding a check to the sampler that actually loads the patch, checks whether it consists entirely of nodata pixels, and skips it if so. Unfortunately, this doesn't seem to work in parallel, and it is slow because each patch has to be loaded twice: once for the check and once for training.
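For reference, the kind of check described above can be sketched roughly as follows. This is not the code from #449; the names is_nodata_only, skip_nodata_queries, and load_patch are hypothetical, and the sketch assumes each loaded sample is a dict with an "image" tensor and that nodata is a plain numeric fill value (not NaN). With a raster dataset, load_patch would presumably be something like the dataset's indexing by bounding box.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

import torch


def is_nodata_only(image: torch.Tensor, nodata: float = 0.0) -> bool:
    """Return True if every pixel in every band equals the nodata value.

    Assumes a numeric fill value; a NaN fill value would need torch.isnan.
    """
    return bool((image == nodata).all())


def skip_nodata_queries(
    queries: Iterable[Any],
    load_patch: Callable[[Any], Dict[str, torch.Tensor]],
    nodata: float = 0.0,
) -> Iterator[Any]:
    """Yield only the queries whose patch contains at least one valid pixel.

    `load_patch` is a hypothetical callable that reads the patch for a query.
    Every patch is read here and then read again by the data loader, which is
    the double-loading cost mentioned above.
    """
    for query in queries:
        sample = load_patch(query)
        if not is_nodata_only(sample["image"], nodata):
            yield query
```

Because the check happens inside the sampler, which runs in the main process, the extra reads are not spread across data loader workers, which may be part of why this approach is slow.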

If there were a way to load the image in its native CRS (i.e., a square with no nodata pixels in the orientation it was taken in), this would solve all of our problems. I don't know of a way to do this.

GPU

This is actually easier to solve. We could add a feature to our data module base classes that removes all nodata-only images from each mini-batch inside transfer_batch_to_device or a similar hook. This would result in variable batch sizes, but I don't think that's an issue.
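A minimal sketch of that filtering step, assuming each batch is a dict with an "image" tensor of shape (B, C, H, W) and a plain numeric nodata value. The function name drop_nodata_only and its parameters are hypothetical, not an existing API; a data module could call something like it from transfer_batch_to_device before moving the batch to the GPU.

```python
from typing import Any, Dict

import torch


def drop_nodata_only(
    batch: Dict[str, Any], nodata: float = 0.0, key: str = "image"
) -> Dict[str, Any]:
    """Remove samples whose image consists entirely of nodata pixels.

    Hypothetical helper: assumes batch[key] has shape (B, ...) and that
    nodata is a numeric fill value (not NaN).
    """
    images = batch[key]
    # keep[i] is True if sample i has at least one pixel that is not nodata
    keep = (images != nodata).flatten(start_dim=1).any(dim=1)
    keep_list = keep.tolist()

    filtered: Dict[str, Any] = {}
    for name, value in batch.items():
        if isinstance(value, torch.Tensor) and value.shape[:1] == keep.shape:
            # Per-sample tensor: index along the batch dimension
            filtered[name] = value[keep]
        elif isinstance(value, list) and len(value) == len(keep_list):
            # Per-sample metadata collated into a list
            filtered[name] = [v for v, k in zip(value, keep_list) if k]
        else:
            # Anything else is passed through unchanged
            filtered[name] = value
    return filtered
```

Note that if every image in a mini-batch is nodata-only, the result has a batch size of zero, so the training step would need to tolerate or skip empty batches.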

Alternatives

No response

Additional information

This is a highly requested feature:

Metadata

Assignees

No one assigned

    Labels

    datamodules (PyTorch Lightning datamodules), datasets (Geospatial or benchmark datasets), samplers (Samplers for indexing datasets)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
