Description
Summary
Large scenes are often rotated relative to the grid of their CRS, leaving nodata regions around the edges, like so:
When using our GeoSamplers, we often sample from the nodata regions around the edges.
Rationale
Sampling these nodata-only patches results in slower I/O and slower GPU training, despite these patches contributing nothing to the model.
Implementation
There are two places where we could potentially improve performance.
I/O
Best case scenario would be to avoid sampling these regions entirely. In #449, someone tried adding a check to the sampler that actually loads the patch, checks if it's entirely nodata pixels, and skips it if so. Unfortunately, this doesn't seem to work in parallel, and is slow since it needs to load each patch twice.
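For reference, the kind of check described above might look roughly like the following. This is only a sketch, not the code from #449: `is_all_nodata`, `filter_candidates`, and the `load_patch` callable are hypothetical names, and the nodata value of 0 is an assumption that depends on the dataset.

```python
import numpy as np


def is_all_nodata(patch: np.ndarray, nodata: float = 0.0) -> bool:
    """Return True if every pixel in the patch equals the nodata value."""
    # NaN never compares equal to itself, so it needs a separate branch.
    if np.isnan(nodata):
        return bool(np.isnan(patch).all())
    return bool((patch == nodata).all())


def filter_candidates(candidates, load_patch, nodata: float = 0.0):
    """Yield only candidates whose patch contains at least one valid pixel.

    `load_patch` is a hypothetical callable that reads the pixels for one
    candidate location. Each kept patch is read here and then read again by
    the dataset, which is the double-loading cost mentioned above.
    """
    for candidate in candidates:
        patch = load_patch(candidate)
        if not is_all_nodata(patch, nodata):
            yield candidate
```

Because the check needs the pixel values, it cannot run before the read, which is why it does not help with I/O cost.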
If there were a way to load the image in its native CRS (i.e., a square with no nodata pixels in the orientation it was taken in), this would solve all of our problems. I don't know of a way to do this.
GPU
This is actually easier to solve. We could add a feature to our data module base classes that removes all nodata-only images from each mini-batch inside transfer_batch_to_device or another step. This would result in variable batch sizes, but I don't think that's an issue.
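A minimal sketch of what that filtering step could look like, assuming batches are dicts with an "image" tensor of shape (B, C, H, W) and a nodata value of 0. The function name `drop_nodata_only` and its parameters are hypothetical, not an existing API:

```python
import torch


def drop_nodata_only(batch: dict, nodata: float = 0.0, key: str = "image") -> dict:
    """Remove samples whose image consists solely of nodata pixels."""
    images = batch[key]  # assumed shape: (B, C, H, W)
    # Keep a sample if at least one of its pixels differs from nodata.
    keep = (images != nodata).flatten(start_dim=1).any(dim=1)

    filtered = {}
    for name, value in batch.items():
        # Index every tensor that shares the batch dimension (e.g. masks).
        if isinstance(value, torch.Tensor) and value.shape[:1] == images.shape[:1]:
            filtered[name] = value[keep]
        else:
            # Non-tensor entries are passed through unchanged in this sketch;
            # per-sample lists would need the same filtering.
            filtered[name] = value
    return filtered
```

A function like this could be called from transfer_batch_to_device or a similar hook before the batch reaches the model. The batch size then varies from step to step, and a batch could in principle end up empty, which the training step would have to tolerate.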
Alternatives
No response
Additional information
This is a highly requested feature: