Description
To make spatial joins or overlays, spatial predicates, reading from spatially partitioned datasets, etc. more efficient, we can have spatially partitioned dataframes: the bounds of each partition are known, so it can be checked from those bounds whether an operation needs to involve that partition or not.
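As a rough illustration of this pruning idea, the sketch below assumes each partition's bounds are stored as a `(minx, miny, maxx, maxy)` tuple and lists the partition pairs a spatial join would actually need to process. The names `boxes_intersect` and `candidate_partition_pairs` are hypothetical and not part of any existing API.

```python
from typing import List, Tuple

Bounds = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def boxes_intersect(a: Bounds, b: Bounds) -> bool:
    """Return True if two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_partition_pairs(
    left: List[Bounds], right: List[Bounds]
) -> List[Tuple[int, int]]:
    """Pairs (i, j) of partitions whose bounds overlap; other pairs can be skipped."""
    return [
        (i, j)
        for i, lb in enumerate(left)
        for j, rb in enumerate(right)
        if boxes_intersect(lb, rb)
    ]


# Hypothetical bounds for two spatially partitioned dataframes.
left_bounds = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20)]
right_bounds = [(12, 2, 18, 8), (30, 30, 40, 40)]

pairs = candidate_partition_pairs(left_bounds, right_bounds)
print(pairs)
```

With these made-up bounds, only one of the six possible partition pairs has to be joined; the others are ruled out without touching their data.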
Geodataframes can then also be re-partitioned to optimize the bounds (minimize the overlap) as much as possible; this is an initially costly shuffle operation, but it can pay off later.
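To make the effect of such a shuffle concrete, here is a toy sketch. It assumes one possible strategy that is not prescribed above: sort points by a Morton (Z-order) code and cut the sorted sequence into equal-sized partitions. All function names are hypothetical, and the data is a small integer grid chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Bounds = Tuple[int, int, int, int]  # (minx, miny, maxx, maxy)


def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y (Z-order), so nearby points get nearby codes."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def partition_bounds(points: List[Point], size: int) -> List[Bounds]:
    """Split points into consecutive chunks of `size`; return each chunk's bounding box."""
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    return [
        (min(x for x, _ in c), min(y for _, y in c),
         max(x for x, _ in c), max(y for _, y in c))
        for c in chunks
    ]


def count_overlapping_pairs(bounds: List[Bounds]) -> int:
    """Number of partition pairs whose bounding boxes intersect."""
    return sum(
        a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
        for i, a in enumerate(bounds)
        for b in bounds[i + 1:]
    )


# A 4x4 grid of integer points in a deliberately scattered order.
grid = [(k % 4, k // 4) for k in range(16)]
scattered = [grid[(i * 5) % 16] for i in range(16)]
before = count_overlapping_pairs(partition_bounds(scattered, size=4))

# The "shuffle": sort by Morton code so each chunk covers one compact quadrant.
shuffled = sorted(scattered, key=lambda p: morton_code(*p))
after = count_overlapping_pairs(partition_bounds(shuffled, size=4))

print(before, after)
```

In this toy case every partition initially spans the whole grid, so all partition pairs overlap; after sorting, each partition covers one quadrant and none overlap. Real data will rarely separate this cleanly, but the overlap should still shrink.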
This complicates the implementation (we need to keep track of the spatial partitioning, the partitions can change during spatial operations, etc.), but I think it will also be critical for improving performance on large datasets.
How can we add this?
In the previous iteration at https://github.com/mrocklin/dask-geopandas, the dataframes had an additional `_regions` attribute, which was a `geopandas.GeoSeries` with the "regions" of each partition (so `len(regions) == npartitions`).
I think one advantage of using a GeoSeries is that it is easy to work with (e.g. it is easy to check which partitions would intersect with a given geometry).
In spatialpandas (https://github.com/holoviz/spatialpandas), there is a combo of `partition_bounds` and `partition_sindex`.
The `partition_bounds` is basically the `total_bounds` of each partition (so you could see it as the `_regions`, but limited to a rectangular box and stored as the 4 numbers (minx, miny, maxx, maxy)). And then `partition_sindex` is a spatial index built on the `partition_bounds`.
See https://github.com/holoviz/spatialpandas/blob/master/spatialpandas/dask.py
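The following is a minimal pure-Python sketch of what such a table of partition bounds could look like; it is not spatialpandas' actual implementation. It assumes each geometry is represented only by its own bounding box, and the helper names `total_bounds` (mirroring the attribute name mentioned above) and `partitions_to_read` are hypothetical.

```python
from typing import List, Sequence, Tuple

Bounds = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def total_bounds(geometry_bounds: Sequence[Bounds]) -> Bounds:
    """Smallest box covering every geometry's bounding box in one partition."""
    minxs, minys, maxxs, maxys = zip(*geometry_bounds)
    return (min(minxs), min(minys), max(maxxs), max(maxys))


def partitions_to_read(bounds: List[Bounds], query: Bounds) -> List[int]:
    """Indices of partitions whose box intersects the query box (linear scan)."""
    qminx, qminy, qmaxx, qmaxy = query
    return [
        i
        for i, (minx, miny, maxx, maxy) in enumerate(bounds)
        if minx <= qmaxx and qminx <= maxx and miny <= qmaxy and qminy <= maxy
    ]


# Hypothetical per-geometry bounds, grouped by partition.
partitions = [
    [(0, 0, 1, 1), (2, 3, 4, 5)],
    [(10, 10, 12, 11), (11, 9, 13, 12)],
    [(3, 4, 6, 8)],
]

# One (minx, miny, maxx, maxy) row per partition.
partition_bounds = [total_bounds(p) for p in partitions]
print(partition_bounds)

selected = partitions_to_read(partition_bounds, query=(3.5, 4.5, 5, 6))
print(selected)
```

The linear scan in `partitions_to_read` is the part a spatial index over the partition boxes would replace. The box test is conservative: a selected partition may still contain no matching geometry, but a skipped one never does.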
I suppose starting with a basic "partition bounds" should be fine, and it allows expanding later with a spatial index or with more fine-grained shapes.