ENH: spatial partitioning of the GeoDataFrame

To make spatial joins, overlays, spatial predicates, reading from spatially partitioned datasets, etc. more efficient, we can have *spatially partitioned* dataframes: the bounds of each partition are known, so we can check from those bounds whether an operation needs to involve that partition at all.
GeoDataFrames can also be re-partitioned to optimize the bounds (minimize the overlap between partitions) as much as possible. This requires a costly initial shuffle, but it can pay off later.
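To illustrate the pruning idea, here is a minimal sketch in plain Python. It assumes each partition's bounds are stored as a `(minx, miny, maxx, maxy)` box; the names `Bounds`, `bounds_overlap`, `candidate_partitions` and `candidate_pairs` are hypothetical and not part of any existing API.

```python
from typing import NamedTuple


class Bounds(NamedTuple):
    """Axis-aligned bounding box of one partition (hypothetical helper)."""
    minx: float
    miny: float
    maxx: float
    maxy: float


def bounds_overlap(a: Bounds, b: Bounds) -> bool:
    """True if the two boxes share at least one point (touching counts)."""
    return (
        a.minx <= b.maxx and b.minx <= a.maxx
        and a.miny <= b.maxy and b.miny <= a.maxy
    )


def candidate_partitions(partition_bounds, query):
    """Indices of partitions that an operation on `query` must look at."""
    return [i for i, b in enumerate(partition_bounds) if bounds_overlap(b, query)]


def candidate_pairs(left_bounds, right_bounds):
    """Partition pairs that a spatial join would need to compare."""
    return [
        (i, j)
        for i, a in enumerate(left_bounds)
        for j, b in enumerate(right_bounds)
        if bounds_overlap(a, b)
    ]


# Four partitions laid out as a 2x2 grid (illustrative values only).
partition_bounds = [
    Bounds(0, 0, 10, 10),
    Bounds(10, 0, 20, 10),
    Bounds(0, 10, 10, 20),
    Bounds(10, 10, 20, 20),
]

# A query box that falls inside the first partition only.
selected = candidate_partitions(partition_bounds, Bounds(2, 2, 5, 5))

# Joining against a single right-hand partition near the top-right corner.
pairs = candidate_pairs(partition_bounds, [Bounds(12, 12, 18, 18)])
```

The less the partition bounds overlap, the fewer candidates such a check returns, which is why the re-partitioning step can be worth its cost.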

This complicates the implementation (we need to keep track of the spatial partitioning, the partitions can change during spatial operations, etc.), but I think it will also be critical for improving performance on large datasets.

---

How can we add this?

In the previous iteration at https://github.com/mrocklin/dask-geopandas, the dataframes had an additional `_regions` attribute: a `geopandas.GeoSeries` holding the "region" of each partition (so `len(regions) == npartitions`).

See https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L50

I think one advantage of using a GeoSeries is that it is easy to work with (e.g. it is easy to check which partitions would intersect with a given geometry).
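As a rough sketch of that check, assuming the regions are held in a plain `geopandas.GeoSeries` with one geometry per partition (the `regions` values and the `partitions_intersecting` helper are made up for illustration and are not the code from the linked repository):

```python
import geopandas
from shapely.geometry import Point, box

# Hypothetical regions for a dataframe with three partitions.
regions = geopandas.GeoSeries(
    [box(0, 0, 10, 10), box(10, 0, 20, 10), box(20, 0, 30, 10)]
)


def partitions_intersecting(regions, geometry):
    """Positions of the partitions whose region intersects `geometry`."""
    mask = regions.intersects(geometry)
    return [i for i, hit in enumerate(mask) if hit]


# A disc of radius 3 around (12, 5) straddles the first two regions.
query = Point(12, 5).buffer(3)
hits = partitions_intersecting(regions, query)
```

Because the regions are real geometries rather than boxes, the same call would work unchanged if the regions were convex hulls or other more fine-grained shapes.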

In `spatialpandas` (https://github.com/holoviz/spatialpandas), there is a combo of `partition_bounds` and `partition_sindex`.
The `partition_bounds` is basically the `total_bounds` of each partition (so you could see it as the `_regions`, but limited to a rectangular box and stored as the 4 numbers (minx, miny, maxx, maxy)). And then `partition_sindex` is a spatial index built on the `partition_bounds`.
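The idea can be sketched with NumPy as follows. This is not the `spatialpandas` implementation: the helpers `total_bounds` and `query_partitions` are hypothetical, the partitions are reduced to arrays of point coordinates, and a vectorized linear scan stands in for a real spatial index.

```python
import numpy as np


def total_bounds(xy):
    """(minx, miny, maxx, maxy) of an (n, 2) array of point coordinates."""
    return np.concatenate([xy.min(axis=0), xy.max(axis=0)])


def query_partitions(partition_bounds, query):
    """Positions of the rows of `partition_bounds` that overlap `query`.

    A linear scan over all partitions; a spatial index built on the same
    boxes would answer this without touching every row.
    """
    minx, miny, maxx, maxy = query
    b = partition_bounds
    mask = (
        (b[:, 0] <= maxx) & (b[:, 2] >= minx)
        & (b[:, 1] <= maxy) & (b[:, 3] >= miny)
    )
    return np.flatnonzero(mask)


# Three illustrative partitions, each holding two points.
partitions = [
    np.array([[0.0, 0.0], [4.0, 3.0]]),
    np.array([[5.0, 1.0], [9.0, 6.0]]),
    np.array([[2.0, 7.0], [8.0, 9.0]]),
]

# One row of (minx, miny, maxx, maxy) per partition.
partition_bounds = np.vstack([total_bounds(p) for p in partitions])

hits = query_partitions(partition_bounds, (3.0, 2.0, 6.0, 4.0))
```

Storing only four numbers per partition keeps the metadata small and cheap to serialize, at the price of being coarser than arbitrary region geometries.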

See https://github.com/holoviz/spatialpandas/blob/master/spatialpandas/dask.py

---

I suppose starting with basic "partition bounds" should be fine, and it allows us to expand this later with a spatial index or with more fine-grained shapes.



