ENH: spatial partitioning of the GeoDataFrame #8

Open
@jorisvandenbossche

To make spatial joins, overlays, spatial predicates, reading from spatially partitioned datasets, etc. more efficient, we can have spatially partitioned dataframes: the bounds of each partition are known, and thus it can be checked based on those bounds whether an operation needs to involve that partition or not.
Geodataframes can then also be re-partitioned to optimize the bounds (minimize the overlap between partitions) as much as possible. This is an initially costly shuffle operation, but it can pay off later.
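
To make the idea of "minimizing the overlap" concrete, here is a minimal pure-Python sketch. It is not how this library does or should do it: the helper names (`total_bounds`, `overlap_area`, `total_overlap`, `repartition_by_x`) are hypothetical, and sorting on the x coordinate is only a stand-in assumption for whatever shuffle strategy ends up being used.

```python
from itertools import combinations


def total_bounds(points):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def overlap_area(a, b):
    """Area shared by two (minx, miny, maxx, maxy) boxes, 0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def total_overlap(partitions):
    """Sum of pairwise overlap areas between the bounds of all partitions."""
    boxes = [total_bounds(part) for part in partitions]
    return sum(overlap_area(a, b) for a, b in combinations(boxes, 2))


def repartition_by_x(partitions, npartitions):
    """Hypothetical shuffle: pool all points, sort by x, cut into equal chunks."""
    points = sorted(pt for part in partitions for pt in part)
    size = -(-len(points) // npartitions)  # ceiling division
    return [points[i:i + size] for i in range(0, len(points), size)]


# A 4x4 grid of points dealt alternately into two partitions: each partition
# spans the full x range, so their bounds overlap.
points = [(x, y) for x in range(4) for y in range(4)]
before = [points[0::2], points[1::2]]
after = repartition_by_x(before, 2)

print(total_overlap(before), total_overlap(after))
```

After the shuffle, each partition covers a separate band of x values, so a query that touches only one band needs to load only one partition.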

This complicates the implementation (we need to keep track of the spatial partitioning, the partitions can change during spatial operations, etc.), but I think it will also be critical for improving performance on large datasets.


How can we add this?

In the previous iteration at https://github.com/mrocklin/dask-geopandas, the dataframes had an additional _regions attribute, which was a geopandas.GeoSeries with the "regions" of each partition (so len(regions) == npartitions).

See https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L50

I think one advantage of using a GeoSeries is that it is easy to work with (e.g. it is easy to check which partitions would intersect with a given geometry).
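
As a rough illustration of that check (a sketch only, not the code from the linked repository): assuming `regions` is a plain `geopandas.GeoSeries` with one geometry per partition, the standard `GeoSeries.intersects` method gives a boolean mask over the partitions. The geometries and the `query` shape below are made up for the example.

```python
import geopandas
from shapely.geometry import box

# One region per partition (made-up boxes), so len(regions) == npartitions.
regions = geopandas.GeoSeries(
    [box(0, 0, 1, 1), box(2, 0, 3, 1), box(0, 2, 1, 3)]
)

# A query geometry that spans the first two regions but not the third.
query = box(0.5, 0.5, 2.5, 0.75)

# Boolean mask: which partitions could contain geometries touching the query?
mask = regions.intersects(query)
candidates = list(regions.index[mask])

print(candidates)
```

Only the partitions listed in `candidates` would need to take part in the operation; the others can be skipped without being loaded.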

In spatialpandas (https://github.com/holoviz/spatialpandas), there is a combo of partition_bounds and partition_sindex.
The partition_bounds is basically the total_bounds of each partition (so you could see it as the _regions, but limited to a rectangular box and stored as the 4 numbers (minx, miny, maxx, maxy)). The partition_sindex is then a spatial index built on the partition_bounds.
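
A minimal sketch of the rectangular variant, in plain Python and not taken from spatialpandas: each partition is reduced to its four bounds, and a query box is tested against them. The function names (`total_bounds`, `bounds_intersect`, `partitions_for`) are hypothetical, and the linear scan stands in for the lookup that a spatial index over the bounds would do faster.

```python
def total_bounds(points):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def bounds_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partitions_for(query_bounds, partition_bounds):
    """Indices of partitions whose bounds intersect the query box.

    Linear scan; a spatial index over partition_bounds would replace this
    loop when there are many partitions.
    """
    return [
        i
        for i, bounds in enumerate(partition_bounds)
        if bounds_intersect(bounds, query_bounds)
    ]


# Three made-up partitions of point data.
partitions = [
    [(0, 0), (1, 2)],
    [(5, 5), (7, 6)],
    [(0, 8), (2, 9)],
]
partition_bounds = [total_bounds(part) for part in partitions]

print(partitions_for((0.5, 1, 6, 5.5), partition_bounds))
```

Storing only four numbers per partition keeps the metadata small and cheap to compare, at the cost of being less precise than an arbitrary region geometry.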

See https://github.com/holoviz/spatialpandas/blob/master/spatialpandas/dask.py


I suppose starting with basic "partition bounds" should be fine, and it allows us to later expand this with a spatial index or with more fine-grained shapes.
