Description
spatialpandas
Pandas and Dask extensions for vectorized spatial and geometric operations.
This proposal is a plan towards extracting the functionality of the spatial/geometric utilities developed in Datashader into this separate, general-purpose library. Some of the functionality here has now (as of mid-2020) been implemented as marked below, but much remains to be done!
Goals
This project has several goals:
- Provide a set of pandas ExtensionArrays for storing columns of discrete geometric objects (Points, Polygons, etc.). Unlike the GeoPandas GeoSeries, there will be a separate extension array for each geometry type (PointsArray, PolygonsArray, etc.), and the underlying representation for the entire array will be stored in a single contiguous ragged array that is suitable for fast processing using numba.
- (partially done; see below) Provide a collection of vectorized geometric operations on these extension arrays, including most of the same operations that are included in shapely/geopandas, but implemented in numba rather than relying on the GEOS and libspatialindex C++ libraries. These vectorized Numba implementations will support CPU thread-based parallelization, and will also provide the foundation for future GPU support.
- Provide Dask DataFrame extensions for spatially partitioning DataFrames containing geometry columns. Also provide Dask DataFrame/Series implementations of the same geometric operations that take advantage of spatial partitioning. This would effectively replace the Datashader SpatialPointsFrame and would support all geometry types, not only points.
- Provide round-trip serialization of pandas/dask data structures containing geometry columns to/from parquet. Writing a Dask DataFrame to a partitioned parquet file would optionally include Hilbert R-tree packing to optimize the partitions for later spatial access on read.
- Fast import/export to/from shapely and geopandas. This may rely on the pygeos library to interface directly with the GEOS objects, rather than calling shapely methods.
These features will make it very efficient for Datashader to process large geometry collections. They will also make it more efficient for HoloViews to perform linked selection on large spatially distributed datasets.
Non-goals
spatialpandas will be focused on geometry only, not geography. As such:
- No built-in support for loading data from geography-specific file formats
- No dependency on GDAL/fiona
- No coordinate reference frame logic
Features
Extension Arrays
The spatialpandas.geometry package will contain geometry classes and pandas extension arrays for each geometry type:
- Point/PointArray: single point / array of points, one point per element. Maps to shapely Point class.
- MultiPoint/MultiPointArray: multiple points / array of points, multiple points per element. Maps to shapely MultiPoint class.
- MultiLines/MultiLinesArray: One or more lines / array of lines, multiple lines per element. Maps to shapely LineString and MultiLineString classes.
- Ring/RingArray: Single closed ring / array of rings, one per element. Maps to shapely LinearRing class.
- Polygon/PolygonArray: One or more polygons, each with zero or more holes / array of polygons with holes, multiple per element. Maps to shapely Polygon/MultiPolygon classes.
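The ragged storage mentioned in the goals can be pictured with a small pure-Python sketch: one flat coordinate buffer plus an offsets array that marks where each element starts and stops. The class name RaggedCoords and its fields are hypothetical and are not the spatialpandas API; an actual implementation would presumably hold NumPy buffers that numba can process directly.

```python
from array import array


class RaggedCoords:
    """Hypothetical ragged container: one flat buffer plus element offsets."""

    def __init__(self, elements):
        # flat holds interleaved x0, y0, x1, y1, ... for every element
        self.flat = array("d")
        # offsets[i]:offsets[i + 1] is the slice of flat owned by element i
        self.offsets = array("q", [0])
        for coords in elements:
            self.flat.extend(coords)
            self.offsets.append(len(self.flat))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Only non-negative positions are handled in this sketch
        start, stop = self.offsets[i], self.offsets[i + 1]
        xs = self.flat[start:stop:2]
        ys = self.flat[start + 1:stop:2]
        return list(zip(xs, ys))


multipoints = RaggedCoords([
    [0.0, 0.0, 1.0, 1.0],            # element 0: two points
    [2.0, 2.0],                      # element 1: one point
    [3.0, 0.0, 4.0, 1.0, 5.0, 0.0],  # element 2: three points
])
print(len(multipoints), multipoints[2])
```

Because every element lives in the same two buffers, a loop over the whole column touches contiguous memory and never has to create a Python object per geometry.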
Spatial Index
- The spatialpandas.spatialindex module will contain a vectorized and parallel numba implementation of a Hilbert R-tree.
- Each extension array has an sindex property that holds a lazily generated spatial index.
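As a rough illustration of the idea behind Hilbert R-tree packing, the sketch below maps each bounding-box center to its distance along a Hilbert curve, sorts by that distance, and groups neighbours into leaves. The function names, the grid order, and the leaf size are assumptions made for this example; this is not the code in spatialpandas.spatialindex, and it builds only the leaf level, not a full tree.

```python
def hilbert_distance(x, y, order):
    """Distance of integer cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def pack_leaves(bounds, order=4, leaf_size=2):
    """Group boxes (xmin, ymin, xmax, ymax) into leaves by Hilbert order of centers."""
    n = 1 << order
    xmin = min(b[0] for b in bounds)
    ymin = min(b[1] for b in bounds)
    # Fall back to 1.0 so a degenerate extent does not divide by zero
    xspan = (max(b[2] for b in bounds) - xmin) or 1.0
    yspan = (max(b[3] for b in bounds) - ymin) or 1.0

    def key(i):
        b = bounds[i]
        cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        gx = min(n - 1, int((cx - xmin) / xspan * n))
        gy = min(n - 1, int((cy - ymin) / yspan * n))
        return hilbert_distance(gx, gy, order)

    ordered = sorted(range(len(bounds)), key=key)
    return [ordered[i:i + leaf_size] for i in range(0, len(ordered), leaf_size)]


boxes = [(0, 0, 1, 1), (9, 9, 10, 10), (0, 1, 1, 2), (9, 8, 10, 9)]
leaves = pack_leaves(boxes)
print(leaves)
```

Sorting along the curve tends to place spatially close boxes in the same leaf, which keeps leaf bounding boxes small and lets queries discard most leaves early.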
Extension Array Geometric Operations
The extension arrays above will have methods/properties for shapely-compatible geometric operations. These are implemented as parallel vectorized numba functions. Supportable operations include:
- area
- length
- bounds
- boundary
- buffer
- centroid
- convex_hull
- covers
- contains
- crosses
- difference
- disjoint
- distance
- envelope
- exterior
- hausdorff_distance
- interpolate
- intersection
- intersects_bounds
- minimum_rotated_rectangle
- overlaps
- project
- simplify
- union
- unary_union
- affine_transform
- rotate
- scale
- skew
- translate
- sjoin (spatial join) (see tools/sjoin) (partial, see intersection and spatial join support for Point arrays #21)
Only a minimal subset of these will be implemented by the core developers, but others can be added relatively easily by users and other contributors by starting with one of the implemented methods as an example, then adding code from other published libraries (but Numba-ized and Dask-ified if possible!).
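To give a sense of what such an operation looks like, here is a minimal sketch of area and bounds computed over a ragged buffer of closed rings. It is written as plain Python loops over flat arrays, which is the style numba compiles well; the function names and the buffer layout (interleaved x, y values with an offsets list) are assumptions for this example rather than the library's actual internals.

```python
def ring_areas(flat, offsets):
    """Unsigned area of each closed ring in a ragged buffer (shoelace formula)."""
    areas = []
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        total = 0.0
        # Walk consecutive vertex pairs; each ring repeats its first vertex last
        for j in range(start, stop - 2, 2):
            x0, y0 = flat[j], flat[j + 1]
            x1, y1 = flat[j + 2], flat[j + 3]
            total += x0 * y1 - x1 * y0
        areas.append(abs(total) / 2.0)
    return areas


def ring_bounds(flat, offsets):
    """(xmin, ymin, xmax, ymax) of each ring in the same ragged buffer."""
    out = []
    for i in range(len(offsets) - 1):
        xs = flat[offsets[i]:offsets[i + 1]:2]
        ys = flat[offsets[i] + 1:offsets[i + 1]:2]
        out.append((min(xs), min(ys), max(xs), max(ys)))
    return out


# Two rings: a 2x2 square and a 4-by-3 right triangle
flat = [0.0, 0.0, 2.0, 0.0, 2.0, 2.0, 0.0, 2.0, 0.0, 0.0,
        0.0, 0.0, 4.0, 0.0, 0.0, 3.0, 0.0, 0.0]
offsets = [0, 10, 18]
areas = ring_areas(flat, offsets)
bounds = ring_bounds(flat, offsets)
print(areas, bounds)
```

Each ring is independent of the others, so the outer loop is the natural place for thread-level parallelism.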
Pandas accessors
A custom pandas Series accessor is included to expose these geometric operations at the Series level. E.g.
- df.column.spatial.area returns a pandas Series with the same index as df, containing area values.
- df.column.spatial.cx is a geopandas-style spatial indexer for filtering a series spatially using a spatial index.
A custom pandas DataFrame accessor is included to track the current "active" geometry column for a DataFrame, and provide DataFrame-level operations. E.g.
- df.spatial.cx will filter a dataframe spatially based on the current active geometry column.
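The slicing syntax of a cx-style indexer can be sketched without pandas. The BoundsIndexer class below is a hypothetical stand-in: it keeps a list of bounding boxes and returns the positions of those that intersect the requested window. It scans every box, whereas the accessor described above would consult the spatial index and return a filtered Series or DataFrame.

```python
class BoundsIndexer:
    """Hypothetical stand-in for a cx-style indexer over bounding boxes."""

    def __init__(self, bounds):
        self.bounds = bounds  # list of (xmin, ymin, xmax, ymax)

    def __getitem__(self, key):
        xs, ys = key  # two slices: cx[xmin:xmax, ymin:ymax]
        # An open-ended slice side (None) means "no limit" in that direction
        x0 = float("-inf") if xs.start is None else xs.start
        x1 = float("inf") if xs.stop is None else xs.stop
        y0 = float("-inf") if ys.start is None else ys.start
        y1 = float("inf") if ys.stop is None else ys.stop
        return [
            i for i, (xmin, ymin, xmax, ymax) in enumerate(self.bounds)
            if xmin <= x1 and xmax >= x0 and ymin <= y1 and ymax >= y0
        ]


cx = BoundsIndexer([(0, 0, 1, 1), (5, 5, 6, 6), (0.5, 0.5, 2, 2)])
print(cx[0:1, 0:1])  # positions of boxes that touch the unit square
print(cx[4:, :])     # everything at or to the right of x = 4
```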
Dask accessors
A custom Dask Series accessor is included to expose geometric operations on a Dask geometry Series. E.g.
- ddf.column.spatial.area returns a dask Series with the same index as ddf, containing area values.
- The accessor also holds a spatial index of the bounding box of the shapes in each partition. This allows spatial operations (e.g. cx, spatial join) to skip processing entire partitions that will not contain relevant geometries.
- A custom dask DataFrame accessor is included that is exactly analogous to the pandas version.
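The partition-skipping behaviour can be sketched in a few lines. Here each partition is a plain list of bounding boxes and the partition-level bounds are precomputed once; all names are hypothetical, and a real Dask implementation would operate on lazily evaluated partitions rather than in-memory lists.

```python
def boxes_intersect(a, b):
    """True when boxes (xmin, ymin, xmax, ymax) a and b overlap or touch."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def total_bounds(boxes):
    """Bounding box that encloses every box in one partition."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def query_partitions(partitions, partition_bounds, query):
    """Return (partition, row) hits and the number of partitions skipped."""
    hits, skipped = [], 0
    for p, boxes in enumerate(partitions):
        # Cheap partition-level test first: skip the whole partition if it misses
        if not boxes_intersect(partition_bounds[p], query):
            skipped += 1
            continue
        hits.extend((p, r) for r, box in enumerate(boxes)
                    if boxes_intersect(box, query))
    return hits, skipped


partitions = [
    [(0, 0, 1, 1), (1, 1, 2, 2)],
    [(10, 10, 11, 11), (12, 12, 13, 13)],
]
partition_bounds = [total_bounds(p) for p in partitions]
hits, skipped = query_partitions(partitions, partition_bounds, (0.5, 0.5, 1.5, 1.5))
print(hits, skipped)
```

The benefit grows with spatial locality: the tighter each partition's bounding box, the more partitions a small query window can skip.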
Conversions
- Fast conversions to and from geopandas will be provided, potentially relying on pygeos.
Parquet support
- Reading and writing parquet for pandas DataFrames will be able to rely on the standard pandas parquet functions with extension array support.
- Special read/to parquet functions will be provided for Dask DataFrames.
- to_parquet will add extra metadata to the parquet file to specify the geometric bounds of each partition. There will also be the option for to_parquet to use Hilbert R-tree packing to optimize the partitions for later spatial access on read.
- read_parquet will read this partition bounding-box metadata and use it to pre-populate the spatial accessor for each geometry column with a partition-level spatial index.
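One way to picture the extra metadata is as a JSON payload stored under a key in the file's key/value metadata, which Parquet supports at the file level. The sketch below only encodes and decodes such a payload; the key name and the payload layout are invented for this example and are not the format spatialpandas actually writes.

```python
import json

METADATA_KEY = b"example.partition_bounds"  # hypothetical key name


def encode_partition_bounds(bounds_by_column):
    """Pack {column: [(xmin, ymin, xmax, ymax), ...]} into a metadata dict."""
    payload = json.dumps(bounds_by_column, sort_keys=True)
    return {METADATA_KEY: payload.encode("utf-8")}


def decode_partition_bounds(metadata):
    """Recover the per-partition bounds, or None when the key is absent."""
    raw = metadata.get(METADATA_KEY)
    if raw is None:
        return None
    decoded = json.loads(raw.decode("utf-8"))
    # JSON has no tuple type, so convert the inner lists back to tuples
    return {col: [tuple(box) for box in boxes] for col, boxes in decoded.items()}


original = {"geometry": [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 13.0, 13.0)]}
metadata = encode_partition_bounds(original)
restored = decode_partition_bounds(metadata)
print(restored)
```

On read, bounds recovered this way are enough to rebuild a partition-level index without opening any of the data pages.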
Compatibility
- We would aim to use consistent naming between geopandas and spatialpandas whenever possible. Since spatialpandas will rely on a DataFrame accessor rather than a DataFrame subclass, spatial properties and methods are under the accessor rather than on the DataFrame. For example, geodf.cx becomes spdf.spatial.cx.
- Eventually, a subclass compatibility layer could likely be added that would simply dispatch certain class methods/properties to the spatial accessor.
Testing
- The test suite will rely heavily on property testing (potentially with hypothesis) to compare the spatialpandas results to geopandas/shapely results.
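The shape of such a property test can be sketched with the standard library alone. Two substitutions are made here and should be read as assumptions: seeded random generation stands in for hypothesis (which would also shrink failing inputs), and a closed-form rectangle area stands in for the geopandas/shapely reference result.

```python
import math
import random


def shoelace_area(ring):
    """Unsigned area of a closed ring given as a list of (x, y) vertices."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def random_rectangle(rng):
    """Random axis-aligned rectangle as a closed ring, plus its known area."""
    x, y = rng.uniform(-100, 100), rng.uniform(-100, 100)
    w, h = rng.uniform(0.1, 50), rng.uniform(0.1, 50)
    ring = [(x, y), (x + w, y), (x + w, y + h), (x, y + h), (x, y)]
    return ring, w * h


def check_area_property(trials=200, seed=0):
    """Property: the computed area matches the reference for every generated input."""
    rng = random.Random(seed)
    for _ in range(trials):
        ring, expected = random_rectangle(rng)
        assert math.isclose(shoelace_area(ring), expected,
                            rel_tol=1e-9, abs_tol=1e-9)
    return trials


print(check_area_property())
```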