Description
I have two DataFrames that I want to apply an overlay (difference) on.
As GeoPandas uses an index internally, I use map_partitions to split the larger GeoDataFrame into partitions and run the overlay on several cores.
I once achieved this with intersection (github-project code).
On my current dataset, I use:
Left larger GeoDataFrame has 944'420 rows.
Right smaller GeoDataFrame has 265'691 rows.
From within my functions I call fast_difference:
import logging

import dask_geopandas as d_gpd
import geopandas as gpd


def difference_partitions(part: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Helper function to calculate the difference overlay with dask."""
    return gpd.overlay(part, right, how="difference", keep_geom_type=True, make_valid=True)


def fast_difference(left: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Execute the difference overlay with dask_geopandas."""
    # Split only the larger left frame; every partition is overlaid against the full right frame.
    left_dgdf = d_gpd.from_geopandas(left, npartitions=8)
    logging.info(left_dgdf.crs)
    return left_dgdf.map_partitions(difference_partitions, right).compute()
But it only runs on a single core. With the logging level set to DEBUG, I get:
pyproj - DEBUG - PROJ_ERROR: proj_create: unrecognized format / unknown name
at about the time I call the from_geopandas method. That might be the only hint I have. All vector files are in EPSG:25832 (verified via logging).
I tried 3, 4, 6, 8, and 16 partitions, in both run and debug modes. Judging from btop4win, several cores do see some activity (many cores sit at 10 to 30 %, a single core reaches up to 50 %), but I seldom see spikes in the overall load.