
map_partitions (almost) only uses single core #319

Open
@sehHeiden

Description


I have two GeoDataFrames that I want to apply an overlay (difference) on.
As GeoPandas internally uses a spatial index, I try to use map_partitions to partition the larger GeoDataFrame and run the overlay on several cores.

I once achieved this with an intersection overlay (github-project code).

On my current dataset, I use:

Left larger GeoDataFrame has 944'420 rows.
Right smaller GeoDataFrame has 265'691 rows.

From within my functions I call fast_difference:

import logging

import dask_geopandas as d_gpd
import geopandas as gpd

def difference_partitions(part: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Helper function to calculate the difference overlay with dask."""
    return gpd.overlay(part, right, how="difference", keep_geom_type=True, make_valid=True)


def fast_difference(left: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Execute the difference overlay with dask_geopandas."""
    left_dgdf = d_gpd.from_geopandas(left, npartitions=8)
    logging.info(left_dgdf.crs)
    return left_dgdf.map_partitions(difference_partitions, right).compute()

But it only runs on a single core. When I set the logging level to DEBUG, I get:
pyproj - DEBUG - PROJ_ERROR: proj_create: unrecognized format / unknown name

around the time I call the from_geopandas method. That might be the only hint I got. All vector files are in EPSG:25832 (checked via logging).
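For anyone trying to reproduce the log line, here is a minimal sketch of a per-logger DEBUG setup using only the standard library. The format string is an assumption inferred from the `name - LEVEL - message` shape of the line above, and `make_debug_logger` and the logger name `demo` are hypothetical names used for illustration only. Attaching a handler to one named logger at a time can help narrow down which library emits a message.

```python
import io
import logging


def make_debug_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes DEBUG records to `stream` as 'name - LEVEL - message'."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    # Assumed format: it mirrors the shape of the pyproj line quoted in the issue.
    handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    # Keep the record out of the root logger so it is not printed twice.
    logger.propagate = False
    return logger


buffer = io.StringIO()
demo = make_debug_logger("demo", buffer)
demo.debug("message emitted around from_geopandas")
print(buffer.getvalue(), end="")
```

Passing `"pyproj"` as the name and `sys.stderr` as the stream would, under the same assumption about the format, show only the messages that pyproj itself logs.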

I tried 3, 4, 6, 8 and 16 partitions in run and debug modes. I assume multiple cores are used to some extent: in btop4win I see many cores at 10 to 30 % and a single core reaching up to 50 %, but I seldom see spikes in the overall load.
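Reading per-core percentages off a monitor is hard to interpret, so a rough numeric check may help. The sketch below compares CPU time with wall-clock time for a single call, using only the standard library. `cpu_to_wall_ratio` and `busy_work` are hypothetical names, not part of GeoPandas or dask. A ratio near 1 suggests roughly one busy core, and a ratio well above 1 suggests several threads working in parallel. Note that `time.process_time` only counts the current process, so work done in child processes would not show up.

```python
import time
from typing import Any, Callable, Tuple


def cpu_to_wall_ratio(func: Callable[[], Any]) -> Tuple[Any, float]:
    """Run `func` once and return (result, CPU seconds / wall-clock seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process, summed over its threads
    result = func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    ratio = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, ratio


def busy_work() -> int:
    """CPU-bound placeholder standing in for the real overlay call."""
    return sum(i * i for i in range(2_000_000))


if __name__ == "__main__":
    _, ratio = cpu_to_wall_ratio(busy_work)
    print(f"CPU/wall ratio: {ratio:.2f}")
```

To apply it to the code above, one could call `cpu_to_wall_ratio(lambda: fast_difference(left, right))` and compare the ratio across different partition counts.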
