map_partitions (almost) only uses single core

I have two GeoDataFrames and want to compute an overlay (difference) between them.
As GeoPandas internally uses an index, I use `map_partitions` to partition the larger GeoDataFrame and run the overlay on several cores.

I once achieved this with intersection: [github-project code](https://github.com/sehHeiden/geospeed/blob/master/geospeed/dask_geopandas_speed.py).

On my current dataset:

The left (larger) GeoDataFrame has 944'420 rows.
The right (smaller) GeoDataFrame has 265'691 rows.

From within my functions I call `fast_difference`:

```python
import logging  # needed for the logging.info call in fast_difference

import dask_geopandas as d_gpd
import geopandas as gpd

def difference_partitions(part: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Helper function to calculate the difference overlay with dask."""
    return gpd.overlay(part, right, how="difference", keep_geom_type=True, make_valid=True)


def fast_difference(left: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Execute the difference overlay with dask_geopandas."""
    left_dgdf = d_gpd.from_geopandas(left, npartitions=8)
    logging.info(left_dgdf.crs)
    return left_dgdf.map_partitions(difference_partitions, right).compute()
```

But it only runs on a single core. With the logging level set to DEBUG, I get:
`pyproj - DEBUG - PROJ_ERROR: proj_create: unrecognized format / unknown name`

at about the time I call the `from_geopandas` method. That might be the only hint I have. All vector files are in EPSG:25832 (verified via logging).

I tried 3, 4, 6, 8, and 16 partitions in both run and debug modes. I assume multiple cores are used to some extent: in btop4win many cores sit at 10 to 30 %, while a single core reaches up to 50 %. I seldom see spikes in the overall load.
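
To put a number on the btop4win impression, a rough check could compare the CPU time of the Python process with the elapsed wall time. This is only a sketch using the standard library; `measure_parallelism` is a hypothetical helper name, not part of dask or GeoPandas. A ratio near 1 would suggest the work is effectively serial, while a ratio well above 1 would suggest several threads are busy at once. Note that `time.process_time` does not include child processes, so the ratio is only meaningful if the work runs in threads of the current process.

```python
import time
from typing import Any, Callable, Tuple


def measure_parallelism(func: Callable[[], Any]) -> Tuple[Any, float, float]:
    """Run func and return (result, wall_seconds, cpu_seconds / wall_seconds).

    Hypothetical helper for illustration only.
    """
    wall_start = time.perf_counter()
    # CPU time of the current process, summed over all of its threads.
    cpu_start = time.process_time()
    result = func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    # Guard against a zero wall time for extremely short calls.
    ratio = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, wall_used, ratio
```

It could be called as `result, wall, ratio = measure_parallelism(lambda: fast_difference(left, right))` and the ratio compared across the different partition counts.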

