DataFrame subclass type lost when assigning with unknown divisions #1180

@TomAugspurger

Describe the issue:

When assigning a new column to a dask object, it seems like the concrete subtype (e.g. geopandas.GeoDataFrame) is lost.

Minimal Complete Verifiable Example:

import dask.array
import dask.dataframe
import dask_geopandas
import geopandas
import pandas as pd

df = geopandas.GeoDataFrame({"geometry": geopandas.points_from_xy([0, 0], [0, 1])})
ddf = dask_geopandas.from_geopandas(df, npartitions=2)
ddf = ddf.clear_divisions()  # this is important

b = dask.dataframe.from_dask_array(dask.array.zeros((2,), chunks=(1, 1)), index=ddf.index)
ddf.assign(a=b).geometry.x.compute()  ## error

This raises:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/x7/__bs9yvx21qbvzb17sj4qsh40000gn/T/ipykernel_95282/3433075730.py in ?()
      8 ddf = dask_geopandas.from_geopandas(df, npartitions=2)
      9 ddf = ddf.clear_divisions()  # this is important
     10 
     11 b = dask.dataframe.from_dask_array(dask.array.zeros((2,), chunks=(1, 1)), index=ddf.index)
---> 12 ddf.assign(a=b).geometry.x.compute()  ## error

~/gh/TomAugspurger/dask-geopandas-spatial-partitioning/.direnv/python-3.12/lib/python3.12/site-packages/dask_expr/_collection.py in ?(self, fuse, concatenate, **kwargs)
    476         out = self
    477         if not isinstance(out, Scalar) and concatenate:
    478             out = out.repartition(npartitions=1)
    479         out = out.optimize(fuse=fuse)
--> 480         return DaskMethodsMixin.compute(out, **kwargs)

~/gh/TomAugspurger/dask-geopandas-spatial-partitioning/.direnv/python-3.12/lib/python3.12/site-packages/dask/base.py in ?(self, **kwargs)
    368         See Also
    369         --------
    370         dask.compute
    371         """
--> 372         (result,) = compute(self, traverse=False, **kwargs)
    373         return result

~/gh/TomAugspurger/dask-geopandas-spatial-partitioning/.direnv/python-3.12/lib/python3.12/site-packages/dask/base.py in ?(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    656         keys.append(x.__dask_keys__())
    657         postcomputes.append(x.__dask_postcompute__())
    658 
    659     with shorten_traceback():
--> 660         results = schedule(dsk, keys, **kwargs)
    661 
    662     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

~/gh/TomAugspurger/dask-geopandas-spatial-partitioning/.direnv/python-3.12/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6295             and name not in self._accessors
   6296             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6297         ):
   6298             return self[name]
-> 6299         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'x'

geopandas.GeoSeries exposes .x and .y for point geometries. After the assign, the geometry column comes back as a regular pandas.Series, which has no such attributes, hence the error.

Anything else we need to know?:

Having unknown divisions does seem to be necessary: commenting out the ddf = ddf.clear_divisions() line makes the error go away. So I think we can probably narrow the search to AssignAlign (and not Assign).

Environment:

  • Dask version: 2024.12.0

  • dask-expr from main @ d7577a2

  • Python version:

  • Operating System:

  • Install method (conda, pip, source):
