to_feather fails with an odd error when trying to write a dataframe to disk.
>>> import dask.dataframe as dd
>>> import dask_geopandas as dgpd
>>> import geopandas as gpd
>>> import numpy as np
>>>
>>> dfs = []
>>> N = 5
>>> for i in range(3):
...     gs = gpd.points_from_xy(np.arange(N), np.arange(N), crs=5070)
...     df = gpd.GeoDataFrame({"data": np.full(N, i), "geometry": gs})
...     dfs.append(dgpd.from_geopandas(df, npartitions=1))
...
>>> ddf = dd.concat(dfs)
>>> print(ddf.compute())
   data     geometry
0     0  POINT (0 0)
1     0  POINT (1 1)
2     0  POINT (2 2)
3     0  POINT (3 3)
4     0  POINT (4 4)
0     1  POINT (0 0)
1     1  POINT (1 1)
2     1  POINT (2 2)
3     1  POINT (3 3)
4     1  POINT (4 4)
0     2  POINT (0 0)
1     2  POINT (1 1)
2     2  POINT (2 2)
3     2  POINT (3 3)
4     2  POINT (4 4)
>>> ddf.to_feather("test.feather")
Traceback (most recent call last):
  File "/var/mnt/fastdata02/mtbs/src/feather_error.py", line 13, in <module>
    ddf.to_feather("test.feather")
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/expr.py", line 682, in to_feather
    return to_feather(self, path, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/io/arrow.py", line 433, in to_feather
    return compute_as_if_collection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/base.py", line 399, in compute_as_if_collection
    return schedule(dsk2, keys, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/threaded.py", line 91, in get
    results = get_async(
              ^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/local.py", line 516, in get_async
    raise_exception(exc, tb)
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/local.py", line 324, in reraise
    raise exc
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/local.py", line 229, in execute_task
    result = task(data)
             ^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/_task_spec.py", line 741, in __call__
    return self.func(*new_argspec)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/utils.py", line 79, in apply
    return func(*args)
           ^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/io/arrow.py", line 149, in write_partition
    table = cls._pandas_to_arrow_table(df, preserve_index=None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/io/arrow.py", line 202, in _pandas_to_arrow_table
    table = _geopandas_to_arrow(df, index=preserve_index)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/geopandas/io/arrow.py", line 340, in _geopandas_to_arrow
    _validate_dataframe(df)
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/geopandas/io/arrow.py", line 242, in _validate_dataframe
    raise ValueError("Writing to Parquet/Feather only supports IO with DataFrames")
ValueError: Writing to Parquet/Feather only supports IO with DataFrames
If I add print(f"{type(df) = }\n{df}") at the start of geopandas.io.arrow._validate_dataframe, where the error occurs, I get the following:
type(df) = <class 'tuple'>
('concat-213dda7de847a70669800b78bbdabba7', 2)
type(df) = <class 'tuple'>
('concat-213dda7de847a70669800b78bbdabba7', 0)
type(df) = <class 'tuple'>
('concat-213dda7de847a70669800b78bbdabba7', 1)
It seems that the partitions are being evaluated to dask graph key tuples instead of the actual dataframes. I have run into similar issues with other dask_geopandas workflows, but those were intermittent and I could only trigger them rarely (roughly 1 in 1000 runs). This is the first case that reproduces reliably. It began when I updated my environment to include dask-expr.
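In case it helps anyone confirm the same symptom without editing files in site-packages, here is a minimal sketch of a check for key-like arguments. Both `looks_like_graph_key` and `report_graph_keys` are hypothetical helpers written for this report, not part of dask or geopandas. The check assumes that a collection key is a tuple of a string name followed by integer indices, which matches the tuples printed above but is only a heuristic.

```python
import functools


def looks_like_graph_key(obj):
    """Heuristic: True for tuples shaped like ('name-token', 0).

    Assumption: a dask collection key is a tuple whose first element is a
    string name and whose remaining elements are integer indices.
    """
    return (
        isinstance(obj, tuple)
        and len(obj) >= 2
        and isinstance(obj[0], str)
        and all(isinstance(i, int) for i in obj[1:])
    )


def report_graph_keys(func):
    """Wrap func so that key-like positional arguments are reported."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for pos, arg in enumerate(args):
            if looks_like_graph_key(arg):
                print(f"{func.__name__}: argument {pos} looks like an unresolved key: {arg!r}")
        # Always call through so the original behaviour (and error) is unchanged.
        return func(*args, **kwargs)

    return wrapper


# Hypothetical usage (not run here): patch the validator before calling to_feather.
# import geopandas.io.arrow as gpd_arrow
# gpd_arrow._validate_dataframe = report_graph_keys(gpd_arrow._validate_dataframe)
```

The wrapper only prints and then calls the original function, so the ValueError above would still be raised; the point is just to make the unresolved key visible in the output.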
Environment:
python 3.12.8 h9e4cc4f_1_cpython conda-forge
dask 2025.1.0 pyhd8ed1ab_0 conda-forge
dask-core 2025.1.0 pyhd8ed1ab_0 conda-forge
dask-expr 2.0.0 pyhd8ed1ab_0 conda-forge
dask-geopandas 0.4.3 pyhd8ed1ab_0 conda-forge
geopandas-base 1.0.1 pyha770c72_3 conda-forge