Describe the issue:
pandas 2.0 started splitting blocks to improve the performance of setitem when a full column is replaced. The split blocks are views into the original array, so the replaced column's old data stays referenced and is kept in memory even though it is no longer used.
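For illustration, here is a minimal pandas-only sketch of that splitting behavior. It pokes at private internals (`df._mgr`, `nblocks`, `.base`), so the exact block counts and layout may differ across pandas versions:

```python
import numpy as np
import pandas as pd

arr = np.random.random((1_000_000, 10))  # ten float64 columns, ~80 MB total
df = pd.DataFrame(arr, columns=list("abcdefghij"))
print(df._mgr.nblocks)  # 1: all ten columns share a single 2D block

df["b"] = 1.0  # replace a full column; pandas >= 2.0 splits the block
print(df._mgr.nblocks)  # 3: columns left of "b", the new "b", columns right of it

# The two split pieces are views: each keeps the full ~80 MB source array
# alive via .base, while the new "b" block owns its own small array.
for blk in df._mgr.blocks:
    base = blk.values.base
    print(blk.values.shape, None if base is None else base.nbytes)
```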
Minimal Complete Verifiable Example:
```python
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # unmanaged memory is reported on the distributed dashboard

# Ten float64 columns of 400 MB each (50M rows x 8 bytes)
ddf = dd.from_array(da.random.random((50_000_000, 10)), columns=list("abcdefghij"))
ddf["b"] = 1  # full-column setitem: pandas 2.0 splits the underlying block
# ddf = ddf.rename(columns={"a": "x"})  # uncommenting this releases the memory, see below
ddf = ddf.persist()  # assign the result so the persisted data stays alive
```
cc @crusaderky, we chatted offline about this last week. Anything we can do here? Should this be counted as managed memory?
Uncommenting the rename triggers a deep copy before we persist, which brings the unmanaged memory back down.
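Continuing the pandas sketch above, a deep copy releases the retained memory. `DataFrame.copy()` is used here as a stand-in for whatever copy the rename triggers (an assumption on my part):

```python
# Assumption: copy() materializes fresh arrays for exactly the columns the
# frame still uses, dropping the reference to the original oversized array.
df = df.copy()
for blk in df._mgr.blocks:
    print(blk.values.shape, blk.values.base)  # base should now be None
```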
Anything else we need to know?:
Environment:
- Dask version: 2023.04
- pandas version: 2.0
- Python version: 3.10
- Operating System: macOS
- Install method (conda, pip, source): conda