
Unmanaged memory because of block splitting in pandas #7800

Open
@phofl

Description

Describe the issue:

pandas 2.0 started splitting blocks to improve the performance of setitem when a full column is replaced. The split blocks are views into the original consolidated array, so the replaced column's now-unused data stays alive in memory, which Dask reports as unmanaged memory.
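
For reference, the splitting can be observed with plain pandas. This is an illustrative sketch that pokes at the private block manager (._mgr / .blocks), so treat it as inspection only, not stable API (assumes pandas >= 2.0 with the default copy mode):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1_000_000, 10)), columns=list("abcdefghij"))
print(len(df._mgr.blocks))  # 1 consolidated float64 block

df["b"] = 1  # replace a full column
print(len(df._mgr.blocks))  # >1: the block was split rather than copied

# Some of the split blocks are views (ndarray.base is set) into the
# original 10-column array, so the whole array stays allocated.
print([blk.values.base is not None for blk in df._mgr.blocks])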

Minimal Complete Verifiable Example:

import dask.array as da
import dask.dataframe as dd

# Create 10 float64 columns of ~400 MB each (50_000_000 rows * 8 bytes)
ddf = dd.from_array(da.random.random((50_000_000, 10)), columns=list("abcdefghij"))

# Replacing a full column makes pandas split the underlying block;
# the split pieces keep the original consolidated array alive.
ddf["b"] = 1
# Uncommenting the rename forces a deep copy and releases that memory:
# ddf = ddf.rename(columns={"a": "x"})
ddf.persist()

cc @crusaderky, we chatted offline about this last week. Is there anything we can do here? Should this be counted as managed memory?
A rename triggers a deep copy before we persist, which brings the unmanaged memory down.
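
One possible mitigation along the same lines as the rename trick (a sketch I haven't validated, not an established fix) would be to deep-copy each partition before persisting:

# Hypothetical workaround: a deep copy per partition drops the views into
# the original consolidated array, so only the live data is retained.
ddf = ddf.map_partitions(lambda df: df.copy(deep=True))
ddf = ddf.persist()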

Anything else we need to know?:

Environment:

  • Dask version: 2023.04
  • pandas version: 2.0
  • Python version: 3.10
  • Operating System: macOS
  • Install method (conda, pip, source): conda
