ScalarValue::compacted() does not free unused view-buffer memory for Utf8View / BinaryView arrays #21928

@bert-beyondloops

Describe the bug

ScalarValue::compacted() is documented as producing a scalar that minimises its memory footprint by discarding unreferenced array data. For most array types this works correctly, but for Utf8View and BinaryView (and any container type — Struct, List, LargeList, … — whose leaf values have a view type), the method silently fails to release the original buffer allocation. The scalar continues to hold a live Arc reference into the source batch, keeping the entire batch allocation alive for as long as the scalar exists.

ScalarValue::compacted() eventually calls copy_array_data, which for view-based arrays Arc-clones the existing data buffers rather than copying the bytes that the scalar actually references. View arrays can carry multiple large, discontiguous data buffers. Each view is a 128-bit descriptor: values of up to 12 bytes are stored inline in the view itself, while longer values are referenced by buffer index and offset, so a single short string may point at a tiny slice deep inside a 64 MiB buffer. After compacted(), the Arc count of those buffers is incremented by one, but the allocations themselves are unchanged.
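The retention described above can be illustrated with a small std-only model. This is a sketch under stated assumptions, not DataFusion or arrow-rs code: `View`, `ToyViewArray`, `take_one_sharing_buffers` and `shared_footprint` are hypothetical names, and the toy view ignores the inline representation that real views use for short values.

```rust
use std::sync::Arc;

/// Toy stand-in for a view descriptor (hypothetical; not arrow-rs's layout).
#[derive(Clone, Copy)]
struct View {
    buffer_index: usize,
    offset: usize,
    len: usize,
}

/// Toy view array: descriptors plus shared, reference-counted data buffers.
struct ToyViewArray {
    views: Vec<View>,
    buffers: Vec<Arc<Vec<u8>>>,
}

impl ToyViewArray {
    /// Keep one element but Arc-clone every data buffer, mirroring the
    /// behaviour the issue describes: no bytes are copied.
    fn take_one_sharing_buffers(&self, i: usize) -> ToyViewArray {
        ToyViewArray {
            views: vec![self.views[i]],
            buffers: self.buffers.clone(), // clones the Arcs only
        }
    }

    /// Bytes of the i-th value.
    fn value(&self, i: usize) -> &[u8] {
        let v = self.views[i];
        &self.buffers[v.buffer_index][v.offset..v.offset + v.len]
    }

    /// Total size of the buffers this array keeps alive.
    fn retained_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }
}

/// Returns (bytes the scalar needs, bytes it keeps alive).
fn shared_footprint(buffer_len: usize, value_len: usize) -> (usize, usize) {
    let batch = ToyViewArray {
        views: vec![View { buffer_index: 0, offset: buffer_len / 2, len: value_len }],
        buffers: vec![Arc::new(vec![b'x'; buffer_len])],
    };
    let scalar = batch.take_one_sharing_buffers(0);
    drop(batch); // the "source batch" goes away, its buffer does not
    (scalar.value(0).len(), scalar.retained_bytes())
}

fn main() {
    let (needed, retained) = shared_footprint(1 << 20, 20);
    println!("needed {} bytes, retained {} bytes", needed, retained);
}
```

In this model the one-element array needs 20 bytes but keeps the full 1 MiB buffer alive, because only the `Arc` was cloned.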

The correct primitive is StringViewArray::gc() / BinaryViewArray::gc(), which copies only the live bytes into a fresh, right-sized allocation and drops the originals. DataFusion's ScalarValue::compacted() never calls this method.
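What a gc-style compaction does can be sketched with the same kind of std-only model. The names below (`View`, `ToyViewArray`, `compact`, `compacted_footprint`) are hypothetical and only approximate the idea of copying the live bytes into a fresh, right-sized buffer. The real StringViewArray::gc() in arrow-rs may differ in detail, for example in how it treats inlined values.

```rust
use std::sync::Arc;

/// Toy view descriptor (hypothetical; real views are 16 bytes and may inline data).
struct View {
    buffer_index: usize,
    offset: usize,
    len: usize,
}

/// Toy view array: descriptors plus shared, reference-counted data buffers.
struct ToyViewArray {
    views: Vec<View>,
    buffers: Vec<Arc<Vec<u8>>>,
}

impl ToyViewArray {
    /// Bytes of the i-th value.
    fn value(&self, i: usize) -> &[u8] {
        let v = &self.views[i];
        &self.buffers[v.buffer_index][v.offset..v.offset + v.len]
    }

    /// Total size of the buffers this array keeps alive.
    fn retained_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }

    /// gc-style compaction: copy only the referenced bytes into one fresh
    /// buffer and rewrite each view to point into it.
    fn compact(&self) -> ToyViewArray {
        let total: usize = self.views.iter().map(|v| v.len).sum();
        let mut data = Vec::with_capacity(total);
        let mut views = Vec::with_capacity(self.views.len());
        for i in 0..self.views.len() {
            let bytes = self.value(i);
            views.push(View { buffer_index: 0, offset: data.len(), len: bytes.len() });
            data.extend_from_slice(bytes);
        }
        ToyViewArray { views, buffers: vec![Arc::new(data)] }
    }
}

/// Returns (bytes retained before, bytes retained after, values identical).
fn compacted_footprint(buffer_len: usize, value_len: usize) -> (usize, usize, bool) {
    let source = Arc::new((0..buffer_len).map(|i| (i % 251) as u8).collect::<Vec<u8>>());
    let scalar = ToyViewArray {
        views: vec![View { buffer_index: 0, offset: buffer_len / 2, len: value_len }],
        buffers: vec![Arc::clone(&source)],
    };
    let compacted = scalar.compact();
    let same = compacted.value(0) == scalar.value(0);
    let before = scalar.retained_bytes();
    drop(scalar);
    // The compacted array holds no reference to the original buffer.
    assert_eq!(Arc::strong_count(&source), 1);
    (before, compacted.retained_bytes(), same)
}

fn main() {
    let (before, after, same) = compacted_footprint(1 << 20, 20);
    println!("before {} bytes, after {} bytes, same value: {}", before, after, same);
}
```

The compacted array retains only the 20 bytes it references, and the original buffer can be freed as soon as its other owners drop it.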

To Reproduce

No response

Expected behavior

After scalar.compacted(), the scalar's total heap allocation should be proportional to the data it actually contains — not to the source batch it was originally extracted from.

Additional context

No response
