Describe the bug
ScalarValue::compacted() is documented as producing a scalar that minimises its memory footprint by discarding unreferenced array data. For most array types this works correctly, but for Utf8View and BinaryView (and any container type — Struct, List, LargeList, … — whose leaf values have a view type), the method silently fails to release the original buffer allocation. The scalar continues to hold a live Arc reference into the source batch, keeping the entire batch allocation alive for as long as the scalar exists.
ScalarValue::compacted() eventually calls copy_array_data, which for view-based arrays Arc-clones the existing data buffers rather than copying the bytes that the scalar actually references. View arrays can carry multiple large, discontiguous data buffers. Each view is a 128-bit descriptor: values of 12 bytes or fewer are stored inline, while longer values store a buffer index and offset, so a single short string may reference a tiny slice deep inside a 64 MiB buffer. After compacted(), the Arc count of those buffers has been incremented by one, but the allocations themselves are unchanged.
The correct primitive is StringViewArray::gc() / BinaryViewArray::gc(), which returns a new array with only the live bytes copied into fresh, right-sized buffers, so the original buffers can be freed once nothing else references them. DataFusion's ScalarValue::compacted() never calls this method.
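The difference between the two behaviours can be modelled without Arrow at all. The sketch below is a simplified stand-in, not DataFusion or arrow-rs code: `View`, `share_buffer` and `copy_live_bytes` are hypothetical names, and a single `Arc<Vec<u8>>` stands in for one data buffer of a view array.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a view: a shared buffer plus the range it references.
struct View {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl View {
    /// The bytes this view actually references.
    fn bytes(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Models the behaviour described above: the buffer is Arc-cloned,
    /// so the whole source allocation stays alive.
    fn share_buffer(&self) -> View {
        View {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset,
            len: self.len,
        }
    }

    /// Models a gc-style compaction: only the referenced bytes are copied
    /// into a fresh, right-sized buffer.
    fn copy_live_bytes(&self) -> View {
        View {
            buffer: Arc::new(self.bytes().to_vec()),
            offset: 0,
            len: self.len,
        }
    }
}

fn main() {
    // A 1 MiB source buffer with a 20-byte value somewhere inside it.
    let mut data = vec![0u8; 1 << 20];
    data[4096..4116].copy_from_slice(b"twenty bytes of text");
    let source = View { buffer: Arc::new(data), offset: 4096, len: 20 };

    let shared = source.share_buffer();
    let compacted = source.copy_live_bytes();

    println!("shared buffer len:    {}", shared.buffer.len());
    println!("compacted buffer len: {}", compacted.buffer.len());
    println!("strong count on source buffer: {}", Arc::strong_count(&source.buffer));
}
```

In this model, dropping `source` frees nothing while `shared` exists, whereas `compacted` holds only its own 20-byte buffer. That is the property the issue expects from compacted().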
To Reproduce
No response
Expected behavior
After scalar.compacted(), the scalar's total heap allocation should be proportional to the data it actually contains — not to the source batch it was originally extracted from.
Additional context
No response