Skip to content

feat: add multi level merge sort that will always fit in memory #15700

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Apr 13, 2025

Which issue does this PR close?

Rationale for this change

We need merge sort that does not fail with out of memory

What changes are included in this PR?

Implemented multi level merge sort on top of SortPreservingMergeStream that spill intermediate result when not enough memory.

How does it work:

When using the MultiLevelMerge you provide in memory streams and spill files,
each spill file contain the memory size of the record batch with the largest memory consumption.

Why is this important?

SortPreservingMergeStream uses BatchBuilder which grow and shrink memory based on the record batches that it get. however if there is not enough memory it will just fail.

this solution will reserve beforehand for each spill file the worst case scenerio for the record batch size so there will be no way that there is not enough memory mid sorting.

it will also try to reduce buffer size and number of streams to the minimum when there is not enough memory and will only fail if there is not enough memory for holding 2 record batches with no buffering in the stream

It can also be easily adjusted to allow for predefined maximum memory to merge stream

Are these changes tested?

yes added fuzz test to aggregate

Are there any user-facing changes?

not really


Related to #15610

@github-actions github-actions bot added the core Core DataFusion crate label Apr 13, 2025


#[tokio::test]
async fn test_low_cardinality() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails on main on OOM


for batch in batches_to_spill {
in_progress_file.append_batch(&batch)?;

*max_record_batch_size =
(*max_record_batch_size).max(batch.get_actually_used_size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not realistic to correctly know a batch's size after a roundtrip of spilling and reading back, with this get_actually_used_size() implementation. The actual implementation might give us some surprise. The implementation can get even more complex in the future, for example we might implement extra encodings for #14078, and the memory size of a batch after reading back can be harder to estimate.

Copy link
Contributor Author

@rluvaton rluvaton Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the actual array content before spill and after spill is different this function will always return the correct result regardless of the spill file format as we calculate the actual array content size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some type of arrays with complex internal buffer management, a simple example is:
Before spilling an StringView array has 10MB actual content, backed by 3 * 4MB buffer.
After spilling and reading back, the reader implementation decided to use 1 * 16MB buffer instead.
Different allocation policies caused different fragmentation status, and physical memory consumed varies.

Here are some real bugs found recently due to similar reasons (this explains why I'm worried about inconsistent memory size for logically equivalent batches):
#14644
#14823
#13377
Note they're only caused by primitive and string arrays, for more complex types like struct, array, or other nested types, I think it's even more likely to see such inconsistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to reproduce that so I can better answer, how do you create that string view array so it will cause what you said?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A complete solution for stable and safe sort with spill
2 participants