Cascaded spill merge and re-spill #15610
Conversation
BTW, row_hash uses the sort-preserving merge stream as well and has a similar problem; I think this should be solved outside the sort exec.
/// Maximum number of spill files to merge in a single pass
const MAX_SPILL_MERGE_DEGREE: usize = 8;
I think this should be configurable based on the number of available tokio blocking tasks.
Yeah, it's only for the POC.
Also, to have a fully working larger-than-memory sort, you need to spill in case the memory reservation is failing.
This PR and #15608 both implemented multi-level merge.
I think we should refine the existing PR to be:
To summarize, I think this PR needs to be restructured to make future optimizations easier to implement. I don't have a solid idea yet, so I'll keep thinking and also wait to hear more opinions.
I think the spilling-related problem in external aggregation is still larger-than-memory sort; the current aggregation implementation tries to re-implement the sort spilling logic which is already done in the sort operator.
Could you elaborate? I don't get it.
// Recursively merge spilled files
// ──────────────────────────────────────────────────────────────────────
let spill_files = std::mem::take(&mut self.finished_spill_files);
let spill_files = self.recursively_merge_spill_files(spill_files).await?;
Maybe we can avoid recursion here if we don't have to use it?
The maximum number of passes in multi-pass external merge sort is "Total Passes = 1 (initial run) + ⌈log_d(number of runs)⌉" for a d-way merge. We can use this information to convert the recursion into a for loop (recursion has several performance disadvantages compared to a loop).
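As an illustration of that formula, here is a minimal loop-based (non-recursive) sketch of the multi-pass merge; `SpillFile`, `merge_and_respill`, and the `Result` alias are hypothetical stand-ins, not types from this PR:

```rust
// Hypothetical sketch: bound the merge by ⌈log_d(runs)⌉ passes instead of recursing.
async fn merge_in_passes(
    mut spill_files: Vec<SpillFile>,
    max_degree: usize,
) -> Result<Vec<SpillFile>> {
    // For a d-way merge: total passes = 1 (initial run formation) + ⌈log_d(number of runs)⌉
    let max_passes = (spill_files.len() as f64)
        .log(max_degree as f64)
        .ceil() as usize;

    for _pass in 0..max_passes {
        if spill_files.len() <= max_degree {
            break;
        }
        let mut next_pass = Vec::with_capacity(spill_files.len() / max_degree + 1);
        for group in spill_files.chunks(max_degree) {
            // Sort-preserving-merge one group of runs and spill the result back to disk
            next_pass.push(merge_and_respill(group.to_vec()).await?);
        }
        spill_files = next_pass;
    }
    Ok(spill_files)
}
```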
/// Recursively merges and re-spills files until the number of spill files is ≤ MAX_SPILL_MERGE_DEGREE
async fn recursively_merge_spill_files(
I find this name misleading; it looks like this is not actually a recursive function. Maybe we can change it to something more descriptive?
Good point, updated. I changed my mind midway through implementing this function.
let sort_exprs: LexOrdering = self.expr.iter().cloned().collect();

// ==== Doing sort-preserving merge on input partially sorted streams ====
let spm_stream = StreamingMergeBuilder::new()
I wonder if this StreamingMergeBuilder uses a heap under the hood; using a heap is a common way to optimize external merge sort performance.
Yes, it's an in-house implementation of a loser-tree heap. If we don't limit the merge degree, this step is currently the bottleneck for large sort queries; maybe there is some room to optimize inside 🤔
It is already pretty well optimized (not that it couldn't be made better) but there isn't a lot of low hanging fruit in my opinion
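For readers unfamiliar with the technique being discussed, here is a toy heap-based k-way merge over plain integers; DataFusion's loser-tree implementation plays the same role but compares sort keys across record batches (this is an illustration, not the actual code):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy k-way merge of already-sorted runs using a min-heap.
fn kway_merge(runs: Vec<Vec<i64>>) -> Vec<i64> {
    // Heap entries: (next value, run index, offset within that run)
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut merged = Vec::new();
    while let Some(Reverse((value, run_idx, offset))) = heap.pop() {
        merged.push(value);
        // Refill the heap with the next element from the run we just consumed
        if let Some(&next) = runs[run_idx].get(offset + 1) {
            heap.push(Reverse((next, run_idx, offset + 1)));
        }
    }
    merged
}
```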
Just some minor issues; the algorithm itself looks good to me in general. I can take a closer look at the details if needed (I'm not familiar with this part of the codebase yet, but I'll try my best to provide good review comments).
And some other thoughts:
- This is a pretty complicated program; maybe we should write some unit tests to make sure it doesn't break with future modifications?
- One idea to improve performance is to dynamically calculate the optimal merge degree based on file size and memory size, or maybe multi-thread the merge phase (not sure if it is feasible)
self.in_mem_sort_stream(self.metrics.baseline.intermediate())?;
debug!("SPM stream is constructed");
Clean up debug logs if they are not needed?
This is very cool -- thank you @2010YOUY01 and @rluvaton and @qstommyshu
I think in general (can do as a follow on PR) we will need to introduce some parallelism as well to really maximize performance.
Specifically, the merge operation is fundamentally single threaded (the hot loop in the merge). Thus I would expect merging one set of 10 files to likely be CPU bottlenecked.
So we probably would need to try and merge multiple sets of 10 files in parallel (to multiple different output files) before we become bottlenecked on either CPU or I/O throughput.
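A rough sketch of that idea, assuming hypothetical `SpillFile`, `MergeError`, and `merge_and_respill` helpers (this is not code from the PR): each group of files is merged into its own output file on a separate tokio task, so several merges can proceed concurrently:

```rust
use tokio::task::JoinSet;

// Hypothetical: merge each group of up to `max_degree` files on its own task.
async fn merge_groups_in_parallel(
    spill_files: Vec<SpillFile>,
    max_degree: usize,
) -> Result<Vec<SpillFile>, MergeError> {
    let mut tasks = JoinSet::new();
    for group in spill_files.chunks(max_degree) {
        let group = group.to_vec();
        // Each group produces one new (larger) spill file; the groups' CPU and
        // I/O work overlap instead of running one after another.
        tasks.spawn(async move { merge_and_respill(group).await });
    }

    let mut merged = Vec::new();
    while let Some(joined) = tasks.join_next().await {
        merged.push(joined.expect("merge task panicked")?);
    }
    Ok(merged)
}
```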
What I think would really help make progress in this area is a benchmark. I filed a ticket to track this issue:
Thank you all for the review! @qstommyshu I agree with the implementation-level feedback; I will address it in the refactor. @alamb Regarding parallel merging: I was thinking about this too. I think the next steps are
I'll try to do most of the testing and cover edge cases in integration tests at https://github.com/apache/datafusion/blob/main/datafusion/core/tests/fuzz_cases/sort_fuzz.rs and https://github.com/apache/datafusion/blob/main/datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs, instead of doing extensive UTs. I think we should promote tests to a higher level (SQL) when possible, because that API is much more stable and easier to manage. If a feature is tested mostly through unit tests, and someone later refactors the component away, those tests are likely to get lost, since they might assume the feature is already covered by integration tests. I first heard this idea in a talk by the DuckDB developers https://youtu.be/BgC79Zt2fPs?si=WiziGqJ8Dlz6-MMW
Yes, I totally agree that, when possible, SQL (or dataframe) is a better level to test at (and because it is the API that most users care about, not the internal details).
Benchmark results

Environment: MacBook Pro with M4 Pro chip (disk bandwidth is around 8000 MB/s)

Sorting 'thin' table
Main: 37s (merge ~170 spill files at once)

Sorting 'fat' table
Run benchmark command: `cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 7 --memory-limit 1.2G`

Notes:

Result
Main (1.2G):
Main (500M):
PR (1.2G):
PR (500M):
I have made the following updates:
It's ready for another look.
#[case(1)]
#[tokio::test]
async fn test_invalid_sort_max_spill_merge_degree(
    #[case] sort_max_spill_merge_degree: usize,
This `#[case]` syntax looks so elegant for writing repetitive tests.
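For reference, the `#[case]` attribute comes from the rstest crate; a tiny standalone example of the pattern (the names here are made up for illustration):

```rust
use rstest::rstest;

fn is_valid_merge_degree(degree: usize) -> bool {
    // A merge needs at least two inputs to make progress
    degree >= 2
}

// Each #[case] line expands into its own test with that argument filled in
#[rstest]
#[case(0)]
#[case(1)]
fn rejects_invalid_merge_degree(#[case] degree: usize) {
    assert!(!is_valid_merge_degree(degree));
}
```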
| datafusion.execution.skip_physical_aggregate_schema_check | false | When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. |
| datafusion.execution.sort_spill_reservation_bytes | 10485760 | Specifies the reserved memory for each spillable sort operation to facilitate an in-memory merge. When a sort operation spills to disk, the in-memory data must be sorted and merged before being written to a file. This setting reserves a specific amount of memory for that in-memory sort/merge process. Note: This setting is irrelevant if the sort operation cannot spill (i.e., if there's no `DiskManager` configured). |
| datafusion.execution.sort_in_place_threshold_bytes | 1048576 | When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. |
| datafusion.execution.sort_max_spill_merge_degree | 16 | When doing external sorting, the maximum number of spilled files to read back at once. Those read files in the same merge step will be sort-preserving-merged and re-spilled, and the step will be repeated to reduce the number of spilled files in multiple passes, until a final sorted run can be produced. |
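As a usage note (assumed, not taken from this PR's docs), the option could be set like any other `datafusion.execution.*` setting, for example via `SessionConfig`:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

// Assumption: the generic `set_usize` setter applies to this key like other
// execution options; lower the degree for a tight memory budget.
fn context_with_low_merge_degree() -> SessionContext {
    let config = SessionConfig::new()
        .set_usize("datafusion.execution.sort_max_spill_merge_degree", 8);
    SessionContext::new_with_config(config)
}
```

If the Rust setter is not available in a given version, the equivalent SQL statement `SET datafusion.execution.sort_max_spill_merge_degree = 8` should work as well.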
Great attention to detail, for updating the user guide!
Thanks for the clear explanation, that's a lot of great work, and it looks really cool!
I plan to re-review this tomorrow.
/// preserving-merged and re-spilled, and the step will be repeated to reduce
/// the number of spilled files in multiple passes, until a final sorted run
/// can be produced.
pub sort_max_spill_merge_degree: usize, default = 16
I have a concern about this: there can still be a memory issue if the batches from all the streams together are above the memory limit.
I have an implementation for this that is completely memory safe, and I will try to create a PR for it for inspiration.
The way to decide on the degree is actually to store, for each spill file, the largest amount of memory a single record batch has taken, and then when deciding on the degree, simply grow until you can no longer.
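If I understand correctly, that selection rule could look roughly like the sketch below (hypothetical names, not the #15700 implementation):

```rust
// Hypothetical sketch of the described rule: each spill file records the size
// of its largest batch, and the merge degree grows until the next file would
// no longer fit within the remaining memory budget.
struct SpillFileInfo {
    max_batch_bytes: usize,
}

fn pick_merge_degree(files: &[SpillFileInfo], memory_budget: usize) -> usize {
    let mut reserved = 0usize;
    let mut degree = 0usize;
    for file in files {
        if reserved + file.max_batch_bytes > memory_budget {
            break;
        }
        reserved += file.max_batch_bytes;
        degree += 1;
    }
    // A merge needs at least two inputs to make progress.
    degree.max(2)
}
```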
The reason why I'm picky about this is that it is a new configuration that will be hard to deprecate or change
@2010YOUY01 and @alamb I hope before you merge this PR to look at #15700 to see what I mean
> The reason why I'm picky about this is that it is a new configuration that will be hard to deprecate or change
This is a solid point. This option is intended to be manually set, and it has to ensure `(max_batch_size * per_partition_merge_degree * partition_count) < total_memory_limit`. If it's set correctly for a query, then the query should succeed.
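For a concrete (purely illustrative) example of that constraint: with 8 MB batches, 16 partitions, and a 1.2 GB memory limit, it gives 8 MB × degree × 16 < ~1228 MB, so the merge degree would need to be at most 9.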
The problem is the ever-growing number of configurations in DataFusion, and it seems impossible to set them all correctly. Enabling the parallel merging optimization would require introducing yet another configuration, and I'm also trying to avoid that (though the too-many-configs problem might be a harsh reality we must accept).
Maybe the description for #15700 might help.
Thank you for providing an alternative approach. I described my primary concern in #15700 (comment), I think it is not realistic to determine a batch's memory size after a spilling roundtrip due to the implementation complexity. In such cases, if the estimation is off by a factor of 2, the actual memory usage could also increase by a factor of 2, which is not ideal.
Thank you. Can you please take the fuzz test that I created in my PR and add it to yours, making sure it will pass (it will require you updating
Which issue does this PR close?
Rationale for this change
Background for memory-limited sort execution
See figures in https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/sorts/sort.rs
Current limitation for reading back spill files
Let's say the memory buffer can only hold 10 batches, but there are 900 spilled files: the current implementation will try to merge all 900 files at once and fail the query.
However, this scenario can still make progress if the operator merges only 10 files at a time and re-spills, until there are fewer than 10 spill files in total, and finally reads them back and merges them as the output.
High-level approach of this PR
1. Added one configuration option for the max spill merge degree (not implemented yet; in the POC it's a hard-coded `const MAX_SPILL_MERGE_DEGREE` for simplicity).
2. At the final stage of external sort, there are initial spill files to merge; perform a multi-pass read-merge-respill operation. The number of merged spill files going into the next pass is decided by the closest power of `MAX_SPILL_MERGE_DEGREE`.

Example:
- Initial spill files to merge: 900
- Max merge degree: 10
- Pass 1: merge 900 files into 100 files (the closest power of 10)
- Pass 2: 100 -> 10
- Pass 3: 10 -> final single output

Inside each pass, the number of files to merge in each step is split as evenly as possible while always staying <= the max merge degree; see details in the implementation (a sketch of this splitting follows below).
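A small sketch of one way to do that even split (illustrative only; the actual implementation in the PR may differ):

```rust
// Split `n` spill files into the fewest groups of size <= max_degree, keeping
// group sizes as even as possible (e.g. 12 files with degree 10 -> 6 + 6).
fn split_into_groups(n: usize, max_degree: usize) -> Vec<usize> {
    if n == 0 {
        return Vec::new();
    }
    let num_groups = (n + max_degree - 1) / max_degree; // ceiling division
    let base = n / num_groups;
    let remainder = n % num_groups;
    // The first `remainder` groups each take one extra file.
    (0..num_groups)
        .map(|g| if g < remainder { base + 1 } else { base })
        .collect()
}
```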
What changes are included in this PR?
Updated `sort.rs` for multi-pass spill reading. The entry point for the described logic is the function `merge_spilled_files_multi_pass()`.
Are these changes tested?
To be done. I think it's adding a varying `max_spill_merge_degree` to the sort fuzzer.
Are there any user-facing changes?
No