Is your feature request related to a problem or challenge?
There are many interesting ideas for improving how DataFusion behaves while spilling, for example #15271 from @2010YOUY01, among others.
What I think we really need next to make progress in this area is a benchmark / agreed-upon way of measuring our progress so that we can improve and
Describe the solution you'd like
I would like a documented command / set of commands that is:
- Easy to run (and thus fast to test / iterate on)
- Exercises the spilling feature at different levels of memory pressure
- Spends most of its time sorting/spilling/merging (not generating output for example)
Describe alternatives you've considered
Idea 1: Use some datafusion-cli features / flags and document them
Idea 2: Add a new suite to bench.sh / dfbench: https://github.com/apache/datafusion/tree/main/benchmarks
As for what to do, I suggest something relatively simple, such as sorting the TPCH lineitem table with memory limits of 200MB, 500MB, 1GB, 5GB, and 10GB.
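To make this concrete, here is a rough sketch of what an Idea 1 style driver could look like: a small Python script that runs one sort query through datafusion-cli at each of the memory limits above and records wall-clock time. Everything specific in it is an assumption rather than an agreed design: the `lineitem.parquet` path and the `l_orderkey` sort key are hypothetical, and I am assuming the `--memory-limit` and `--command` flags accept values in this form, so please check `datafusion-cli --help` for your version. It also does not yet address the "not generating output" criterion beyond discarding stdout.

```python
import shlex
import shutil
import subprocess
import time

# Memory limits suggested in this issue, written in the suffix form that
# datafusion-cli's --memory-limit flag is assumed to accept (e.g. '10g').
MEMORY_LIMITS = ["200m", "500m", "1g", "5g", "10g"]

# Hypothetical location of a TPCH lineitem file; adjust to your data.
LINEITEM_PATH = "lineitem.parquet"

# Hypothetical sort query; the sort key is an arbitrary choice.
SORT_QUERY = f"SELECT * FROM '{LINEITEM_PATH}' ORDER BY l_orderkey"


def build_commands(limits=MEMORY_LIMITS, query=SORT_QUERY):
    """Return one datafusion-cli argv list per memory limit."""
    return [
        ["datafusion-cli", "--memory-limit", limit, "--command", query]
        for limit in limits
    ]


def run_all(commands):
    """Run each command, returning (memory_limit, seconds, exit_code) tuples."""
    results = []
    for argv in commands:
        start = time.perf_counter()
        # Discard stdout so writing rows to the terminal is not timed.
        proc = subprocess.run(argv, stdout=subprocess.DEVNULL)
        results.append((argv[2], time.perf_counter() - start, proc.returncode))
    return results


if __name__ == "__main__":
    if shutil.which("datafusion-cli") is None:
        # Binary not installed: just print the commands that would be run.
        for argv in build_commands():
            print(shlex.join(argv))
    else:
        for limit, seconds, code in run_all(build_commands()):
            print(f"memory_limit={limit} elapsed={seconds:.2f}s exit={code}")
```

A non-zero exit code at the smaller limits would itself be useful signal, since it would show where the sort fails under memory pressure instead of spilling.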
Additional context
No response