Benchmark / program to test Spilling Sorts #15664

Open
@alamb

Is your feature request related to a problem or challenge?

There are many interesting ideas on how to improve DataFusion's behavior while spilling, for example #15271 from @2010YOUY01, among others.

What I think we really need next to make progress in this area is a benchmark / agreed upon way of measuring our progress so that we can improve and

Describe the solution you'd like

I would like a documented command / set of commands that is:

  1. Easy to run (and thus fast to test / iterate on)
  2. Exercises the spilling feature at different levels of memory pressure
  3. Spends most of its time sorting/spilling/merging (not generating output for example)

Describe alternatives you've considered

Idea 1: Use some existing datafusion-cli features / flags and document them

Idea 2: Add a new suite to bench.sh / dfbench: https://github.com/apache/datafusion/tree/main/benchmarks

As for what to benchmark, I suggest something relatively simple, such as sorting the TPCH lineitem table with 200MB, 500MB, 1GB, 5GB, and 10GB of memory.
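
To make the suggestion concrete, here is a rough sketch of what a driver along the lines of Idea 1 could look like: a small std-only Rust program that builds one `datafusion-cli` invocation per memory limit and times it. Several things here are assumptions rather than anything settled: the `lineitem.parquet` path, the sort key in `QUERY`, the lowercase size suffixes passed to `--memory-limit`, and the `--run` switch (which is just a convention of this sketch). By default it only prints the commands it would run.

```rust
use std::process::{Command, Stdio};
use std::time::Instant;

/// Memory limits from the suggestion above, written in the style of
/// datafusion-cli's `--memory-limit` flag (e.g. "10g").
/// Assumption: the "m" / "g" suffixes are accepted as written.
const MEMORY_LIMITS: [&str; 5] = ["200m", "500m", "1g", "5g", "10g"];

/// Hypothetical query: the file path and sort key are placeholders.
/// A real benchmark would need a query whose output is cheap to produce.
const QUERY: &str = "SELECT * FROM 'lineitem.parquet' ORDER BY l_orderkey";

/// Build the argument list for a single datafusion-cli run.
fn cli_args(memory_limit: &str, query: &str) -> Vec<String> {
    vec![
        "--memory-limit".to_string(),
        memory_limit.to_string(),
        "--command".to_string(),
        query.to_string(),
    ]
}

fn main() {
    // Dry run unless `--run` is passed (a switch local to this sketch).
    let run = std::env::args().any(|arg| arg == "--run");

    for limit in MEMORY_LIMITS {
        let args = cli_args(limit, QUERY);
        if !run {
            // Display only: the query is not shell-quoted here.
            println!("datafusion-cli {}", args.join(" "));
            continue;
        }

        let start = Instant::now();
        // Discard stdout so that printing rows does not dominate the timing.
        let status = Command::new("datafusion-cli")
            .args(&args)
            .stdout(Stdio::null())
            .status();

        match status {
            Ok(s) => println!(
                "{}: success={} elapsed={:?}",
                limit,
                s.success(),
                start.elapsed()
            ),
            Err(e) => eprintln!("{}: could not launch datafusion-cli: {}", limit, e),
        }
    }
}
```

Wall-clock time per process also includes startup and file scanning, so this is only a starting point; a failed run at a low limit is itself a useful data point about where spilling stops working.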

Additional context

No response
