Conversation

@CarterFendley

This PR is intended to start a discussion on how to re-use as much of the benchmark code as possible. Specifically, it proposes creating a function similar to run_query(...), as prototyped in this commit editing the dask groupby query logic:

import gc
import logging
import timeit

import dask.dataframe as dd

# Query and memory_usage are helpers from the benchmark's shared code (not shown here).
logger = logging.getLogger(__name__)

def run_query(
    data_name: str,
    in_rows: int,
    x: dd.DataFrame,
    query: Query,
    question: str,
    runs: int = 2,
):
    logger.info("Running query: '%s'" % question)
    try:
        for run in range(1, runs + 1):
            gc.collect()  # TODO: Able to do this in worker processes? Want to?

            # Calculate ans
            t_start = timeit.default_timer()
            ans = query.query(x)
            logger.debug("Answer shape: %s" % (ans.shape,))
            t = timeit.default_timer() - t_start
            m = memory_usage()

            # Calculate chk
            t_start = timeit.default_timer()
            chk = query.check(ans)
            chkt = timeit.default_timer() - t_start

            # ....

            if run == runs:
                # Print head / tail on last run
                logger.debug("Answer head:\n%s" % ans.head(3))
                logger.debug("Answer tail:\n%s" % ans.tail(3))
            del ans
    except Exception as err:
        logger.error("Query '%s' failed!" % question)
        print(err)
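
To make the intended call pattern concrete, here is a minimal usage sketch. It is not code from the commit: load_dataset and GroupbyQ1 are hypothetical placeholders for however the benchmark loads the data and defines each question, and the data name is only illustrative.

# Hypothetical usage sketch; load_dataset and GroupbyQ1 are placeholder names,
# not identifiers from the benchmark.
data_name = "G1_1e7_1e2_0_0"
x = load_dataset(data_name)

run_query(
    data_name=data_name,
    in_rows=10_000_000,
    x=x,
    query=GroupbyQ1(),
    question="sum v1 by id1",
    runs=2,
)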

There are a number of benefits to this approach:

  1. A more DRY approach.
  2. The ability to parameterize the number of runs per query (if you want to increase this in the future).
  3. Programmatic access to the query runner for extensions :)

In addition to the primary changes stated above, there are a few other changes:

  1. Exception handling during queries, so that one failing query does not abort the rest.
  2. Use of a Python logger instead of print statements.
  3. Moving the __main__ guard from a wrapper into the main file itself (removing groupby-dask2.py).

@CarterFendley
Author

@jangorecki @Tmonster I would like to solicit your feedback on these edits before I go too far down this path.

I fully intend to extend this approach to other dask (and Python) tasks, which will hopefully help there too. Please let me know your opinions so I can make sure the updates reflect the needs of the project as well ☺️

@Tmonster
Collaborator

Tmonster commented Feb 7, 2025

Thank you for the PR!

I'll have a look. I know we usually like to have the scripts formatted so that it is easy to run them by hand as plain scripts, but these changes seem close enough. dask has also been one of the more difficult solutions to get working, so I am happy to have someone improve it.

Let me get a PR up that fixes the regression tests, and then I'll take a closer look at everything.

@CarterFendley
Author

CarterFendley commented Feb 7, 2025

@Tmonster Awesome 😎 looking forward to your feedback!

I 100% understand the desire to run it like a script; I think that is a good idea, and it looks like solution.R and ./run.sh work well in that format. I tried to ensure that the script is still completely usable via the CLI, and it seems to work with solution.R just the same as before.

Happy to take feedback and then try to tackle join-dask.py as well!

@Tmonster
Collaborator

Tmonster commented Feb 7, 2025

@CarterFendley can you rebase and push again? The regression tests should run this time and not produce errors everywhere 😅

@CarterFendley
Author

@Tmonster Done, looks like it is waiting for approval to run the workflow

@jangorecki

The main reason against these kinds of ideas was, and for me personally still is, being able to copy-paste a script line by line into the console. That eliminates an extra surface where performance regressions might appear, if not now then eventually in the future. As I am not a maintainer anymore, it is not my decision.

@Tmonster
Collaborator

Looks good from my side. Eventually I'd like to get this back to a state where I can paste commands line by line, but it is already a step in the right direction for debugging and maintenance. In some future PR we can get it back to line-by-line debugging.

Tmonster merged commit ea8d1e9 into duckdblabs:main on Feb 10, 2025
15 checks passed
@CarterFendley
Author

Sounds good!

That's an interesting perspective, Jan. I definitely want to prevent regressions; I was hoping that a single run_query(...) block would reduce the surface area and lead to fewer regressions.

As I have mentioned, one of my main purposes is to be able to programmatically adjust things such as the runs parameter, to run queries 2, 10, or 25 times and get a full distribution of timings.
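
For illustration, a hypothetical sketch of what that could look like, assuming run_query were extended to return the per-run timings rather than only logging them (it does not do so in this PR):

import statistics

# Hypothetical: assumes run_query is extended to return per-run wall-clock times.
timings = run_query(
    data_name=data_name,
    in_rows=in_rows,
    x=x,
    query=query,
    question=question,
    runs=25,
)
print("median=%.3fs stdev=%.3fs" % (statistics.median(timings), statistics.stdev(timings)))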

I am happy to instrument code coverage measurements or other checks to help ensure there are no regressions, as you all see fit. I want to ensure the stability of the benchmark!
