Conversation

@CarterFendley

This PR is intended to start a discussion on how to re-use as much of the benchmark code as possible. Specifically, it proposes creating a function similar to run_query(...), as prototyped in this commit editing the dask groupby query logic:

import gc
import logging
import timeit

import dask.dataframe as dd

# Query and memory_usage are helpers from the benchmark's shared code (not shown here).
logger = logging.getLogger(__name__)

def run_query(
    data_name: str,
    in_rows: int,
    x: dd.DataFrame,
    query: Query,
    question: str,
    runs: int = 2,
):
    logger.info("Running query: '%s'" % question)
    try:
        for run in range(1, runs + 1):
            gc.collect()  # TODO: Able to do this in worker processes? Want to?

            # Calculate ans
            t_start = timeit.default_timer()
            ans = query.query(x)
            logger.debug("Answer shape: %s" % (ans.shape,))
            t = timeit.default_timer() - t_start
            m = memory_usage()

            # Calculate chk
            t_start = timeit.default_timer()
            chk = query.check(ans)
            chkt = timeit.default_timer() - t_start

            # ....

            if run == runs:
                # Print head / tail on last run
                logger.debug("Answer head:\n%s" % ans.head(3))
                logger.debug("Answer tail:\n%s" % ans.tail(3))
            del ans
    except Exception as err:
        logger.error("Query '%s' failed!" % question)
        print(err)
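
To make the intended call pattern concrete, here is a minimal usage sketch. It is not code from the commit: load_dataset and GroupbyQ1 are hypothetical placeholders for however the benchmark loads the data and defines each question, and the data name is only illustrative.

# Hypothetical usage sketch; load_dataset and GroupbyQ1 are placeholder names,
# not identifiers from the benchmark.
data_name = "G1_1e7_1e2_0_0"
x = load_dataset(data_name)

run_query(
    data_name=data_name,
    in_rows=10_000_000,
    x=x,
    query=GroupbyQ1(),
    question="sum v1 by id1",
    runs=2,
)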

There are a number of benefits to this approach:

  1. A more DRY approach.
  2. The ability to parameterize the number of runs per query (if you want to increase this in the future).
  3. Programmatic access to the query runner for extensions :)

In addition to the primary changes stated above, there are a few other changes:

  1. Exception handling during queries, so that one failing query does not abort the rest.
  2. Use of a Python logger instead of print statements.
  3. Moving the __main__ guard from a wrapper into the main file itself (removing groupby-dask2.py).

@CarterFendley
Author

@jangorecki @Tmonster I would like to solicit your feedback on these edits before I go too far down this path.

I fully intend to extend this approach to other dask (and Python) tasks, which will hopefully help there too. Please let me know your opinions so I can make sure the updates reflect the needs of the project as well ☺️

@Tmonster
Collaborator

Tmonster commented Feb 7, 2025

Thank you for the PR!

I'll have a look. I know we usually like to have the scripts formatted so that it is easy to run them by hand as plain scripts, but these changes seem close enough. dask has also been one of the more difficult solutions to get working, so I am happy to have someone improve it.

Let me get a PR up that fixes the regression tests, and then I'll take a closer look at everything.

@CarterFendley
Author

CarterFendley commented Feb 7, 2025

@Tmonster Awesome 😎 looking forward to your feedback!

I 100% understand the desire to run it like a script; I think that is a good idea, and it looks like solution.R and ./run.sh work well in that format. I tried to ensure that the script is still completely usable via the CLI, and it seems to work with solution.R just the same as before.

Happy to take feedback and then try to tackle join-dask.py as well!

@Tmonster
Collaborator

Tmonster commented Feb 7, 2025

@CarterFendley can you rebase and push again? The regression tests should run this time and not produce errors everywhere 😅

@CarterFendley
Author

@Tmonster Done, looks like it is waiting for approval to run the workflow

@jangorecki

The main reason against these kinds of ideas was, and for me personally still is, being able to copy-paste a script line by line into the console. That eliminates an extra surface where performance regressions might appear, if not now then eventually in the future. As I am not a maintainer anymore, it is not my decision.

@Tmonster
Collaborator

Looks good from my side. Eventually I'd like to get this back to a state where I can paste commands line by line, but it is already a step in the right direction for debugging and maintenance. In some future PR we can get it back to line-by-line debugging.

Tmonster merged commit ea8d1e9 into duckdblabs:main on Feb 10, 2025
15 checks passed
@CarterFendley
Author

Sounds good!

That's an interesting perspective, Jan. I definitely want to prevent regressions; I was hoping that a single run_query(...) block would reduce the surface area and lead to fewer regressions.

As I have mentioned, one of my main purposes is to be able to programmatically adjust things such as the runs parameter, to run queries 2, 10, or 25 times and get a full distribution of timings.
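
For illustration, a hypothetical sketch of what that could look like, assuming run_query were extended to return the per-run timings rather than only logging them (it does not do so in this PR):

import statistics

# Hypothetical: assumes run_query is extended to return per-run wall-clock times.
timings = run_query(
    data_name=data_name,
    in_rows=in_rows,
    x=x,
    query=query,
    question=question,
    runs=25,
)
print("median=%.3fs stdev=%.3fs" % (statistics.median(timings), statistics.stdev(timings)))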

I am happy to instrument code coverage measurements or other checks to help ensure there are no regressions, as you all see fit. I want to ensure the stability of the benchmark!
