Remove the file name from the output in cudf-polars' explain APIs #18752

Matt711 · 2025-05-10T13:47:39Z

Description

Follow up to #18708 that addresses this comment.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora · 2025-05-12T13:20:51Z

Thanks for looking into this, @Matt711!

Unfortunately, this doesn't really address the "problem" I am seeing with large multi-file datasets. Even if we print the name of the file, I still see patterns like this:

    SELECT ('________________________________3',) [1]
      REPARTITION ('________________________________2',) [1]
        SELECT ('________________________________2',) [200]
          HSTACK ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate', '__POLARS_CSER_0x50629c41af563bce') [200]
            PROJECTION ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate') [200]
              JOIN Inner ('l_partkey',) ('p_partkey',) ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate', 'p_type') [200]
                UNION ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [200]
                  SCAN PARQUET lineitem_0002072d-7283-43ae-b645-b26640318053.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_00da60cf-7b34-4715-800a-5032bc9cb829.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_02997ef6-3e9b-4ab8-883c-ffc32da0d133.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_03e96325-223f-4e27-9149-d37658f052fa.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_04a90cbd-2d52-48f8-9a86-e8b44e2888ad.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  ... THIS REPEATS 195 more times (with slightly different file names) ...

rjzamora · 2025-05-12T13:31:14Z

My weakly-held opinion is that we get enough information from Scan.typ and the schema if we do something like:

    @_repr_ir.register
    def _(ir: Scan, *, offset: str = "") -> str:
        label = f"SCAN {ir.typ.upper()}"
        return _repr_header(offset, label, ir.schema)

This way, we end up with patterns like:

    SELECT ('________________________________3',) [1]
      REPARTITION ('________________________________2',) [1]
        SELECT ('________________________________2',) [200]
          HSTACK ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate', '__POLARS_CSER_0x50629c41af563bce') [200]
            PROJECTION ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate') [200]
              JOIN Inner ('l_partkey',) ('p_partkey',) ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate', 'p_type') [200]
                UNION ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [200]
                  SCAN PARQUET ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  (repeated 200 times)
                UNION ('p_partkey', 'p_type') [2]
                  SCAN PARQUET ('p_partkey', 'p_type') [1]
                  (repeated 2 times)

Matt711 · 2025-05-12T13:39:32Z

UNION ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [200]
SCAN PARQUET lineitem_0002072d-7283-43ae-b645-b26640318053.parquet ... ('l_partkey', 'l_extendedprice', 'l_discoun

Thanks for the printout! I see, yeah, I'll remove the file name and suffix.

Matt711 · 2025-05-12T13:46:15Z

My weakly-held opinion is that we get enough information from Scan.typ and the schema if we do something like:

    @_repr_ir.register
    def _(ir: Scan, *, offset: str = "") -> str:
        label = f"SCAN {ir.typ.upper()}"
        return _repr_header(offset, label, ir.schema)

This way, we end up with patterns like:

    SELECT ('________________________________3',) [1]
      REPARTITION ('________________________________2',) [1]
        SELECT ('________________________________2',) [200]
          HSTACK ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate', '__POLARS_CSER_0x50629c41af563bce') [200]
            PROJECTION ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate') [200]
              JOIN Inner ('l_partkey',) ('p_partkey',) ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate', 'p_type') [200]
                UNION ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [200]
                  SCAN PARQUET ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  (repeated 200 times)
                UNION ('p_partkey', 'p_type') [2]
                  SCAN PARQUET ('p_partkey', 'p_type') [1]
                  (repeated 2 times)

Yeah, I think for the purposes of PDSH, the schema will be enough to distinguish the scans. Maybe I can follow up and add a show_paths kwarg (defaulting to False)?

rjzamora · 2025-05-12T13:56:46Z

Yeah, I think for the purposes of PDSH, the schema will be enough to distinguish the scans. Maybe I can follow up and add a show_paths kwarg (defaulting to False)?

Yeah - I agree we may end up with applications that benefit from the path information, but you're right that we don't need this to understand PDSH yet. I have a feeling we can capture this information without any optional arguments, but I'm not entirely sure yet (maybe we go back to special handling for UNION?).
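
For reference, the opt-in variant floated above could look roughly like the sketch below. Both `scan_label` and `show_paths` are hypothetical names used only for illustration; nothing here is existing cudf-polars API, and the "first file name plus ..." format simply mirrors the earlier printout.

```python
from pathlib import PurePath


def scan_label(typ: str, paths: list[str], *, show_paths: bool = False) -> str:
    """Build a SCAN label; hypothetical helper, not cudf-polars API."""
    label = f"SCAN {typ.upper()}"
    if show_paths and paths:
        # Show only the first file name; mark that more paths exist.
        first = PurePath(paths[0]).name
        suffix = " ..." if len(paths) > 1 else ""
        label = f"{label} {first}{suffix}"
    return label


paths = ["/data/lineitem_a.parquet", "/data/lineitem_b.parquet"]
print(scan_label("parquet", paths))
print(scan_label("parquet", paths, show_paths=True))
```

With the default, every scan of the same type and schema renders identically, so repeated lines can still be collapsed; opting in trades that compactness for path information.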

rjzamora

You probably need to adjust test_explain_logical_plan_wide_table_with_scan for CI to pass, but LGTM once CI is green. Thanks!

Matt711 · 2025-05-12T14:21:17Z

You probably need to adjust test_explain_logical_plan_wide_table_with_scan for CI to pass, but LGTM once CI is green. Thanks!

Thanks!

Matt711 · 2025-05-12T19:05:45Z

/merge

Include the name of the first path only in cudf-polars explain API

f4c306c

Matt711 requested a review from a team as a code owner May 10, 2025 13:47

Matt711 added the improvement Improvement / enhancement to an existing function label May 10, 2025

Matt711 added this to cuDF Python May 10, 2025

Matt711 added the non-breaking Non-breaking change label May 10, 2025

Matt711 requested review from wence- and rjzamora May 10, 2025 13:47

github-actions bot assigned Matt711 May 10, 2025

github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels May 10, 2025

GPUtester moved this to In Progress in cuDF Python May 10, 2025

Matt711 mentioned this pull request May 10, 2025

Extend explain_query to support printing the logical plan (pre lowered plan) #18708

Merged

3 tasks

Remove the file name

fb9e2c7

Matt711 changed the title ~~Include the name of the first path only in cudf-polars explain APIs~~ Remove the file name from the output in cudf-polars' explain APIs May 12, 2025

rjzamora approved these changes May 12, 2025

fix test

3cba4c7

Merge branch 'branch-25.06' into imp/polars/streaming/explain-scan-include-first-path-only

204a48c

github-actions bot assigned rjzamora May 12, 2025

Merge branch 'branch-25.06' into imp/polars/streaming/explain-scan-include-first-path-only

5c5cc28

rapids-bot bot merged commit 0213864 into rapidsai:branch-25.06 May 12, 2025
123 checks passed

github-project-automation bot moved this from In Progress to Done in cuDF Python May 12, 2025
