Remove the file name from the output in cudf-polars' explain APIs #18752

Contributor

@Matt711 Matt711 commented May 10, 2025

Description

Follow up to #18708 that addresses this comment.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 requested a review from a team as a code owner May 10, 2025 13:47
@Matt711 Matt711 added the improvement Improvement / enhancement to an existing function label May 10, 2025
@Matt711 Matt711 added the non-breaking Non-breaking change label May 10, 2025
@Matt711 Matt711 requested review from wence- and rjzamora May 10, 2025 13:47
@github-actions github-actions bot added Python Affects Python cuDF API. cudf.polars Issues specific to cudf.polars labels May 10, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python May 10, 2025
@rjzamora
Member

Thanks for looking into this @Matt711 !

Unfortunately, this doesn't really address the "problem" I am seeing with large multi-file datasets. Even if we print the name of the file, I still see patterns like this:

    SELECT ('________________________________3',) [1]
      REPARTITION ('________________________________2',) [1]
        SELECT ('________________________________2',) [200]
          HSTACK ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate', '__POLARS_CSER_0x50629c41af563bce') [200]
            PROJECTION ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate') [200]
              JOIN Inner ('l_partkey',) ('p_partkey',) ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate', 'p_type') [200]
                UNION ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [200]
                  SCAN PARQUET lineitem_0002072d-7283-43ae-b645-b26640318053.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_00da60cf-7b34-4715-800a-5032bc9cb829.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_02997ef6-3e9b-4ab8-883c-ffc32da0d133.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_03e96325-223f-4e27-9149-d37658f052fa.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  SCAN PARQUET lineitem_04a90cbd-2d52-48f8-9a86-e8b44e2888ad.parquet ... ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  ... THIS REPEATS 195 more times (with slightly different file names) ...

@rjzamora
Member

My weakly-held opinion is that we get enough information from Scan.typ and the schema if we do something like:

    @_repr_ir.register
    def _(ir: Scan, *, offset: str = "") -> str:
        label = f"SCAN {ir.typ.upper()}"
        return _repr_header(offset, label, ir.schema)

This way, we end up with patterns like:

    SELECT ('________________________________3',) [1]
      REPARTITION ('________________________________2',) [1]
        SELECT ('________________________________2',) [200]
          HSTACK ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate', '__POLARS_CSER_0x50629c41af563bce') [200]
            PROJECTION ('l_partkey', 'l_extendedprice', 'l_discount', 'p_type', 'l_shipdate') [200]
              JOIN Inner ('l_partkey',) ('p_partkey',) ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate', 'p_type') [200]
                UNION ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [200]
                  SCAN PARQUET ('l_partkey', 'l_extendedprice', 'l_discount', 'l_shipdate') [1]
                  (repeated 200 times)
                UNION ('p_partkey', 'p_type') [2]
                  SCAN PARQUET ('p_partkey', 'p_type') [1]
                  (repeated 2 times)
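
For readers without the cudf-polars source at hand, here is a self-contained sketch of the idea above. `Scan`, `repr_header`, `repr_ir`, and `collapse_repeats` are hypothetical stand-ins, not the library's own classes or helpers, and how the real explain output produces the `(repeated N times)` lines is an assumption.

```python
from dataclasses import dataclass
from functools import singledispatch
from itertools import groupby


@dataclass
class Scan:
    """Hypothetical stand-in for a scan node (not the cudf-polars class)."""

    typ: str
    paths: list
    schema: dict


def repr_header(offset: str, label: str, schema: dict) -> str:
    """Hypothetical formatter: indentation, label, then the column names."""
    return f"{offset}{label} {tuple(schema)}"


@singledispatch
def repr_ir(ir, *, offset: str = "") -> str:
    raise NotImplementedError(f"no repr registered for {type(ir).__name__}")


@repr_ir.register(Scan)
def _(ir: Scan, *, offset: str = "") -> str:
    # The file paths are deliberately left out: only the type and schema show.
    label = f"SCAN {ir.typ.upper()}"
    return repr_header(offset, label, ir.schema)


def collapse_repeats(lines: list) -> list:
    """Fold runs of identical consecutive lines into one line plus a count."""
    out = []
    for line, group in groupby(lines):
        count = sum(1 for _ in group)
        out.append(line)
        if count > 1:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}(repeated {count} times)")
    return out
```

Once the file name is dropped, every per-file scan over the same schema renders identically, which is what makes a fold like `collapse_repeats` possible in the first place.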

@Matt711
Contributor Author

Matt711 commented May 12, 2025

Thanks for the printout! I see, yeah, I'll remove the file name and suffix.

@Matt711
Contributor Author

Matt711 commented May 12, 2025

Yeah, I think for the purposes of PDSH, the schema will be enough to distinguish the scans. Maybe I can follow up and add a show_paths kwarg (defaulting to False)?
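
If a show_paths option were added, it might look roughly like the sketch below. `format_scan` and its parameters are hypothetical; the real API may take a different shape or never be added.

```python
def format_scan(typ, paths, columns, *, show_paths=False):
    """Hypothetical scan formatter with an opt-in path display."""
    label = f"SCAN {typ.upper()}"
    if show_paths and paths:
        # Show only the first path; mark that more were elided.
        label += f" {paths[0]}"
        if len(paths) > 1:
            label += " ..."
    return f"{label} {tuple(columns)}"
```

With the default, the output contains no paths at all, so plans over many files stay compact unless the caller opts in.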

@Matt711 Matt711 changed the title Include the name of the first path only in cudf-polars explain APIs Remove the file name from the output in cudf-polars' explain APIs May 12, 2025
@rjzamora
Member

Yeah - I agree we may end up with applications that benefit from the path information, but you're right that we don't need this to understand PDSH yet. I have a feeling we can capture this information without any optional arguments, but I'm not entirely sure yet (maybe we go back to special handling for UNION?).
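
One possible reading of "special handling for UNION", sketched under heavy assumptions: rather than printing a path per scan, the union could report its children's paths once, as a count plus a shared directory. `summarize_paths` is a hypothetical helper, not something that exists in cudf-polars.

```python
import posixpath


def summarize_paths(paths):
    """Hypothetical one-line summary of the files feeding a union."""
    if not paths:
        return "0 files"
    if len(paths) == 1:
        return paths[0]
    try:
        root = posixpath.commonpath(paths)
    except ValueError:
        # Raised when absolute and relative paths are mixed.
        root = ""
    return f"{len(paths)} files under {root or '<mixed roots>'}"
```

This keeps path information available with no extra keyword argument, at the cost of losing the individual file names.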

Member

@rjzamora rjzamora left a comment

You probably need to adjust test_explain_logical_plan_wide_table_with_scan for CI to pass, but LGTM once CI is green. Thanks!

@Matt711
Contributor Author

Matt711 commented May 12, 2025

Thanks!

@Matt711
Contributor Author

Matt711 commented May 12, 2025

/merge

@rapids-bot rapids-bot bot merged commit 0213864 into rapidsai:branch-25.06 May 12, 2025
123 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python May 12, 2025
Labels
  • cudf.polars: Issues specific to cudf.polars
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.
Projects
Status: Done
2 participants