[WIP][SPARK-56608][PYTHON] Migrate grouped/cogrouped map Arrow UDF verify checks into enforce_schema#55530

Draft
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56608

@Yicong-Huang Yicong-Huang commented Apr 24, 2026

What changes were proposed in this pull request?

Make ArrowBatchTransformer.enforce_schema the single entry point for Arrow UDF output schema enforcement, and migrate the grouped/cogrouped map Arrow UDF paths (SQL_COGROUPED_MAP_ARROW_UDF, SQL_GROUPED_MAP_ARROW_UDF, SQL_GROUPED_MAP_ARROW_ITER_UDF) to use it instead of the separate verify_arrow_table / verify_arrow_batch helpers plus manual reorder.

enforce_schema is generalized to:

  • Accept both pa.RecordBatch and pa.Table.
  • Add reorder_by_name: bool = True. When True, columns are matched by name, then reordered and renamed to the target schema (raising RESULT_COLUMN_NAMES_MISMATCH on a name mismatch). When False, columns are matched positionally and the input names are preserved.
  • Collect all missing/extra/type mismatches before raising (previously it raised on the first mismatch).
  • Raise PySparkRuntimeError with existing errorClasses (RESULT_COLUMN_NAMES_MISMATCH / RESULT_COLUMN_TYPES_MISMATCH / RESULT_COLUMN_SCHEMA_MISMATCH) instead of bare-string PySparkTypeError, matching what verify_arrow_result already did.
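
The behavior listed above can be illustrated with a minimal, dependency-free sketch. This is not the code in this PR: schemas are modeled as lists of (name, type) tuples rather than pa.RecordBatch / pa.Table, `SchemaMismatchError` is a hypothetical stand-in for PySparkRuntimeError, and the mapping of count mismatches to RESULT_COLUMN_SCHEMA_MISMATCH and type mismatches to RESULT_COLUMN_TYPES_MISMATCH is an assumption. It only shows the two matching modes and the collect-then-raise pattern.

```python
from typing import List, Sequence, Tuple

# Stand-in for an Arrow field: (column name, type name). Assumption for illustration.
Field = Tuple[str, str]


class SchemaMismatchError(RuntimeError):
    """Hypothetical stand-in for PySparkRuntimeError carrying an error class."""

    def __init__(self, error_class: str, details: List[str]):
        super().__init__(f"[{error_class}] " + "; ".join(details))
        self.error_class = error_class
        self.details = details


def enforce_schema_sketch(
    actual: Sequence[Field],
    expected: Sequence[Field],
    reorder_by_name: bool = True,
) -> List[Field]:
    """Return the actual fields in the order implied by `expected`, or raise."""
    if reorder_by_name:
        # Name-based matching: gather every missing and extra column first.
        actual_by_name = dict(actual)
        expected_names = [name for name, _ in expected]
        expected_set = set(expected_names)
        missing = [n for n in expected_names if n not in actual_by_name]
        extra = [n for n, _ in actual if n not in expected_set]
        if missing or extra:
            raise SchemaMismatchError(
                "RESULT_COLUMN_NAMES_MISMATCH",
                [f"missing: {missing}", f"extra: {extra}"],
            )
        ordered = [(n, actual_by_name[n]) for n in expected_names]
    else:
        # Positional matching: only the column count must agree; names are kept.
        if len(actual) != len(expected):
            raise SchemaMismatchError(
                "RESULT_COLUMN_SCHEMA_MISMATCH",  # assumed class for count mismatch
                [f"expected {len(expected)} columns, got {len(actual)}"],
            )
        ordered = list(actual)

    # Collect all type mismatches before raising, instead of stopping at the first.
    type_errors = [
        f"column '{name}' (expected {exp_type}, got {act_type})"
        for (name, act_type), (_, exp_type) in zip(ordered, expected)
        if act_type != exp_type
    ]
    if type_errors:
        raise SchemaMismatchError("RESULT_COLUMN_TYPES_MISMATCH", type_errors)
    return ordered
```

With name-based matching, an output of `[("v", "double"), ("id", "int64")]` checked against `[("id", "int64"), ("v", "double")]` comes back reordered. With `reorder_by_name=False`, the same check passes only if the types line up positionally, and the input names are returned unchanged.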

verify_arrow_table and verify_arrow_batch are deleted; their pa.Table / pa.RecordBatch instance check is inlined at the call site. verify_arrow_result remains only for SQL_ARROW_TABLE_UDF (out of scope for this PR — no benchmark yet).

Why are the changes needed?

Part of SPARK-55388 (Refactor PythonEvalType processing logic). Today output validation is split between verify_arrow_result (friendly errorClass errors) in worker.py and enforce_schema (bare f-string errors) in conversion.py. Consolidating behind enforce_schema gives one code path and one error convention, and drops the redundant "verify + manual reorder" in grouped-map paths.

Does this PR introduce any user-facing change?

Yes, minor: error messages for SQL_ARROW_UDTF (the pre-existing enforce_schema consumer) switch from bare f-strings to the same friendly errorClass-templated format already used by other Arrow UDFs. Error-class names and message formats for grouped/cogrouped map Arrow UDFs are unchanged.

How was this patch tested?

  • Existing integration tests in test_arrow_grouped_map.py / test_arrow_cogrouped_map.py already assert the errorClass-templated error format and pass unchanged.
  • Unit tests in test_conversion.py updated and extended for the new reorder_by_name, pa.Table input, and count-mismatch paths.
  • test_arrow_udtf.py regex updated for the two SQL_ARROW_UDTF error tests.
  • ASV benchmarks on CogroupedMapArrowUDFTimeBench, GroupedMapArrowUDFTimeBench, and GroupedMapArrowIterUDFTimeBench (repeat=3) vs upstream master: 52 parameter combinations, 0 regressions at -f 1.05.
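
For readers unfamiliar with the `-f 1.05` threshold, the sketch below shows one way to read it, assuming the factor is applied to the ratio of the new timing to the baseline timing (a result counts as a regression only when it is more than 5% slower). The function name and the symmetric improvement rule are assumptions for illustration, not ASV's implementation.

```python
def classify_timing(before_s: float, after_s: float, factor: float = 1.05) -> str:
    """Classify a benchmark result by the ratio after/before (hypothetical helper)."""
    ratio = after_s / before_s
    if ratio > factor:
        return "regression"  # slower than baseline by more than the factor
    if ratio < 1.0 / factor:
        return "improvement"  # faster than baseline by more than the factor
    return "unchanged"  # within the noise band around 1.0
```

Under this reading, "0 regressions at -f 1.05" means none of the 52 parameter combinations had a ratio above 1.05.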

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang Yicong-Huang marked this pull request as draft April 24, 2026 07:59
@Yicong-Huang Yicong-Huang changed the title from "[SPARK-56608][PYTHON] Migrate grouped/cogrouped map Arrow UDF verify checks into enforce_schema" to "[WIP][SPARK-56608][PYTHON] Migrate grouped/cogrouped map Arrow UDF verify checks into enforce_schema" on Apr 24, 2026