[WIP][SPARK-56608][PYTHON] Migrate grouped/cogrouped map Arrow UDF verify checks into enforce_schema #55530
Draft
Yicong-Huang wants to merge 1 commit into apache:master
What changes were proposed in this pull request?
Make `ArrowBatchTransformer.enforce_schema` the single entry point for Arrow UDF output schema enforcement, and migrate the grouped/cogrouped map Arrow UDF paths (`SQL_COGROUPED_MAP_ARROW_UDF`, `SQL_GROUPED_MAP_ARROW_UDF`, `SQL_GROUPED_MAP_ARROW_ITER_UDF`) to use it instead of the separate `verify_arrow_table`/`verify_arrow_batch` helpers plus a manual reorder.

`enforce_schema` is generalized to:
- accept both `pa.RecordBatch` and `pa.Table`;
- take `reorder_by_name: bool = True`: name-based matching + reorder + rename (with `RESULT_COLUMN_NAMES_MISMATCH`) vs. positional matching that preserves the input names;
- raise `PySparkRuntimeError` with the existing `errorClass`es (`RESULT_COLUMN_NAMES_MISMATCH`/`RESULT_COLUMN_TYPES_MISMATCH`/`RESULT_COLUMN_SCHEMA_MISMATCH`) instead of a bare-string `PySparkTypeError`, matching what `verify_arrow_result` already did.

`verify_arrow_table` and `verify_arrow_batch` are deleted; their `pa.Table`/`pa.RecordBatch` instance check is inlined at the call site. `verify_arrow_result` remains only for `SQL_ARROW_TABLE_UDF` (out of scope for this PR: no benchmark yet).

Why are the changes needed?
Part of SPARK-55388 (Refactor PythonEvalType processing logic). Today, output validation is split between `verify_arrow_result` (friendly `errorClass` errors) in `worker.py` and `enforce_schema` (bare f-string errors) in `conversion.py`. Consolidating behind `enforce_schema` gives one code path and one error convention, and drops the redundant "verify + manual reorder" step in the grouped-map paths.

Does this PR introduce any user-facing change?
Yes, minor: error messages for `SQL_ARROW_UDTF` (the pre-existing `enforce_schema` consumer) switch from bare f-strings to the same friendly `errorClass`-templated format already used by the other Arrow UDFs. Error-class names and message formats for grouped/cogrouped map Arrow UDFs are unchanged.

How was this patch tested?
- `test_arrow_grouped_map.py`/`test_arrow_cogrouped_map.py` already assert the `errorClass`-templated error format and pass unchanged.
- `test_conversion.py` updated and extended for the new `reorder_by_name`, `pa.Table` input, and count-mismatch paths.
- `test_arrow_udtf.py` regex updated for the two `SQL_ARROW_UDTF` error tests.
- Benchmarked `CogroupedMapArrowUDFTimeBench`, `GroupedMapArrowUDFTimeBench`, and `GroupedMapArrowIterUDFTimeBench` (`repeat=3`) vs. upstream master: 52 parameter combinations, 0 regressions at `-f 1.05`.

Was this patch authored or co-authored using generative AI tooling?
No.