fix: RESPECT NULLS for Spark collect_list function#16933
fix: RESPECT NULLS for Spark collect_list function#16933yaooqinn wants to merge 2 commits intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Please move them to a separate PR
12a2fca to
8aae7ec
Compare
|
Rebased on latest main — removed the unrelated cudf changes from the PR diff. Thanks @jinchengchenghh! |
| std::vector<DecodedVector> inputDecoded_; | ||
| DecodedVector intermediateDecoded_; | ||
|
|
||
| protected: |
There was a problem hiding this comment.
The member order should be protected and then private
| @@ -48,8 +49,9 @@ class CollectListAggregate { | |||
There was a problem hiding this comment.
Remove the initialize function
Remove kSparkCollectListIgnoreNulls QueryConfig and replace with a constant boolean argument, matching the collect_set pattern from facebookincubator#16416. Closes facebookincubator#16839 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
8aae7ec to
a9de448
Compare
|
Addressed comments:
All 8 tests pass. Thanks @jinchengchenghh! |
rui-mo
left a comment
There was a problem hiding this comment.
Hi @yaooqinn, could you please open a corresponding PR for collect_set and collect_list in Gluten to verify their functionality? We are currently noticing the fallback issue below, and I suppose compatibility code needs to be added due to the signature change.
E20260323 21:52:57.248056 1774 Exceptions.h:87] Line: /work/cpp/velox/substrait/SubstraitToVeloxPlan.cc:290, Function:toAggregationFunctionName, Expression: signatures.has_value() && signatures.value().size() > 0 Cannot find function signature for collect_set_merge_extract_array_row_VARCHAR_BIGINT_BIGINT_endrow in final aggregation step., Source: RUNTIME, ErrorCode: INVALID_STATE
21:52:57.251 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: SortAggregate[QueryId=405], due to:
- Native validation failed:
|- Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1450 function:validate, thrown from file:SubstraitToVeloxPlan.cc line:290 function:toAggregationFunctionName, reason:Cannot find function signature for collect_set_merge_extract_array_row_VARCHAR_BIGINT_BIGINT_endrow in final aggregation step.
|
Created Gluten PR apache/gluten#11837 to test compatibility. It adds |
zhli1142015
left a comment
There was a problem hiding this comment.
Nice clean follow-up to the collect_set RESPECT NULLS pattern (#16416). The approach is consistent and the config removal is thorough. A few minor items — mainly a stale comment referencing the removed config, and a suggestion about the fn_ visibility change in SimpleAggregateAdapter.
| @@ -44,14 +45,6 @@ class CollectListAggregate { | |||
| // aggregation uses the accumulator path, which correctly respects the config. | |||
There was a problem hiding this comment.
This comment references "config" twice, but ignoreNulls_ is no longer read from QueryConfig — it now comes from the constant boolean argument via setConstantInputs(). Please update the wording, e.g.:
// NOTE: toIntermediate() was intentionally removed because it is static and
// cannot access the runtime ignoreNulls_ flag. Without it, partial
// aggregation uses the accumulator path, which correctly respects the flag.| // Velox registers a 2-arg collect_set(T, boolean) signature that Spark | ||
| // doesn't support. The fuzzer may pick this signature and fail. | ||
| "collect_set", | ||
| // Same as collect_set — 2-arg signature not supported by Spark. |
There was a problem hiding this comment.
Nit: "2-arg signature not supported by Spark" is slightly misleading — Spark 4.0+ does support RESPECT NULLS / IGNORE NULLS for collect_list (SPARK-55256). The real reason for skipping is that the fuzzer can't generate the constant boolean argument. Consider:
// Fuzzer may pick the 2-arg (T, boolean) signature which requires
// a constant boolean that the fuzzer cannot generate.
"collect_list",Same applies to the collect_set comment above.
| } | ||
| } | ||
|
|
||
| protected: |
There was a problem hiding this comment.
Making fn_ protected exposes the raw unique_ptr to all SimpleAggregateAdapter subclasses across the codebase. Currently only CollectListAdapter needs it. Would a protected accessor like FUNC& fn() { return *fn_; } be a tighter API contract? That way subclasses can access the function object without being able to reset/move the unique_ptr itself.
Not a blocker — just a suggestion for encapsulation.
| {"spark_collect_list(c0)"}, | ||
| expectedResult, | ||
| makeConfig(false)); | ||
| testAggregations({input}, {}, {"spark_collect_list(c0, false)"}, {expected}); |
There was a problem hiding this comment.
Consider adding a test that verifies the constant boolean false (RESPECT NULLS) works correctly through partial → intermediate → final aggregation stages. testAggregations() does cover multiple modes internally, but an explicit streaming/split test would increase confidence that setConstantInputs() propagates correctly across stages.
Build Impact AnalysisSelective Build Targets (building these covers all 407 affected)Total affected: 407/556 targets
Affected targets (407)Directly changed (296)
Transitively affected (111)
Fast path • Graph from main@4a966b2effd240f9d9f43e0b9305c0bdd7cd6b39 |
Summary
Follow-up to #16416 (collect_set RESPECT NULLS). Applies the same pattern to
collect_list:kSparkCollectListIgnoreNullsQueryConfigcollect_list(x [, ignoreNulls])signature with constant booleansetConstantInputsviaCollectListAdaptersubclass ofSimpleAggregateAdapterfn_protected inSimpleAggregateAdapterto enable subclass accessChanges
CollectListAggregate.cpp: Replace config-based with constant-arg approachQueryConfig.h: RemovekSparkCollectListIgnoreNullsSimpleAggregateAdapter.h: Makefn_protectedaggregate.rst: Update doc formatSparkAggregationFuzzerTest.cpp: Add to skipFunctionsTesting
All 8 CollectList tests pass.
Closes #16839