Skip to content

Conversation

@peter-toth
Copy link
Contributor

What changes were proposed in this pull request?

This PR extracts the plan merging logic from MergeScalarSubqueries to PlanMerger so as to other rules can reuse it.

While the plan merging logic is extracted without modification to PlanMerger, MergeScalarSubqueries required a significant adjustment. This is because SPARK-40618 / #38093 added subquery reference tracking so as to avoid trying to merge a subquery to any of its nested subqueries. This kind of reference trancking doesn't work well with a general PlanMerger so this PR modifies MergeScalarSubqueries to use a separate PlanMergers by each subquery level.

Why are the changes needed?

To be able to reuse plan merging logic.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

Was this patch authored or co-authored using generative AI tooling?

Yes, Claude gave me suggestions to improve documentation.

@github-actions github-actions bot added the SQL label Nov 2, 2025
* - outputMap maps newPlan's attributes to mergedPlan's attributes
* Returns None if plans cannot be merged.
*/
private def tryMergePlans(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the first glance, this looks identical in terms of the code. Am I right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tryMergePlans(), mapAttributes(), mergeNamedExpressions() and supportedAggregateMerge() are just copied without any changes.

if (mergedPlan.merged) {
CTERelationDef(
createProject(header.plan.output, removeReferences(header.plan, cache)),
Project(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why this PR proposes to remove the existing createProject method and use Project(...) directly here? Is this better in a way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current version of the rule createProject() is a small helper function and it is used only here, so I felt it doesn't need to be a separate function.


final override val nodePatterns: Seq[TreePattern] = Seq(SCALAR_SUBQUERY_REFERENCE)

override def stringArgs: Iterator[Any] = Iterator(subqueryIndex, headerIndex, dataType, exprId.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why we don't need to override stringArgs at all after this PR? Is this related to this PR's contribution or this was not used before this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScalarSubqueryReference never shows up in any stringified plan. It's just a temporary object that is added and then removed when MergeScalarSubqueries runs so I think we don't need this.


// Caching returns the index of the subquery in the cache and the index of scalar member in the
// "Header".
private def cacheSubquery(plan: LogicalPlan, cache: ArrayBuffer[Header]): (Int, Int) = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method removed simply?

Copy link
Contributor Author

@peter-toth peter-toth Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Briefly, cacheSubquery() became PlanMerger.merge() with some modifications.

As PlanMerger.merge() is kind of an internal API now, which can be used by other rules. It returns MergeResult instead of the tuple after this refactor.
The Header objects, which contined the information about the merged plans became MergedPlans and can be accessed as PlanMerger.mergedPlans() by the rules.

The main difference between cacheSubquery() and the new PlanMerger.merge() is that I removed Header.references and the related safeguards like the if !references.contains(subqueryIndex) below.
Previously, we filled up Header.references and with nested subquery indices so as to avoid trying to merge a subquery with other subqueries that are nested into it.
E.g.

SELECT (
  SELECT avg(a) FROM t WHERE c = (    -- subquery 1
    SELECT sum(b) FROM t2)            -- subquery 2 nested into subquery 1
  )
)

Here it doesn't make sense to try merging the plan of subquery 1 and subquery 2 as the former contains the latter. The problem of trying to merge the 2 was not just it doesn't make sense, but in certain weird cases (see the example in SPARK-40618) the merge was successful and resulted in invalid plans.

But because PlanMerger should be a general plan merging tool and in some future rules it will be used to merge other than subqueries, tracking the nested subqueries in it is not the best way to deal with the problem.
This is why the refactored MergeScalarSubqueries uses a different approach. We have one PlanMerger object per each subquery level. E.g. subquery 2 is a level 0 (leaf) subquery, but subquery 1 is a level 1 subquery because it has an inner, level 0 subquery. So the 2 will be added to different PlanMergers. This way a level n subquery is tried to merge with other level n subqueries only.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @cloud-fan and @dtenedor from the previous PR (mentioned in the PR description).

BTW, @peter-toth , given the characteristic of this PR, it seems that we need to target Apache Spark 4.2.0 only (master branch) instead of 4.1.0 (branch-4.1).

val subquery5 = ScalarSubquery(testRelation.select((Symbol("a") + 2).as("a_plus2_2")))
val subquery6 = ScalarSubquery(testRelation.select(Symbol("b").as("b_2")))
val originalQuery = testRelation
.select(
Copy link
Contributor Author

@peter-toth peter-toth Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This minor test chnage in the order of subqueries is needed because the modified MergeScalarSubqueries encounters them in different order. So we either change the expected query to expect a_plus1_2 and a_plus2_2 or reorder the subqueries in the query like here. I applied the latter.

@peter-toth
Copy link
Contributor Author

cc @beliefer as well.

The goal here is to extract the merging logic and reuse it in SPARK-44571 / #42223 and SPARK-43025 / #40661.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants