Skip to content

Add ExistenceJoin support to Comet native execution #3881

@Shekharrajak

Description

@Shekharrajak

What is the problem the feature request solves?

Comet does not support ExistenceJoin, causing incorrect results for correlated IN subqueries combined with OR on Spark 4.0. Adding native ExistenceJoin support would allow Comet to handle these plans end-to-end and eliminate the mixed Spark/Comet execution that produces wrong results.

The following query returns incorrect results when Comet is enabled on Spark 4.0:

CREATE TEMPORARY VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
CREATE TEMPORARY VIEW t2(c1, c2) AS VALUES (0, 2), (0, 3);

SELECT * FROM t1 WHERE
  c1 IN (SELECT count(*) + 1 FROM t2 WHERE t2.c1 = t1.c1) OR
  c2 IN (SELECT count(*) - 1 FROM t2 WHERE t2.c1 = t1.c1);

Expected: (0, 1) and (1, 2) (both rows match via different OR branches) Actual with Comet: (1, 2) only (row (0, 1) is dropped)

The same query with a single IN (no OR) produces correct results.

Describe the potential solution

No response

Additional context

This only reproduces on Spark 4.0 because:

in-count-bug.sql test only exists in Spark 4.0
Spark 4.0 uses a new decorrelation path (decorrelateInnerQueryEnabledForExistsIn = true by default) that produces DomainJoin + ExistenceJoin plan structures
Spark 3.5 uses the old decorrelation which has the "count bug" (different expected behavior)

Metadata

Metadata

Assignees

Labels

area:expressionsExpression evaluationbugSomething isn't workingcorrectnessenhancementNew feature or requestpriority:criticalData corruption, silent wrong results, security issues

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions