Skip to content

Conversation

yhuang-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes to add a utility class to enable the evaluation of an InternalRow using a DSv2 predicate. In particular, it includes

  • converting dsv2 predicates to catalyst expressions
  • converting dsv2 expression to catalyst expressions
    • NamedReference -> BoundReference
    • LiteralValue -> catalyst Literal
  • Creating InterpretedPredicate and evaluate internalRow

Why are the changes needed?

This would be helpful for partition pruning, where the runtime filters are DSv2 predicates and the partitionValue are internalRows (for partitionFiles in Spark). In this way, partitionFiles can be pruned directly with DSv2 predicates at the scan level.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Oct 2, 2025
* @return Catalyst Expression representing the converted predicate, or empty if the predicate is
* unsupported or references unknown columns
*/
public static Optional<Expression> dsv2PredicateToCatalystExpression(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about convertV2PredicateToCatalyst

* @return Catalyst Expression representing the resolved expression, or empty if the expression is
* unsupported or references unknown columns
*/
public static Optional<Expression> dsv2ExpressionToCatalystExpression(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about convertV2ExpressionToCatalyst

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we can make it private for now

* predicate could not be converted
*/
public static Optional<Boolean> evaluateInternalRowOnDsv2Predicate(
org.apache.spark.sql.connector.expressions.filter.Predicate predicate,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why having the full class name here?

* @param predicate the DSV2 Predicate to evaluate
* @param internalRow the InternalRow to evaluate the predicate against
* @param schema the schema used for resolving column references in the predicate
* @return Optional containing the result of the evaluation (true or false), or empty if the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning an optional boolean is a bit confusing. How about having a input parameter failOnError=false which controls the behavior of predicate conversion failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By introducing failOnError, do you mean the first behavior or the second?

failOnError = True failOnError = False
converted, satisfied True True
converted, unsatisfied False False
unconverted False True
failOnError = True failOnError = False
converted, satisfied True True
converted, unsatisfied False False
unconverted throw error True?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second.
This is just a suggestion. IMO it is easier to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants