Skip to content

feat: extended expression builders #71

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 30, 2025

Conversation

tokoko
Copy link
Contributor

@tokoko tokoko commented Mar 15, 2025

Adds extended expression builder functions: column, literal and scalar_function

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@tokoko tokoko changed the title feat: simple field reference extended expression builder feat: extended expression builders Mar 22, 2025
elif kind == "string":
literal = stalg.Expression.Literal(string=value, nullable=type.string.nullability == stp.Type.NULLABILITY_NULLABLE)
else:
raise Exception(f"Unknown literal type - {type}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some unimplemented types here (such as dates, times, binary, uuid, etc.). Is it worth providing guidance here as to how to request/implement support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good point. I'll follow up with those implementations shortly and let's discuss this in that PR, if that's okay. Honestly I'm not sure how a literal builder for a "complex" type (something like IntervalCompound for example) should look like.


UnboundExpression = Callable[[stp.NamedStruct, FunctionRegistry], stee.ExtendedExpression]

def literal(value: Any, type: stp.Type, alias: str = None) -> UnboundExpression:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to create a literal builder that works for both expressions and extended expressions (avoiding code duplication)? Perhaps we need to add more support to the other builder parts to accept either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a simple Expression builder makes a lot of sense, especially for literals. I'm simply trying to keep things uniform for starters as I don't have a clear idea how to tackle Expression builders for non-literal types.


return resolve

def column(name: str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working over a name instead of a column number. Substrait doesn't need names but I can see this being useful as a convenience. There should be a version that accepts a field number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added field number resolver as well.


UnboundExpression = Callable[[stp.NamedStruct, FunctionRegistry], stee.ExtendedExpression]

def literal(value: Any, type: stp.Type, alias: str = None) -> UnboundExpression:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments for these methods (for instance, note how names are used).


def column(name: str):
def resolve(base_schema: stp.NamedStruct, registry: FunctionRegistry) -> stee.ExtendedExpression:
column_index = list(base_schema.names).index(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this resolution logic be handled in a separate function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll factor out the logic in a separate util as soon as another use for it appears. Does that sound okay?


return resolve

def scalar_function(uri: str, function: str, *expressions: UnboundExpression, alias: str = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this accept regular expressions and extended expressions as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 3 possible types to be used here: Expression, ExtendedExpression and UnboundExpression. My thinking about pros/cons of each option is:

  • The downside of using a regular proto Expression type is that Expressions can't carry additional information regarding resources that are used in the expression tree, meaning that if you use an extension function or a cte (once this substrait PR is merged) that context is thrown away with a regular Expression type.

    The same applies to a more Rel-level builders as well, for example a possible join builder can have one of these following signatures: join(left: Rel, right: Rel, join_type: str, condition: Expression) or join(left: Plan, right: Plan, join_type: str, condition: ExtendedExpression). I think we should prefer the second one because it allows for each object to independently hold full context of the extenstions/ctes used while building each object and allows them to be merged by the builder function later on.

  • ExtendedExpression is perfectly okay to use here with the only downside being that in order to build an ExtendedExpression the user needs to be aware of precisely in what context the expression is supposed to be used, in other words once ExtendedExpression is built, you've already tied it to a specific Rel schema where it can be used.

  • UnboundExpression is the currying equivalent of ExtendedExpression which allows the user to defer resolving precise Rel schema until later when it's actually used in the context of a specific Rel.

I guess we can probably introduce a type like Union[Expression, ExtendedExpression, UnboundExpression] and use it in similar builder signatures, if we want to cover as many use cases as possible. We will have to then do some light type matching on the builder side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case I'd like to enable is modifying an existing plan. I suppose you could convert an existing plan into a series of extended expressions. We don't need to do this right away though. A set of builders that work from scratch is a reasonable place to start and we can look at the other use cases later.

else:
return 1

def merge_extension_uris(*extension_uris: Iterable[ste.SimpleExtensionURI]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment explaining what merge does (maintains order but removes duplicates) would be useful here.

@EpsilonPrime EpsilonPrime merged commit 1e44dd8 into substrait-io:main Mar 30, 2025
17 checks passed
@tokoko tokoko deleted the extended-expression-builders branch March 30, 2025 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants