@tokoko
Contributor

@tokoko tokoko commented Oct 18, 2025

changes builder api for all functions so one can easily pass multiple function refs to the builder.

@github-actions

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@tokoko
Contributor Author

tokoko commented Oct 18, 2025

  • the builder will try looking up each function in the registry and throw an exception if none of them matches. this is practical because substrait often splits logically related functions in separate extensions, for example ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"] while the user often doesn't care which one is used as long as the one with correct input types can be located.
  • this should also make transition to urns a bit simpler as uri arg will no longer be part of the api.

it's a breaking change, but easy enough to change even if someone already actually depends on this stuff.
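
A minimal sketch of the lookup described above, assuming a toy in-memory registry. `TOY_REGISTRY`, `lookup_first_match`, and the type strings are hypothetical stand-ins for illustration, not substrait-python's actual `ExtensionRegistry` API.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical registry: maps "extension:function" to the argument-type
# signatures that function implements. This is an assumption for the sketch,
# not how substrait-python stores its extensions.
TOY_REGISTRY: Dict[str, List[Tuple[str, ...]]] = {
    "functions_arithmetic.yaml:add": [("i32", "i32"), ("i64", "i64")],
    "functions_arithmetic_decimal.yaml:add": [("decimal", "decimal")],
}


def lookup_first_match(candidates: Iterable[str], arg_types: Tuple[str, ...]) -> str:
    """Return the first candidate whose registered signatures include arg_types."""
    for ref in candidates:
        if arg_types in TOY_REGISTRY.get(ref, []):
            return ref
    # None of the candidates has an implementation for these argument types.
    raise LookupError(f"no candidate matches argument types {arg_types}")


refs = ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"]
chosen = lookup_first_match(refs, ("decimal", "decimal"))
```

With decimal arguments the second reference is selected; with `("i32", "i32")` the first one would be, and an unsupported type raises `LookupError`.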

@nielspardon
Member

I'm wondering whether we should have a definition on the Substrait spec level on how to handle function extension merging / prioritization / resolution. @benbellick recently introduced a change in substrait-java/isthmus that uses a priority order for extension YAML files in order to resolve duplicate function signatures across files but only when mapping to/from Calcite. I'm not sure how all the other Substrait implementations handle this. I assume that it would be good to have some consistency on this aspect. What do you think?

@tokoko
Contributor Author

tokoko commented Oct 28, 2025

as far as I understand, prioritization is an issue only if you try to look for functions by name only, right? The way ExtensionRegistry in substrait-python is implemented right now, it always asks for uri/urn and name to locate the function, so there is not really a place for duplicates as long as extension files themselves are valid and urns are in fact unique.

The design proposed in this PR expects input to look something like this -> ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"] (or urn equivalent once the other PR is merged). We can of course also have an option of searching by name only (add) and the prioritization will come into picture, but we currently don't have anything like that.

@nielspardon
Member

nielspardon commented Oct 28, 2025

right, these are different yet similar approaches.

since Isthmus may start from a SQL string it might only have the name and it would need to find the right function across a set of YAML files by identifying the function with the matching function signature. Then if you had multiple functions with the same signature the YAML file priority comes in.

In substrait-python you identify the function by URN/URI and function name and you try to allow defining which ones to pick from. Which also allows one to define priority on the function level since I guess it would pick the first one that matches the data types.
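
To make the contrast concrete, here is a rough sketch of a name-only lookup that falls back on file priority when several files define the same signature. It only illustrates the idea and is not how Isthmus implements it; `SIGNATURES`, `resolve_by_name`, and `my_engine.yaml` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical data: per-file function signatures, keyed by YAML file name.
SIGNATURES: Dict[str, Dict[str, List[Tuple[str, ...]]]] = {
    "functions_arithmetic.yaml": {"add": [("i32", "i32")]},
    "my_engine.yaml": {"add": [("i32", "i32"), ("decimal", "decimal")]},
}


def resolve_by_name(
    name: str, arg_types: Tuple[str, ...], file_priority: Sequence[str]
) -> str:
    """Return the highest-priority file that defines `name` with `arg_types`."""
    for file_name in file_priority:
        if arg_types in SIGNATURES.get(file_name, {}).get(name, []):
            return file_name
    raise LookupError(f"{name}{arg_types} not found in any file")


# Both files define add(i32, i32), so the priority order decides the winner.
default_first = resolve_by_name(
    "add", ("i32", "i32"), ["functions_arithmetic.yaml", "my_engine.yaml"]
)
engine_first = resolve_by_name(
    "add", ("i32", "i32"), ["my_engine.yaml", "functions_arithmetic.yaml"]
)
```

The same call yields a different file depending only on the priority list, which is the behavior that would benefit from a consistent definition across implementations.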

@tokoko tokoko marked this pull request as draft October 30, 2025 15:00
@tokoko tokoko closed this Oct 30, 2025
@tokoko tokoko force-pushed the multiple-functions branch from f2a4550 to 890f84b Compare October 30, 2025 15:01
@tokoko tokoko reopened this Oct 30, 2025
@tokoko tokoko marked this pull request as ready for review October 30, 2025 20:15
@tokoko
Contributor Author

tokoko commented Oct 30, 2025

This is back on the market after incorporating urn changes. The api for single function now looks like this -> scalar_function("extension:io.substrait:functions_comparison:gte", expressions=[...]). Although the initial intent was simply to concatenate extension urn and function name, looking at it now.. it feels like the resulting string is almost like a function urn of sorts. In that case, would something like extension_function:io.substrait:functions_comparison:gte make more sense? just an idea, curious what you think @benbellick
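
For illustration, a combined string like the one above could be split back into its two parts roughly as follows. `split_function_ref` is a hypothetical helper, and the sketch assumes the function name itself contains no colon, which would not hold for a signature-qualified name such as `add:i32_i32`.

```python
from typing import Tuple


def split_function_ref(ref: str) -> Tuple[str, str]:
    """Split '<extension urn>:<function name>' at the last colon."""
    urn, sep, name = ref.rpartition(":")
    if not sep or not urn or not name:
        raise ValueError(f"expected '<extension urn>:<function name>', got {ref!r}")
    return urn, name


urn, name = split_function_ref("extension:io.substrait:functions_comparison:gte")
# urn  == "extension:io.substrait:functions_comparison"
# name == "gte"
```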

@benbellick
Member

@nielspardon I felt that this change in substrait-java was necessary specifically because it is indeed valid substrait to have multiple functions with the same implementation but different urns. However, the calcite conversion caused an expected exception to be thrown if that was the case. It seemed like a questionable API to have valid substrait YAML files inadvertently cause a crash in a specific usage of the library. That being said, I am in no way wed to the resolution strategy I used there. I honestly chose it because it was simplest to implement, not because it was "best".

Member

@benbellick benbellick left a comment

See comment

def scalar_function(
- urn: str,
- function: str,
+ function: Union[str, Iterable[str]],
Member

@tokoko I am open minded to this approach, but I would strongly prefer not introducing string parsing if it is possible. Instead, maybe the API could be something like:

Suggested change
- function: Union[str, Iterable[str]],
+ function: Union[ExtensionID, Iterable[ExtensionID]],

where

from dataclasses import dataclass

@dataclass
class ExtensionID:
  urn: str
  function: str

What do you think?

Member

I would also be open minded to just having a separate function for the individual and the list case.

Contributor Author

I'm not a fan of parsing strings either. I simply wanted not to clutter the api too much. How about using NamedTuple instead of a dataclass. It would allow the user to pass plain tuples as well.

from typing import NamedTuple, Union, Iterable

class ExtensionFunction(NamedTuple):
    urn: str
    function: str

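def process_func(func: Union[ExtensionFunction, Iterable[ExtensionFunction]]):
    # A single (urn, function) pair has a str as its first element;
    # a list of pairs has a tuple there instead.
    functions = [func] if isinstance(func[0], str) else func
    for f in functions:
        urn, name = f  # works for both ExtensionFunction and plain tuples
        print(urn)
        print(name)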

process_func(ExtensionFunction("sample_urn", "sample_func"))
process_func(("sample_urn", "sample_func"))
process_func([("sample_urn", "sample_func1"), ("sample_urn", "sample_func2")])
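
One caveat with the `func[0]` check is that it requires a subscriptable argument, so a generator of pairs would fail. A slightly more defensive variant might look like the sketch below; `normalize` and the `FunctionRef` alias are hypothetical names, not part of the PR.

```python
from typing import Iterable, List, NamedTuple, Tuple, Union


class ExtensionFunction(NamedTuple):
    urn: str
    function: str


# Either the named tuple or a plain (urn, function) tuple is accepted.
FunctionRef = Union[ExtensionFunction, Tuple[str, str]]


def normalize(func: Union[FunctionRef, Iterable[FunctionRef]]) -> List[ExtensionFunction]:
    """Return a list of ExtensionFunction for one ref or an iterable of refs."""
    # A single ref is a 2-tuple of strings (ExtensionFunction is a tuple too).
    if isinstance(func, tuple) and len(func) == 2 and all(isinstance(p, str) for p in func):
        return [ExtensionFunction(*func)]
    # Otherwise treat the argument as an iterable of refs (list, tuple, generator).
    return [ExtensionFunction(*f) for f in func]


single = normalize(("sample_urn", "sample_func"))
many = normalize(pair for pair in [("sample_urn", "f1"), ("sample_urn", "f2")])
```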

Member

I am okay with that approach as an API for allowing multiple parameters, though I still have my hesitations about the PR as a whole as expressed in the below comment.

@benbellick benbellick self-requested a review October 31, 2025 18:50
Member

@benbellick benbellick left a comment

Upon further thought, I am not sure I see the value of this. IIUC, what is happening is we find the first function with matching implementation in the list and use that one for extended_expression construction?

To me this seems to be a lot less explicit than the current implementation.

You write:

> while the user often doesn't care which one is used as long as the one with correct input types can be located.

Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?

If we wanted to offer a facility for looking up the first matching function by name and signature irrespective of uri/urn, then I could imagine something like that in the ExtensionRegistry.

If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?

@tokoko
Contributor Author

tokoko commented Oct 31, 2025

> Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?

The immediate use case for this is sql conversion. The + operator from sql can map to either an add in functions_arithmetic or functions_arithmetic_decimal. I think using these sorts of arithmetic functions regardless of the types involved will be a pretty common occurrence. It's not just sql. If I remember correctly, the same was true in ibis to substrait conversion code as well. ibis didn't distinguish between them, while substrait did. My point is that this sort of "go through multiple functions and see which one fits" is bound to be implemented somewhere regardless.

> If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?

I don't disagree that it's a "helper", but to be honest I sort of treated builders as "helpers" rather than strictly canonical in the first place. For example there are ways to provide columns by name (rather than an index) to a projection which is hardly canonical for substrait. I'm not trying to use that as an argument, though... maybe we should in fact have a clearer distinction between canonical and helper builders.

@benbellick
Member

@tokoko yeah in that case I could imagine a nice API for the function in which you pass in a function by name, e.g. add and then the implementation deduces the expected function type and looks up a match in the ExtensionRegistry. To me that seems like a reasonable approach.

What seems unusual about this PR's approach is that you can pass in functions in the list which have no influence on the output plan because they would never have a matching implementation to begin with.

If the problem being solved is that users sometimes don't care about which function it is, just that it has the appropriate name and type, then I think relegating that to the ExtensionRegistry makes more sense, rather than expecting users to produce a sufficient list of functions with the hopes that the list contains the appropriate match.

Of course, let me know if you disagree :)

@tokoko
Contributor Author

tokoko commented Nov 3, 2025

> nice API for the function in which you pass in a function by name, e.g. add and then the implementation deduces the expected function type and looks up a match in the ExtensionRegistry. To me that seems like a reasonable approach.

yeah, this is also a possibility and a cleaner one but with its own downsides. It relies on the function name only, which means that if someone were to add an extension that reuses a function name from another, they will suddenly have to compete during lookup. For example, there is a slightly different implementation of add in an extension called extension:mydataengine:functions_arithmetic. If a user wants to use default add in one part of the plan and this slightly different impl in another, just a name-based lookup wouldn't be sufficient.
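
A small sketch of that collision, assuming a toy registry keyed by (extension urn, function name). `REGISTRY` and `find_by_name` are hypothetical, and the `io.substrait` arithmetic urn is written by analogy with the comparison urn mentioned earlier in this thread.

```python
from typing import Dict, List, Tuple

# Hypothetical registry keyed by (extension urn, function name).
REGISTRY: Dict[Tuple[str, str], List[Tuple[str, ...]]] = {
    ("extension:io.substrait:functions_arithmetic", "add"): [("i32", "i32")],
    ("extension:mydataengine:functions_arithmetic", "add"): [("i32", "i32")],
}


def find_by_name(name: str, arg_types: Tuple[str, ...]) -> List[str]:
    """Return every extension urn that defines `name` with `arg_types`."""
    return [
        urn
        for (urn, fn), signatures in REGISTRY.items()
        if fn == name and arg_types in signatures
    ]


# A name-only lookup finds two equally valid candidates.
matches = find_by_name("add", ("i32", "i32"))

# Keying on (urn, name) still lets a caller pick one explicitly.
explicit = ("extension:mydataengine:functions_arithmetic", "add") in REGISTRY
```

Here `matches` contains both urns, so a name-only API would need an extra tie-breaking rule, while the explicit pair remains unambiguous.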

I could also imagine another case when you want to extend some logical add operator on your engine side by mapping it to both existing add functions in default extensions, plus some of your own implementations that work on your own user-defined types. What if you were to call the function add_mycustomtype instead of just add. I guess you could simply rename, but I thought it wasn't a reasonable limitation to impose. Maybe I'm overthinking this though 😆

> What seems unusual about this PR's approach is that you can pass in functions in the list which have no influence on the output plan because they would never have a matching implementation to begin with.

That is true, some of the functions won't affect the plan if you zoom in on individual invocations. The way I justify it to myself is that we kind of already do that when we allow the builder to specify urn and function only instead of having the user specify the exact impl of the function that needs to be used (after all the scalar function expr references the signature of an impl, not just a function). In other words, the current behavior is essentially the same, you specify a function and hope the registry will be able to locate a suitable impl. The changes in this PR simply extend the api to allow the user to enlarge the impl search space to multiple functions instead of a single one, but the mechanics of the search stay the same.

> relegating that to the ExtensionRegistry makes more sense

I'm not opposed to this, but there still should be some way to reach that functionality from the builder side, right? There still needs to be some function that accepts a list of extension functions to pass on to the registry.

I'd also be fine with treating this as some sort of addon builder helper as long as it's easily usable from user code (for example, sql to substrait or narwhals to substrait conversions). I'm just not yet sure where we would place those and how we would name it :).
