-
Notifications
You must be signed in to change notification settings - Fork 16
feat: allow passing multiple functions to function builders #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
it's a breaking change, but easy enough to change even if someone already actually depends on this stuff. |
|
I'm wondering whether we should have a definition on the Substrait spec level on how to handle function extension merging / prioritization / resolution. @benbellick recently introduced a change in substrait-java/isthmus that uses a priority order for extension YAML files in order to resolve duplicate function signatures across files but only when mapping to/from Calcite. I'm not sure how all the other Substrait implementations handle this. I assume that it would be good to have some consistency on this aspect. What do you think? |
|
as far as I understand, prioritization is an issue only if you try to look for functions by name only, right? The way ExtensionRegistry in substait-python is implemented right now, it always asks for uri/urn and name to locate the function, so there is not really a place for duplicates as long as extension files themselves are valid and urns are in fact unique. The design proposed in this PR expects input to look something like this -> |
|
right, these are different yet similar approaches. since Isthmus may start from a SQL string it might only have the name and it would need to find the right function across a set of YAML file by identifying the function with the matching function signature. Then if you had multiple functions with the same signature the YAML file priority comes in. In substrait-python you identify the function by URN/URI and function name and you try to allow defining which ones to pick from. Which also allows one to define priority on the function level since I guess it would pick the first one that matches the data types. |
f2a4550 to
890f84b
Compare
|
This is back on the market after incorporating urn changes. The api for single function now looks like this -> |
|
@nielspardon I felt that this change in substrait-java was necessary specifically because it is indeed valid substrait to have multiple functions with the same implementation but different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment
| def scalar_function( | ||
| urn: str, | ||
| function: str, | ||
| function: Union[str, Iterable[str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tokoko I am open minded to this approach, but I would strongly prefer not introducing string parsing if it is possible. Instead, maybe the API could be something like:
| function: Union[str, Iterable[str]], | |
| function: Union[ExtensionID, Iterable[ExtensionID]], |
where
@dataclass
class ExtensionID:
urn: str
function: strWhat do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also be open minded to just having a separate function for the individual and the list case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of parsing strings either. I simply wanted not to clutter the api too much. How about using NamedTuple instead of a dataclass. It would allow the user to pass plain tuples as well.
from typing import NamedTuple, Union, Iterable
class ExtensionFunction(NamedTuple):
urn: str
function: str
def process_func(func: Union[ExtensionFunction, Iterable[ExtensionFunction]]):
functions = [func] if isinstance(func[0], str) else func
for f in functions:
urn, name = f
print(urn)
print(name)
process_func(ExtensionFunction("sample_urn", "sample_func"))
process_func(("sample_urn", "sample_func"))
process_func([("sample_urn", "sample_func1"), ("sample_urn", "sample_func2")])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am okay with that approach as an API for allowing multiple parameters, though I still have my hesitations about the PR as a whole as expressed in the below comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further thought, I am not sure I see the value of this. IIUC, what is happening is we find the first function with matching implementation in the list and use that one for extended_expression construction?
To be this seems to be a lot less explicit than the current implementation.
You write:
while the user often doesn't care which one is used as long as the one with correct input types can be located.
Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?
If we wanted to offer a facility for looking up the first matching function by name and signature irrespective of uri/urn, then I could imagine something like that in the ExtensionRegistry.
If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?
The immediate use case for this is sql conversion.
I don't disagree that it's a "helper", but to be honest I sort of treated builders as "helpers" rather than strictly canonical in the first place. For example there are ways to provide columns by name (rather than an index) to a projection which is hardly canonical for substrait. I'm not trying to use that as an argument, though... maybe we should in fact have a clearer distinction between canonical and helper builders. |
|
@tokoko yeah in that case I could imagine a nice API for the function in which you pass in a function by name, e.g. What seems unusual about this PRs approach is that you can pass in functions in the list which have no influence on the output plan because they would never have a matching implementation to begin with. If the problem trying to be solved is that users sometimes don't care about which function it is, just that it has the appropriate name and type, then I think relegating that to the Of course, let me know if you disagree :) |
yeah, this is also a possibility and a cleaner one but with it's own downsides. It relies on the function name only, which means that if someone were to add an extension that reuses a function name from another, they will suddenly have to compete during lookup. For example, there is a slightly different implementation of I could also imagine another case when you want to extend some logical add operator on your engine side by mapping it to both existing
That is true, some of the functions won't affect the plan if you zoom in on individual invocations. The way I justify it to myself is that we kind of already do that when we allow to builder to specify urn and function only instead of having the user to specify the exact impl of the function that needs to be used (after all the scalar function expr references the signature of an impl, nor just a function). In other words, the current behavior is essentially the same, you specify a function and hope the registry will be able to locate a suitable impl. The changes in this PR simply extend the api to allow the user to enlarge the impl search space to multiple functions instead of a single one, but the mechanics of the search stays the same.
I'm not opposed to this, but there still should be some way to reach that functionality from the builder side, right? There still needs to be some function that accepts list of extension functions to pass onto the registry. I'd also be fine with treating this as some sort of addon builder helper as long as it's easily usable from user code (for example, sql to substrait or narwhals to substrait conversions). I'm just not yet sure where we would place those and how we would name it :). |
changes builder api for all functions so one can easily pass multiple function refs to the builder.