-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add Extension Type / Metadata support for Scalar UDFs #15646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add Extension Type / Metadata support for Scalar UDFs #15646
Conversation
…ned an owned version of the metadata for simplicity
Some of the concerns I have:
|
Just a link to my experiments starting from the Expr enum in case they are useful! #15036 |
…handle extension types directly
I have updated the PR to use |
I think Aggregate and Window UDFs should come as a separate PR. I did notice however that for Aggregates the input portion is already viable with this PR. Since |
This sounds like a good idea to me, but I suggest we wait until DF 47 is shipped to merge it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR while this will be a painful downstream change for other crates, I don't really see any other way to support User Defined types / extension types and thus I think we should do make the change
I feel like this particular PR still leaves the code in an inconsistent state -- data type is still used somewhat inconsistently. If we are going to make this change I think we should fully update the API to be in terms of Field
instead of DataType
FYI @rluvaton as I think this is very related to
Shall we just change functions to return Field
? I think that would be the cleanest (though disruptive) solution
/// The evaluated arguments to the function | ||
pub args: Vec<ColumnarValue>, | ||
/// Field associated with each arg, if it exists | ||
pub arg_fields: Vec<Option<&'b Field>>, | ||
/// The number of rows in record batch being evaluated | ||
pub number_rows: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should change the return type to also be pub return_field: &'a Field
to be consistetn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also remove ReturnTypeInfo
and return_type
and have a single way to return values -- output_field
(or return_field
?)
Just chiming in here. I also encountered the similar issues with UDFs not having access to (and being able to return) fields during my experiments. Thanks for working on this 🚀! I'd really love to help out here, as I am craving for better UDTs support in DF. Unfortunately, I am a bit swamped at the moment but I'll find a few hours if there is something I can help with. I also believe that we should update the return API to Field as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! I have some questions primarily from the extension type/user-defined type angle, although I know that the primary motivation here is just metadata. It does seem like a thoughtful breaking change might result in a cleaner overall final design (but I'm very new here). In particular I wonder if a structure defined in datafusion (maybe ArgType
or ExprType
) with a to_field()
would be less confusing/more flexible than a Field
whose name and nullability is mostly ignored.
fn default() -> Self { | ||
Self { | ||
name: "canonical_extension_udf".to_string(), | ||
signature: Signature::exact(vec![DataType::Int8], Volatility::Immutable), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the Signature
also need additional options at some point such that it would have an opportunity to consider the metadata of the arguments?
let input_field = args.arg_fields[0].unwrap(); | ||
|
||
let output_as_bool = matches!( | ||
CanonicalExtensionType::try_from(input_field), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the things that I've worried about (but haven't benchmarked) is whether the parsing of the extension metadata (usually small amounts of JSON) will add up here (my sense is that it won't but it may be worth checking).
let array_ref = Arc::new(StringArray::from(array_values)) as ArrayRef; | ||
Ok(ColumnarValue::Array(array_ref)) | ||
} | ||
ColumnarValue::Scalar(value) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how extension scalars/literals would fit in here (perhaps Literal
will also need a lit_field()
or ScalarValue::Extension
would be needed)
Field::new("canonical_extension_udf", DataType::Utf8, true) | ||
.with_extension_type(MyUserExtentionType {}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder what the name and/or nullability should be in these cases (this also came up in my adventures with the Expr
enum). If we don't actually need them/their value is always ignored, I wonder if a dedicated structure would be better (e.g., closer to your original version that just used a HashMap
).
I need to take some time to review these comments and think more about it, likely next week. Also I'm dropping a note for myself that the current implementation isn't sufficient for my needs because I need the UDF to be able to compute the output field based on the arguments and my function signature for Case in point: |
What I was thinking is that if we are going to be changing ScalarFunction again so that instead of returning a The |
Which issue does this PR close?
ReturnTypeInfo
to return aField
rather thanDataType
#14247 but it is a different solution than requestedRationale for this change
We have many users who wish to use extension data or other metadata when writing user defined functions. This PR enables two features for Scalar UDFs
What changes are included in this PR?
This is a fairly large change, but at a high level we add in a vector of argument fields to the
ScalarFunctionArgs
that gets passed when invoked. Additionally all Physical Expressions are now required to implement a new function to output their field. The rest of the work included is around plumbing these changes through the system, extracting the field from the input schema, and setting it as output.Are these changes tested?
datafusion/core/tests/user_defined/user_defined_scalar_functions.rs