Add Extension Type / Metadata support for Scalar UDFs #15646

Draft · timsaucer wants to merge 7 commits into main

Conversation

@timsaucer (Contributor) commented Apr 8, 2025

Which issue does this PR close?

Rationale for this change

We have many users who wish to use extension data or other metadata when writing user-defined functions. This PR enables two features for Scalar UDFs (sketched in the example after this list):

  • When invoking the UDF, the Field of each input column will be available, if it exists
  • The UDF can also specify an output field that is attached to the schema of the record batch
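
As a rough illustration (not the exact code in this PR), a UDF body could use the new information like the sketch below; it assumes the arg_fields and number_rows members of ScalarFunctionArgs quoted later in this thread, and the arrow.uuid check is just an example:

use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, BooleanArray};
use datafusion::arrow::datatypes::Field;
use datafusion::common::Result;
use datafusion::logical_expr::{ColumnarValue, ScalarFunctionArgs};

// Hypothetical UDF body: inspect the extension name on the first argument's
// Field (exposed through arg_fields) and report it as a boolean column whose
// length matches the number of rows being evaluated.
fn invoke_with_args(args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let is_uuid = args.arg_fields[0]
        .map(|f: &Field| {
            f.metadata().get("ARROW:extension:name").map(String::as_str) == Some("arrow.uuid")
        })
        .unwrap_or(false);

    let array: ArrayRef = Arc::new(BooleanArray::from(vec![is_uuid; args.number_rows]));
    Ok(ColumnarValue::Array(array))
}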

What changes are included in this PR?

This is a fairly large change, but at a high level we add a vector of argument fields to the ScalarFunctionArgs that is passed on invocation. Additionally, all physical expressions are now required to implement a new function that outputs their Field. The rest of the work is plumbing these changes through the system, extracting the Field from the input schema, and setting it on the output.
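
The per-expression requirement can be pictured as the following shape (illustrative only; the method name in the actual PR may differ):

use datafusion::arrow::datatypes::{Field, Schema};
use datafusion::common::Result;

// Illustrative shape only: alongside the existing data_type and nullable
// methods, each physical expression must be able to describe its output as a
// complete Field (data type plus nullability plus metadata) given its input
// schema.
trait OutputFieldSketch {
    fn output_field(&self, input_schema: &Schema) -> Result<Field>;
}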

Are these changes tested?

  • All existing unit tests pass.
  • Additional unit tests exercising this feature are added in datafusion/core/tests/user_defined/user_defined_scalar_functions.rs

@github-actions bot added the logical-expr, physical-expr, core, proto, functions, and ffi labels on Apr 8, 2025
@timsaucer (Contributor, Author):

Some of the concerns I have:

  • Canonical extension types in arrow-rs have an implementation of TryFrom<&Field>, which lends weight to the original issue's suggestion of passing a Field rather than just the metadata (see the sketch after this list).
  • I need to add unit tests to make sure other operations don't throw away the metadata. I don't know if there is a one-size-fits-all solution for which operations should and which should not pass through metadata. For example, alias should definitely keep the metadata IMO, but what about functions that take two inputs? And all of the other single-input functions - it's not clear whether we need to handle this on a case-by-case basis or not.
  • If we do switch over to Field, then it isn't immediately obvious to me what the name of the field should be. I need to investigate this a little more because there might already be a solution.
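
For reference, a minimal sketch of that TryFrom<&Field> conversion, assuming a direct dependency on arrow-schema with its canonical extension types feature enabled:

use arrow_schema::extension::{CanonicalExtensionType, Uuid};
use arrow_schema::{DataType, Field};

fn main() {
    // Attach the canonical UUID extension type to a Field, then recover it
    // through the TryFrom<&Field> impl that arrow-rs provides.
    let field = Field::new("id", DataType::FixedSizeBinary(16), false).with_extension_type(Uuid);

    match CanonicalExtensionType::try_from(&field) {
        Ok(CanonicalExtensionType::Uuid(_)) => println!("field carries arrow.uuid"),
        _ => println!("no canonical extension type recognized"),
    }
}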

@paleolimbot (Member):

Just a link to my experiments starting from the Expr enum in case they are useful! #15036

@timsaucer (Contributor, Author):

I have updated the PR to use Field instead of a metadata HashMap<String, String>. In doing so we can now use extension types directly. I've added a second unit test that uses extension types, both a canonical one and a user-defined one.
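
As a side note, a Field covers both cases because extension types are themselves encoded as Field metadata; a small sketch (the standard Arrow metadata keys are real, the extension name and field name here are made up):

use std::collections::HashMap;

use arrow_schema::{DataType, Field};

// Build an output Field that carries a (made-up) extension type name via the
// standard Arrow metadata keys; a plain key/value pair would work the same way.
fn output_field_example() -> Field {
    Field::new("my_udf_output", DataType::Utf8, true).with_metadata(HashMap::from([
        ("ARROW:extension:name".to_string(), "my_org.my_type".to_string()),
        ("ARROW:extension:metadata".to_string(), String::new()),
    ]))
}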

@timsaucer timsaucer changed the title Add metadata support for Scalar UDFs Add Extension Type / Metadata support for Scalar UDFs Apr 9, 2025
@timsaucer timsaucer self-assigned this Apr 9, 2025
@timsaucer (Contributor, Author):

I think Aggregate and Window UDFs should come as a separate PR. I did notice, however, that for aggregates the input portion is already viable with this PR: since AccumulatorArgs already passes in the input physical expression and input schema, we can compute the input fields. I've tested this locally with success. For window functions we will want to add in the input schema.
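
A sketch of what that could look like for aggregates, where the field_of closure is a placeholder for the per-expression field lookup introduced in this PR (the exact name may differ):

use std::sync::Arc;

use datafusion::arrow::datatypes::{Field, Schema};
use datafusion::common::Result;
use datafusion::physical_expr::PhysicalExpr;

// Derive the input Fields for an aggregate from the expressions and schema
// that AccumulatorArgs already carries; `field_of` stands in for the new
// per-expression field method.
fn input_fields(
    exprs: &[Arc<dyn PhysicalExpr>],
    schema: &Schema,
    field_of: impl Fn(&dyn PhysicalExpr, &Schema) -> Result<Field>,
) -> Result<Vec<Field>> {
    exprs.iter().map(|e| field_of(e.as_ref(), schema)).collect()
}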

@timsaucer timsaucer marked this pull request as ready for review April 9, 2025 19:14
@paleolimbot (Member):

I'll take a look this evening!

It's mostly updating to Arrow 55, but #15663 (in particular e37ef60) is the change required to support extension types if arrow-rs supported them (minus reading them from files, which is admittedly a big minus).

@alamb (Contributor) commented Apr 10, 2025

This sounds like a good idea to me, but I suggest we wait until DF 47 is shipped to merge it

@alamb (Contributor) left a comment:

TL;DR: while this will be a painful downstream change for other crates, I don't really see any other way to support user-defined types / extension types, and thus I think we should make the change.

I feel like this particular PR still leaves the code in an inconsistent state -- data type is still used somewhat inconsistently. If we are going to make this change I think we should fully update the API to be in terms of Field instead of DataType

FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution

/// The evaluated arguments to the function
pub args: Vec<ColumnarValue>,
/// Field associated with each arg, if it exists
pub arg_fields: Vec<Option<&'b Field>>,
/// The number of rows in record batch being evaluated
pub number_rows: usize,

Review comment (Contributor):

Perhaps we should change the return type to also be pub return_field: &'a Field, to be consistent.

Review comment (Contributor):

I think we should also remove ReturnTypeInfo and return_type and have a single way to return values -- output_field (or return_field?)

@tobixdev (Contributor):

Just chiming in here. I also encountered similar issues with UDFs not having access to (and not being able to return) fields during my experiments.

Thanks for working on this 🚀! I'd really love to help out here, as I am craving better UDT support in DF. Unfortunately, I am a bit swamped at the moment, but I'll find a few hours if there is something I can help with.

I also believe that we should update the return API to Field as well.

@paleolimbot (Member) left a comment:

This is awesome! I have some questions, primarily from the extension type/user-defined type angle, although I know that the primary motivation here is just metadata. It does seem like a thoughtful breaking change might result in a cleaner overall final design (but I'm very new here). In particular, I wonder if a structure defined in datafusion (maybe ArgType or ExprType) with a to_field() would be less confusing/more flexible than a Field whose name and nullability are mostly ignored.
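
For discussion, such a structure could look roughly like this (names are hypothetical, not an existing DataFusion API):

use std::collections::HashMap;

use arrow_schema::{DataType, Field};

// Hypothetical ArgType: just the pieces a UDF needs (type, nullability,
// metadata), with a to_field conversion for when a Field is required, so the
// field name is never load-bearing.
struct ArgType {
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl ArgType {
    fn to_field(&self, name: &str) -> Field {
        Field::new(name, self.data_type.clone(), self.nullable).with_metadata(self.metadata.clone())
    }
}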

fn default() -> Self {
    Self {
        name: "canonical_extension_udf".to_string(),
        signature: Signature::exact(vec![DataType::Int8], Volatility::Immutable),

Review comment (Member):

Would the Signature also need additional options at some point such that it would have an opportunity to consider the metadata of the arguments?

let input_field = args.arg_fields[0].unwrap();

let output_as_bool = matches!(
    CanonicalExtensionType::try_from(input_field),

Review comment (Member):

One of the things that I've worried about (but haven't benchmarked) is whether the parsing of the extension metadata (usually small amounts of JSON) will add up here (my sense is that it won't but it may be worth checking).

        let array_ref = Arc::new(StringArray::from(array_values)) as ArrayRef;
        Ok(ColumnarValue::Array(array_ref))
    }
    ColumnarValue::Scalar(value) => {

Review comment (Member):

I wonder how extension scalars/literals would fit in here (perhaps Literal will also need a lit_field() or ScalarValue::Extension would be needed)

Comment on lines +1615 to +1616
Field::new("canonical_extension_udf", DataType::Utf8, true)
    .with_extension_type(MyUserExtentionType {}),

Review comment (Member):

I wonder what the name and/or nullability should be in these cases (this also came up in my adventures with the Expr enum). If we don't actually need them/their value is always ignored, I wonder if a dedicated structure would be better (e.g., closer to your original version that just used a HashMap).

@rluvaton (Contributor):

FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution

@alamb Not sure how this solves the problem I described in the issue

@timsaucer (Contributor, Author):

I need to take some time to review these comments and think more about it, likely next week. Also, I'm dropping a note for myself that the current implementation isn't sufficient for my needs, because I need the UDF to be able to compute the output field based on the arguments, and my function signature for output_field will not enable that.

Case in point: GetFieldFunc should return the field with metadata for the subfield of the struct, so it needs to know which field it was called for.
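
For concreteness, the GetFieldFunc case boils down to something like the sketch below (the helper name is made up): the output Field, including its metadata, has to come from the matching child of the struct argument's type.

use arrow_schema::{DataType, Field};

// Given the Field of a struct-typed argument and the requested child name,
// return that child's Field (metadata and all), if present.
fn struct_child_field(input: &Field, child_name: &str) -> Option<Field> {
    match input.data_type() {
        DataType::Struct(children) => children
            .iter()
            .find(|f| f.name().as_str() == child_name)
            .map(|f| f.as_ref().clone()),
        _ => None,
    }
}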

@timsaucer timsaucer marked this pull request as draft April 11, 2025 18:35
@alamb (Contributor) commented Apr 14, 2025

FYI @rluvaton as I think this is very related to

Shall we just change functions to return Field? I think that would be the cleanest (though disruptive) solution

@alamb Not sure how this solves the problem I described in the issue

What I was thinking is that if we are going to be changing ScalarFunction again, then instead of returning a DataType from return_type_from_args, it would return a Field.

The Field includes nullability information (as well as extension type information)
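
In other words, the direction is roughly the following contrast (the trait and the return_field name are illustrative, not a final API); return_type is the existing DataType-based method:

use datafusion::arrow::datatypes::{DataType, Field};
use datafusion::common::Result;

// Illustrative only: a bare DataType cannot carry nullability or extension
// metadata, while a Field carries all three.
trait ReturnTypeSketch {
    // today: type only
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>;
    // proposed direction: type + nullability + metadata
    fn return_field(&self, arg_fields: &[Field]) -> Result<Field>;
}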

Successfully merging this pull request may close these issues.

Change ReturnTypeInfo to return a Field rather than DataType