-
Notifications
You must be signed in to change notification settings - Fork 14
Fix udf op #342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix udf op #342
Conversation
Documentation preview |
@@ -78,7 +79,8 @@ def transform( | |||
else: | |||
# shouldn't ever happen, | |||
raise RuntimeError(f"unhandled UDF param count {self._param_count}") | |||
return new_df | |||
# return input type data | |||
return make_df(new_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works ok for data frame based graphs, but we might need to revisit in a follow-up change to make this support tensortable too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think its fine, for tensortable it also works because the executor handles the change to something compatible for the next operator. https://github.com/NVIDIA-Merlin/core/blob/main/merlin/dag/executors.py#L105
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about if a user-defined function returns a TensorColumn type, is that something we'd like to support?
It's probably worth adding a test for this situation where we pass a pandas data frame but the user defined function returns a cudf series and/or the other way around |
Test added. |
This PR fixes issues with the UDF op outlined here NVIDIA-Merlin/systems#360. This issue occurs when the user defined function is created with a targeted framework. However if the function is setup with a specific framework, the output would end up as the framework specified. This causes issues with the expected output (which should be the same as the input). The fix added here uses a dictionary instead of an input type collection. After the collection is complete we create a dataframe to ensure continued support for downstream operators. This is necessary to be able to all the different types of inputs possible (i.e. TensorTable, Cudf Dataframe, Pandas Dataframe). If using tensortable, it will be converted before the next operator in the executor.