-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue#21430 Avoid pruning DataframeTransforms #23069
Conversation
Avoided the pruning of composite transforms for Beam dataframe.
Tested merging Beam DataFrames. To workaround for #22587 R: @PhilippeMoussalli PTAL, thx! |
Codecov Report
@@ Coverage Diff @@
## master #23069 +/- ##
==========================================
+ Coverage 73.58% 73.59% +0.01%
==========================================
Files 716 716
Lines 95301 95331 +30
==========================================
+ Hits 70127 70159 +32
+ Misses 23878 23876 -2
Partials 1296 1296
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of other questions:
- Could we make the manual test you did into a unit test?
- I think we should file a follow-up issue to enable pruning within dataframe transform
def should_skip_pruning(transform: AppliedPTransform): | ||
return ( | ||
isinstance(transform.transform, TestStream) or | ||
'_DataFrame_' in transform.full_label) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we match on is instance DataframeTransform
instead of string matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataframeTransform expands into a bunch of normal ParDos, so DataframeTransform itself doesn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, if there's a minimally invasive change we can make in the DataFrame API that would allow this filter be more selective, I'd be in favor of making that here. But that doesn't need to be a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - it might be good to make sure this works for operations on DeferredSeries objects as well (i.e. do we need to match on "_Series"?). This could be challenging though. I don't think we've encountered an example of a series getting pruned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend adding unittests with ib.collect for each Series operation. In that case, we don't have to skip pruning prematurely.
The same also applies to the DeferredDataFrame operations. If applicable (when a proper pruning is in-place, in the future add some unittest with ib.collect (following the InteractiveDataFrameTest in this PR).
Let's keep the issue open but modify the description for proper pruning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, once there's a test passing in CI
Retest this please. |
Run Python_PVR_Flink PreCommit |
All tests have passed. One of the tests shows pending but has actually passed. |
Fixes #21430.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.