-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Add checkpoint during progress reporting. #34828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| continue progress | ||
| } | ||
| fraction = 0.5 | ||
| } else if checkpointReady && unsplit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should do this checkpointing for both bounded and unbounded cases.
@lostluck: WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intuition is telling me that we should not to add it for the bounded case. But the essence of Beam is to unify batch and streaming so it's probably fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm worried about the situation where we're just oversplitting and forcing a checkpoint on a perfectly fine running Batch DoFn (processing efficiently and per element fast).
As it stands, this code looks like it will checkpoint a fast moving bundle after 1 second. (10 ticks, at 100ms per tick, since a fast moving bundle won't get a slower progress request rate).
But that's sort of wasteful. A fast moving bundle shouldn't be stopped. We might just want to only do this if the bundle is moving fast, but the input index isn't moving? Perhaps this is a reason why Dataflow has explicit Batch and Streaming modes of execution.
One almost wants to do it based on the number or amount of output data instead, in order to allow the watermark to progress.... But that would be much harder, and is overthinking it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpointing is always a trade-off. In theory, we don't want to checkpoint too often to hurt performance, while we also want to checkpoint sufficiently enough so the hard work can be materialized and saved.
A fast moving bundle shouldn't be stopped.
I think we can consider using the checkpoint ticks AND the amount or rate of output data ("totalCount") as the criteria to identify a fast-moving bundle (thousands of events per tick) that lasts reasonably long. Instead of 1 second, we can change it to 10 (or even longer) seconds for example.
Even if it is fast moving, we may still want to checkpoint to make sure we don't need to repeat the previous 10-second work if something goes bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After we have bundle retrying implemented, we can adjust the threshold of checkpoint ticks to longer or shorter based on how often we see an error in the bundle, how long does it take to check point, etc.
| continue progress | ||
| } | ||
| // Save residual roots for checkpoint. After checkpointing is successful, | ||
| // the bundle will be marked as finished and no residual roots will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of a surprise. When a bundle finishes due to splitting with 0.0 fraction, no residual roots in the response. Is this by design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expected and by design.
Were you seeing errors due to returning the residuals early in the split response for checkpointing case?
There are a few different cases to think about, but they're aligned with the two FnAPI calls in question.
-
Normal ProcessBundleResponse: Returns when the primary is completed, there are no residuals to worry about.
-
Split Response + ProcessBundleResponse:
The Split Response contains the confirmation of the primary (what the bundle will finish processing), and the residual that needs to be processed later. ProcessBundleResponse will not contain any residuals at this time, since they were already persisted by the split response (per the above). -
Self Checkpointed ProcessBundleResponse: This is when the DoFn itself returns a process continuation for a specific element (eg. Resume in 10s or similar). The Primary is by definition completed, but there may be residuals to process later. That's what's returned and scheduled.
You're seeing 2 in this case. We shouldn't need to do any additional residual handling and processing after the bundle is finished here. I'd be a bit concerned that there is a data duplication risk when doing it this way (the same residuals getting "returned" twice.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. So my current code is correct then, since I am using the Residual from SplitResponse (rather than Residual from ProcessBundleResponse in the original code) to compute watermark and residual data outside of the progress for loop.
QQ: On case 2, what if we have a split at fraction 0.5? Prior to my change, I think the code is relying on Residual from ProcessBundleResponse to update the watermark. However isn't the residual empty there after we have a split response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So residualRoots is set on line 260, but the ResigdualRoots is called again and processed and sent back to the EM on line 282.
That variable is never un-set after processing for bundles.
Then after the bundle finishes, the residual roots are only overridden If and only if the final bundle has residual roots already. Therefore the cached roots from the split response might be processed a second time, being sent to the EM as part of PersistBundle, which then also reschedules them. So it may duplicate the residual data.
IIUC the better fix is to handle the output watermark estimate in the em.ReturnResidual call. Right now it's only happening in PersistBundle.
PersistBundle call: https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/engine/elementmanager.go#L905
ReturnResiduals call:
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/engine/elementmanager.go#L1023
Alternatively the MinOutputWatermarks for residuals is independent of the specific data, so it should also be valid to collect/update them with splits, but not persist them until the PersistBundle call. That matches closely what you have here (and what works), but without the duplicated data.
MinOutputWatermarks are collected here: https://github.com/apache/beam/pull/34828/files#diff-c799dce79559a70660d7abb42fcbff8455ba41452bd9483fc5c58dfcf156ee8cR343
Map is created just above currently: https://github.com/apache/beam/pull/34828/files#diff-c799dce79559a70660d7abb42fcbff8455ba41452bd9483fc5c58dfcf156ee8cR322
So we'd just need to ensure we don't have "stale" watermarks being persisted for this holding things back by accident.
|
Assigning reviewers. If you would like to opt out of this review, comment R: @jrmccluskey for label go. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
| slog.LogAttrs(context.TODO(), slog.LevelError, "returned empty residual application", slog.Any("bundle", rb)) | ||
| panic("sdk returned empty residual application") | ||
| } | ||
| // TODO what happens to output watermarks on splits? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well at least i left a note about the output watermarks when I last touched this.
I think the "correct" thing to do here is to collect them and apply them accordingly with PersistBundle, instead of reprocessing the whole set of residuals later (which is how the phone is rendering it. I'll need to reread it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I didn't quite follow the part where residual from split response is handled in the original code.
The latter part outside of the progress for loop makes more sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Estimated input elements bit is tricky since it's about how to estimate where to split for Unbounded SDFs and how big a bundle "is". I can't recall exactly why it ended up with the "filter out residuals that are before the end of data", cases... but apparently it had to do with timers?
https://github.com/apache/beam/blame/master/sdks/go/pkg/beam/runners/prism/internal/stage.go#L253
|
Reminder, please take a look at this pr: @jrmccluskey |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @lostluck for label go. Available commands:
|
|
waiting on author |
|
Assign that back to me as I need to make some more changes. |
|
Reminder, please take a look at this pr: @lostluck |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @jrmccluskey for label go. Available commands:
|
|
Converting this back to draft for now as it needs some more thoughts. |
|
Reminder, please take a look at this pr: @jrmccluskey |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @lostluck for label go. Available commands:
|
|
Reminder, please take a look at this pr: @lostluck |
|
waiting on author (since the bot doesn't know how drafts work) |
|
waiting on author |
|
It is a draft PR, and I don't have bandwidth to move forward with that recently. Feel free to close it if the bot is making too much noise. |
|
Reminder, please take a look at this pr: @lostluck |
|
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @jrmccluskey for label go. Available commands:
|
|
The place in the center |
|
Closing this for now. |
addresses #33815
I verified that the previous stuck sample code in #33815 (comment) and #33815 (comment) is working with this PR's changes.