This repository was archived by the owner on Sep 2, 2022. It is now read-only.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix savepoint problems #392
Fix savepoint problems #392
Changes from 6 commits
faba237
324099e
35c0b26
59cf152
f3e73ca
7b48c46
84ae554
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the operation is still in progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change the comment here. This flow is still not 100% bug proof, I think more work needs to be done on savepoint flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then add more comments about the potential problems and TODOs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the doc there
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shashken @functicons There is no need to trigger savepoint here. This is because
shouldUpdateJob
checks if the latest savepoint exists and if it does not exit, savepoint will be triggered in other routine.restartJob
is just for restarting the job so there is no need to trigger savepoint.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shashken @functicons
It seems that savepoint should be triggered when
err == nil
.When I tested, sometimes I found that the savepoint was not triggered and only
status.job.lastSavepointTriggerTime
was updated.And FlinkCluster status is updated in
updateStatus
function of reconciler, therefore if you have a plan to make a new PR, it might be worth considering how to call the status update function once.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Damm I tested the version b4 the CR change with that line, sorry about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shashken @functicons
restartJob
is just for restarting the job so there is no need to trigger savepoint here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it configurable as a field in the job spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be the best case, I increased it for the moment but I think we need to check the jobmanager's API to see SP status in the next SP PR. Do you think its bad we increased it to 15 mins? Is there a case where some1 will want to take SP every <15mins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constant is actually the "minimal interval for triggering 2 savepoints", right? The name "Timeout" could be confusing, it might be mis-interpreted as "the timeout for taking a savepoint (before considering it as a failure)".
It is hard to determine the value. For example, I just took a savepoint 10 mins ago, but now I want to update my job, and I don't want to lose the state for the recent 10 mins, so I want it to take another savepoint before the update. Why do we need to introduce an arbitrary limit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three variables related to triggering savepoint.
SavepointAgeForJobUpdateSec
: savepoint age limit required for update progressSavepointRequestRetryIntervalSec
: retry interval for savepoint failure on updateSavepointTimeoutSec
: savepoint timeoutIn some cases, the savepoint may no longer proceed due to some errors, but the job manager may return the status normally. In that case, SavepointTimeoutSec is used to handle the timeout. For the jobs that require a long time to create savepoints, it would be better to change this variable to be user-configurable and set its default value large enough.
flink-on-k8s-operator/controllers/flinkcluster_util.go
Lines 45 to 51 in 17040f1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that it is possible to set the checkpoint timeout with the Flink configuration. In my opinion, it would be better to remove the Flink operator's savepoint timeout routine to resolve the second issue and guide related Flink configuration.
note: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be better solved in another PR, this one provides a mitigation for clusters that SP takes more than a few seconds.
I think we should discuss how we want to solve this in another issue and consider to get the info about the SP (is there an active one, was it timeout, etc..) from the jobmanager itself.
I can make this one a part of the crd for now and then later delete it when it will no longer be needed (in another PR)
WTYT @elanv @functicons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, let's address it in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late response. When a checkpoint timeout occurs in Flink jobmanager, the savepoint state falls to "falied", so I don't think the first savepoint needs to be identified. The second issue is occurring because the default Flink checkpoint timeout is 10 minutes, but
SavepointTimeoutSec
is less than that. I think it's okay to handle that part in another PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which's the next PRs/issues? This one? #420
It seems to work as
the "minimal interval for triggering 2 savepoints"
but some docs showsautoSavepointSeconds: 300
as an example value and I actually specify the value. Is this limitation a temporary workaround?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SavepointTimeoutSec
is just savepoint timeout andautoSavepointSeconds
is the savepoint trigger interval as you mentioned. And #420 is the PR to improve savepoint routines including this issue.