This repository was archived by the owner on Sep 2, 2022. It is now read-only.

Fix savepoint problems #392

Conversation

@shashken (Contributor) commented Jan 11, 2021

I found 2 problems related to savepoints:

  1. When upgrading a job, there was no option to take a savepoint before the upgrade (and use it to restore the job afterwards). I added a flag to cover this case.

  2. When a cluster starts, it tries to take a savepoint, but the savepoint status is only updated once the savepoint completes. This creates a situation where a new savepoint is triggered while the previous one is still running, and it keeps happening (forever) if your savepoints don't finish quickly. I solved this with another status value that holds the savepoint trigger time, plus an increased savepoint timeout, so a new savepoint is not triggered while one is still running (see the sketch below).
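
A minimal Go sketch of the guard described in point 2; the names (canTriggerNewSavepoint, lastTriggerTime) are illustrative, not the exact identifiers used in this PR:

package main

import (
	"fmt"
	"time"
)

// SavepointTimeoutSec mirrors the constant this PR raises from 60 to 900 (15 min).
const SavepointTimeoutSec = 900

// canTriggerNewSavepoint skips triggering a new savepoint while the previously
// triggered one may still be running, i.e. until the recorded trigger time is
// older than the timeout.
func canTriggerNewSavepoint(lastTriggerTime, now time.Time) bool {
	return lastTriggerTime.IsZero() ||
		now.Sub(lastTriggerTime) > SavepointTimeoutSec*time.Second
}

func main() {
	lastTrigger := time.Now().Add(-5 * time.Minute)
	// Prints false: the previous savepoint was triggered only 5 minutes ago.
	fmt.Println(canTriggerNewSavepoint(lastTrigger, time.Now()))
}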

@functicons I'd love to get your feedback on this. We might want to build a stronger solution later on, but for now savepoints are impossible to use with this operator if they take more than a short time to complete.

@functicons (Collaborator)

/gcbrun

@functicons self-requested a review on January 13, 2021, 01:54
@functicons (Collaborator)

Thanks for the PR, will review as soon as I get a chance.

@functicons (Collaborator) left a comment


Left some comments, thanks!

SavepointTriggerReasonUpdate = "update"
SavepointTriggerReasonUserRequested = "user requested"
SavepointTriggerReasonScheduled = "scheduled"
SavepointTriggerReasonScheduledInitial = "scheduledInitial"
Collaborator

Add a comment about this reason.

@@ -347,6 +348,9 @@ type JobSpec struct {
// Allow non-restored state, default: false.
AllowNonRestoredState *bool `json:"allowNonRestoredState,omitempty"`

// Should take savepoint before upgrading the job, default: false.
ShouldTakeSavepointOnUpgrade *bool `json:"shouldTakeSavepointOnUpgrade,omitempty"`
Collaborator

nit: s/ShouldTakeSavepointOnUpgrade/TakeSavepointOnUpdate.

@elanv (Contributor) commented Feb 15, 2021

@shashken It would be nice to consider setting a default value for the takeSavepointOnUpgrade field; otherwise it could lead to a nil pointer error like #408. You can easily set it with a marker such as kubebuilder:default.

And for naming consistency, the field name TakeSavepointOnUpdate looks better to me.
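
A minimal sketch of that suggestion, assuming the marker sits on the new JobSpec field (the final field name was still under discussion here):

package v1beta1

// Sketch only: the kubebuilder marker below makes controller-gen emit a default
// into the generated CRD schema, so the pointer is populated even when the user
// omits the field and dereferencing it cannot cause a nil pointer panic.
type JobSpec struct {
	// Take a savepoint before upgrading the job, default: false.
	// +kubebuilder:default=false
	TakeSavepointOnUpgrade *bool `json:"takeSavepointOnUpgrade,omitempty"`
}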

Contributor

@shashken @functicons Also, the comment says the default value is false. Wouldn't it be better to set it to true? It seems common to restore the job from the latest savepoint.

@@ -567,6 +571,9 @@ type JobStatus struct {
// Last savepoint trigger ID.
LastSavepointTriggerID string `json:"lastSavepointTriggerID,omitempty"`

// Last successful or failed savepoint operation timestamp.
Collaborator

What if the operation is still in progress?

Contributor (Author)

I'll change the comment here. This flow is still not 100% bug-proof; I think more work needs to be done on the savepoint flow.

Collaborator

Then add more comments about the potential problems and TODOs.

Contributor (Author)

Fixed the doc there

if shouldUpdateJob(observed) {
log.Info("Job is about to be restarted to update")
restartJob = true
err := reconciler.restartJob(*jobSpec.ShouldTakeSavepointOnUpgrade)
Collaborator

I prefer keeping the existing structure and introducing a variable takeSavepointOnUpdate for the 2 cases.
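
A sketch of the suggested structure, reusing identifiers from the diff above (the flag name takeSavepointOnUpdate is the reviewer's suggestion; this is a fragment of the reconciler function, not standalone code):

// Keep the single restart path; only the new boolean differs between the cases.
var restartJob, takeSavepointOnUpdate bool
if shouldUpdateJob(observed) {
	log.Info("Job is about to be restarted to update")
	restartJob = true
	takeSavepointOnUpdate = *jobSpec.ShouldTakeSavepointOnUpgrade
} else if shouldRestartJob(restartPolicy, recordedJobStatus) {
	log.Info("Job is about to be restarted to recover failure")
	restartJob = true
}
if restartJob {
	if err := reconciler.restartJob(takeSavepointOnUpdate); err != nil {
		return requeueResult, err
	}
	return requeueResult, nil
}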

Contributor (Author)

Are you sure? Each case needs to pass a different argument to restartJob. I think it's much cleaner like this.

if ok, savepointTriggerReason := reconciler.shouldTakeSavepoint(); ok {
shouldTakeSavepont, savepointTriggerReason := reconciler.shouldTakeSavepoint()
if shouldTakeSavepont {
reconciler.updateSavepointTriggerTimeStatus()
Collaborator

This method might return an error.

@@ -42,7 +42,7 @@ const (
ControlRetries = "retries"
ControlMaxRetries = "3"

SavepointTimeoutSec = 60
SavepointTimeoutSec = 900 // 15 mins
Collaborator

Can we make it configurable as a field in the job spec?

Contributor (Author)

This might not be the best approach; I increased it for the moment, but I think we need to check the JobManager's API for the savepoint status in the next savepoint PR. Do you think it's bad that we increased it to 15 mins? Is there a case where someone will want to take a savepoint more often than every 15 mins?

Collaborator

This constant is actually the "minimal interval for triggering 2 savepoints", right? The name "Timeout" could be confusing; it might be misinterpreted as "the timeout for taking a savepoint (before considering it a failure)".

It is hard to determine the value. For example, I just took a savepoint 10 mins ago, but now I want to update my job, and I don't want to lose the state for the recent 10 mins, so I want it to take another savepoint before the update. Why do we need to introduce an arbitrary limit here?

@elanv (Contributor) commented Jan 20, 2021

There are three variables related to triggering savepoints.

  • SavepointAgeForJobUpdateSec: savepoint age limit required for update progress
  • SavepointRequestRetryIntervalSec: retry interval for savepoint failure on update
  • SavepointTimeoutSec: savepoint timeout

In some cases, the savepoint may no longer make progress due to errors, but the JobManager may still return the status normally. In that case, SavepointTimeoutSec is used to handle the timeout. For jobs that take a long time to create savepoints, it would be better to make this variable user-configurable and set its default value large enough.

SavepointTimeoutSec = 60
RevisionNameLabel = "flinkoperator.k8s.io/revision-name"
// TODO: need to be user configurable
SavepointAgeForJobUpdateSec = 300
SavepointRequestRetryIntervalSec = 10

@elanv (Contributor) commented Jan 23, 2021

I found that it is possible to set the checkpoint timeout with the Flink configuration. In my opinion, it would be better to remove the Flink operator's savepoint timeout routine to resolve the second issue and instead point users to the related Flink configuration.

note: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-timeout
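
As a reference, a hedged sketch of that approach, assuming Flink configuration reaches the cluster through the FlinkCluster spec's flinkProperties map; per the comment above, the Flink-side checkpoint timeout (default 10 min) is what eventually fails a stuck savepoint:

package main

import "fmt"

// Sketch only: raise the Flink-side checkpoint timeout instead of relying on
// the operator's SavepointTimeoutSec constant.
var flinkProperties = map[string]string{
	"execution.checkpointing.timeout": "30 min", // Flink's default is 10 min
}

func main() {
	fmt.Println(flinkProperties)
}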

Contributor (Author)

I think this might be better solved in another PR; this one provides a mitigation for clusters where savepoints take more than a few seconds.
I think we should discuss how to solve this in a separate issue and consider getting the info about the savepoint (is there an active one, did it time out, etc.) from the JobManager itself.
I can make this part of the CRD for now and delete it later when it is no longer needed (in another PR).
WDYT @elanv @functicons?

Collaborator

SGTM, let's address it in another PR.

@elanv (Contributor) commented Feb 1, 2021

Sorry for the late response. When a checkpoint timeout occurs in the Flink JobManager, the savepoint state falls to "failed", so I don't think the first savepoint needs to be identified. The second issue occurs because the default Flink checkpoint timeout is 10 minutes, but SavepointTimeoutSec is less than that. I think it's okay to handle that part in another PR.

Contributor

Which are the next PRs/issues? This one: #420?

It seems to work as the "minimal interval for triggering 2 savepoints", but some docs show autoSavepointSeconds: 300 as an example value, and I actually specify that value. Is this limitation a temporary workaround?

@elanv (Contributor) commented Mar 9, 2021

SavepointTimeoutSec is just the savepoint timeout, and autoSavepointSeconds is the savepoint trigger interval, as you mentioned. #420 is the PR to improve the savepoint routines, including this issue.

@@ -2,7 +2,7 @@ apiVersion: v1
name: flink-operator
appVersion: "1.0"
description: A Helm chart for flink on Kubernetes operator
version: "0.2.0"
version: "0.2.5"
Collaborator

Why upgrade to 5?

@@ -744,13 +743,21 @@ func (reconciler *ClusterReconciler) shouldTakeSavepoint() (bool, string) {
return false, ""
}

var lastTriggerTime = time.Time{}
Collaborator

Extract this block into a helper method and add comments.
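
A sketch of such a helper; the field name and timestamp format are assumptions based on the surrounding status fields:

package main

import (
	"fmt"
	"time"
)

// getLastSavepointTriggerTime turns the timestamp string recorded in the job
// status into a time.Time, treating an empty or unparsable value as "no
// savepoint has been triggered yet".
func getLastSavepointTriggerTime(lastSavepointTriggerTime string) time.Time {
	if lastSavepointTriggerTime == "" {
		return time.Time{}
	}
	t, err := time.Parse(time.RFC3339, lastSavepointTriggerTime)
	if err != nil {
		return time.Time{}
	}
	return t
}

func main() {
	fmt.Println(getLastSavepointTriggerTime("2021-01-19T14:08:00Z"))
	fmt.Println(getLastSavepointTriggerTime("")) // zero time: never triggered
}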

@shashken requested a review from functicons on January 19, 2021, 14:08

@shashken requested a review from functicons on January 25, 2021, 17:24

@functicons (Collaborator)

/gcbrun

@functicons merged commit 54a7d09 into GoogleCloudPlatform:master on Feb 1, 2021
Comment on lines +488 to +491
err = reconciler.updateSavepointTriggerTimeStatus()
if err != nil {
newSavepointStatus, _ = reconciler.takeSavepointAsync(jobID, savepointTriggerReason)
}
@elanv (Contributor) commented Feb 2, 2021

@shashken @functicons
It seems that the savepoint should be triggered when err == nil.
When I tested, I sometimes found that the savepoint was not triggered and only status.job.lastSavepointTriggerTime was updated.

Also, the FlinkCluster status is updated in the reconciler's updateStatus function, so if you plan to make a new PR, it might be worth considering how to call the status update function only once.
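
A sketch of the corrected check, mirroring the merged lines above; error handling beyond the flipped condition is omitted:

// Only trigger the savepoint when recording the trigger time succeeded.
err = reconciler.updateSavepointTriggerTimeStatus()
if err == nil {
	newSavepointStatus, _ = reconciler.takeSavepointAsync(jobID, savepointTriggerReason)
}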

Contributor (Author)

Damn, I tested the version before the CR change with that line, sorry about that.

Comment on lines -473 to 482
var restartJob bool
if shouldUpdateJob(observed) {
log.Info("Job is about to be restarted to update")
restartJob = true
err := reconciler.restartJob(*jobSpec.TakeSavepointOnUpgrade)
return requeueResult, err
} else if shouldRestartJob(restartPolicy, recordedJobStatus) {
log.Info("Job is about to be restarted to recover failure")
restartJob = true
}
if restartJob {
err := reconciler.restartJob()
if err != nil {
return requeueResult, err
}
return requeueResult, nil
err := reconciler.restartJob(false)
return requeueResult, err
}
@elanv (Contributor) commented Feb 15, 2021

@shashken @functicons There is no need to trigger a savepoint here. shouldUpdateJob checks whether the latest savepoint exists, and if it does not exist, a savepoint will be triggered in another routine. restartJob is just for restarting the job, so there is no need to trigger a savepoint.

@@ -575,14 +573,15 @@ func (reconciler *ClusterReconciler) getFlinkJobID() string {
return ""
}

func (reconciler *ClusterReconciler) restartJob() error {
func (reconciler *ClusterReconciler) restartJob(shouldTakeSavepoint bool) error {
Contributor

@shashken @functicons restartJob is just for restarting the job, so there is no need to trigger a savepoint here.
