This repository was archived by the owner on Sep 2, 2022. It is now read-only.

Fix savepoint problems #392

Merged
Changes from 6 commits
15 changes: 11 additions & 4 deletions api/v1beta1/flinkcluster_types.go
@@ -98,10 +98,11 @@ const (
SavepointStateFailed = "Failed"
SavepointStateSucceeded = "Succeeded"

SavepointTriggerReasonUserRequested = "user requested"
SavepointTriggerReasonScheduled = "scheduled"
SavepointTriggerReasonJobCancel = "job cancel"
SavepointTriggerReasonUpdate = "update"
SavepointTriggerReasonUserRequested = "user requested"
SavepointTriggerReasonScheduled = "scheduled"
SavepointTriggerReasonScheduledInitial = "scheduled initial" // The first triggered savepoint has slightly different flow
SavepointTriggerReasonJobCancel = "job cancel"
SavepointTriggerReasonUpdate = "update"
)

// ImageSpec defines Flink image of JobManager and TaskManager containers.
@@ -347,6 +348,9 @@ type JobSpec struct {
// Allow non-restored state, default: false.
AllowNonRestoredState *bool `json:"allowNonRestoredState,omitempty"`

// Should take savepoint before upgrading the job, default: false.
TakeSavepointOnUpgrade *bool `json:"takeSavepointOnUpgrade,omitempty"`

// Savepoints dir where to store savepoints of the job.
SavepointsDir *string `json:"savepointsDir,omitempty"`

@@ -567,6 +571,9 @@ type JobStatus struct {
// Last savepoint trigger ID.
LastSavepointTriggerID string `json:"lastSavepointTriggerID,omitempty"`

// Last successful or failed savepoint operation timestamp.
Collaborator: What if the operation is still in progress?

Contributor Author: I'll change the comment here. This flow is still not 100% bug-proof; I think more work needs to be done on the savepoint flow.

Collaborator: Then add more comments about the potential problems and TODOs.

Contributor Author: Fixed the doc there.

LastSavepointTriggerTime string `json:"lastSavepointTriggerTime,omitempty"`

// Last successful or failed savepoint operation timestamp.
LastSavepointTime string `json:"lastSavepointTime,omitempty"`

5 changes: 5 additions & 0 deletions api/v1beta1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

4 changes: 4 additions & 0 deletions config/crd/bases/flinkoperator.k8s.io_flinkclusters.yaml
@@ -163,6 +163,8 @@ spec:
type: integer
cancelRequested:
type: boolean
takeSavepointOnUpgrade:
type: boolean
className:
type: string
cleanupPolicy:
@@ -5146,6 +5148,8 @@ spec:
type: string
id:
type: string
lastSavepointTriggerTime:
type: string
lastSavepointTime:
type: string
lastSavepointTriggerID:
61 changes: 40 additions & 21 deletions controllers/flinkcluster_reconciler.go
@@ -468,31 +468,29 @@ func (reconciler *ClusterReconciler) reconcileJob() (ctrl.Result, error) {
var jobID = reconciler.getFlinkJobID()
var restartPolicy = observed.cluster.Spec.Job.RestartPolicy
var recordedJobStatus = observed.cluster.Status.Components.Job
var jobSpec = reconciler.observed.cluster.Spec.Job

// Update or recover Flink job by restart.
var restartJob bool
if shouldUpdateJob(observed) {
log.Info("Job is about to be restarted to update")
restartJob = true
err := reconciler.restartJob(*jobSpec.TakeSavepointOnUpgrade)
return requeueResult, err
} else if shouldRestartJob(restartPolicy, recordedJobStatus) {
log.Info("Job is about to be restarted to recover failure")
restartJob = true
}
if restartJob {
err := reconciler.restartJob()
if err != nil {
return requeueResult, err
}
return requeueResult, nil
err := reconciler.restartJob(false)
return requeueResult, err
}
Comment on lines -473 to 482

Contributor (@elanv, Feb 15, 2021): @shashken @functicons There is no need to trigger a savepoint here. This is because shouldUpdateJob checks whether the latest savepoint exists and, if it does not exist, a savepoint will be triggered in another routine. restartJob is just for restarting the job, so there is no need to trigger a savepoint.
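To make the flow the reviewer describes concrete, here is a minimal standalone Go sketch of the intended decision order; the function and parameter names are illustrative only, not the operator's actual code:

```go
package main

import "fmt"

// reconcileUpdate sketches the ordering described above: the update path only
// proceeds once a sufficiently recent savepoint exists; otherwise a savepoint
// is triggered first by another routine, so restartJob itself never needs to
// take one.
func reconcileUpdate(updateRequested, savepointUpToDate bool) string {
	switch {
	case !updateRequested:
		return "no update pending"
	case !savepointUpToDate:
		return "trigger savepoint first; update on a later reconcile"
	default:
		return "restart the job to apply the update"
	}
}

func main() {
	fmt.Println(reconcileUpdate(true, false))
	fmt.Println(reconcileUpdate(true, true))
}
```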


// Trigger savepoint if required.
if len(jobID) > 0 {
if ok, savepointTriggerReason := reconciler.shouldTakeSavepoint(); ok {
newSavepointStatus, _ = reconciler.takeSavepointAsync(jobID, savepointTriggerReason)
shouldTakeSavepoint, savepointTriggerReason := reconciler.shouldTakeSavepoint()
if shouldTakeSavepoint {
err = reconciler.updateSavepointTriggerTimeStatus()
if err != nil {
newSavepointStatus, _ = reconciler.takeSavepointAsync(jobID, savepointTriggerReason)
}
Comment on lines +488 to +491

Contributor (@elanv, Feb 2, 2021): @shashken @functicons
It seems that the savepoint should be triggered when err == nil. When I tested, I sometimes found that the savepoint was not triggered and only status.job.lastSavepointTriggerTime was updated.

And FlinkCluster status is updated in the updateStatus function of the reconciler, so if you plan to make a new PR, it might be worth considering how to call the status update function once.

Contributor Author: Damn, I tested the version before the CR change with that line, sorry about that.

}
}

log.Info("Job is not finished yet, no action", "jobID", jobID)
return requeueResult, nil
}
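The ordering the reviewer asks for in the thread above (trigger only when the trigger-time write succeeded, i.e. err == nil) can be sketched as a standalone snippet; the helper names are illustrative stand-ins rather than the reconciler's actual methods:

```go
package main

import "fmt"

// maybeTriggerSavepoint records the trigger time first and only triggers the
// savepoint when that write succeeded, so the recorded status and the actual
// trigger cannot drift apart.
func maybeTriggerSavepoint(recordTriggerTime func() error, takeSavepointAsync func() error) error {
	if err := recordTriggerTime(); err != nil {
		// Recording failed: skip this round and let the next reconcile retry.
		return err
	}
	return takeSavepointAsync()
}

func main() {
	err := maybeTriggerSavepoint(
		func() error { return nil }, // pretend the status update succeeded
		func() error { fmt.Println("savepoint triggered"); return nil },
	)
	fmt.Println("err:", err)
}
```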
@@ -575,14 +573,15 @@ func (reconciler *ClusterReconciler) getFlinkJobID() string {
return ""
}

func (reconciler *ClusterReconciler) restartJob() error {
func (reconciler *ClusterReconciler) restartJob(shouldTakeSavepoint bool) error {
Contributor: @shashken @functicons restartJob is just for restarting the job, so there is no need to trigger a savepoint here.

var log = reconciler.log
var observedJob = reconciler.observed.job
var observedFlinkJob = reconciler.observed.flinkJobStatus.flinkJob

log.Info("Stopping Flink job to restart", "", observedFlinkJob)
shouldTakeSavepoint = shouldTakeSavepoint && canTakeSavepoint(*reconciler.observed.cluster)

var err = reconciler.cancelRunningJobs(false /* takeSavepoint */)
var err = reconciler.cancelRunningJobs(shouldTakeSavepoint /* takeSavepoint */)
if err != nil {
return err
}
@@ -744,19 +743,31 @@ func (reconciler *ClusterReconciler) shouldTakeSavepoint() (bool, string) {
return false, ""
}

var nextOkTriggerTime = getNextOkTime(jobStatus.LastSavepointTriggerTime, SavepointTimeoutSec)
if time.Now().Before(nextOkTriggerTime) {
return false, ""
}

// First savepoint.
if len(jobStatus.LastSavepointTime) == 0 {
return true, v1beta1.SavepointTriggerReasonScheduled
return true, v1beta1.SavepointTriggerReasonScheduledInitial
}

// Interval expired.
var tc = &TimeConverter{}
var lastTime = tc.FromString(jobStatus.LastSavepointTime)
var nextTime = lastTime.Add(
time.Duration(int64(*jobSpec.AutoSavepointSeconds) * int64(time.Second)))
// Scheduled, check if next trigger time arrived.
var nextTime = getNextOkTime(jobStatus.LastSavepointTime, int64(*jobSpec.AutoSavepointSeconds))
return time.Now().After(nextTime), v1beta1.SavepointTriggerReasonScheduled
}

// Convert raw time to object and add `addedSeconds` to it
func getNextOkTime(rawTime string, addedSeconds int64) time.Time {
var tc = &TimeConverter{}
var lastTriggerTime = time.Time{}
if len(rawTime) != 0 {
lastTriggerTime = tc.FromString(rawTime)
}
return lastTriggerTime.Add(time.Duration(addedSeconds * int64(time.Second)))
}

// Trigger savepoint for a job then return savepoint status to update.
func (reconciler *ClusterReconciler) takeSavepointAsync(jobID string, triggerReason string) (*v1beta1.SavepointStatus, error) {
var log = reconciler.log
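For readers following the new throttle, here is a self-contained sketch of what getNextOkTime above enables: a savepoint is only retriggered once the last trigger time plus the window has passed. time.RFC3339 stands in for the operator's TimeConverter format and the 900-second window mirrors SavepointTimeoutSec from this PR; both are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// nextOkTime mirrors the shape of getNextOkTime above: an empty timestamp
// means "never triggered", so the zero time plus the window is always in the
// past and the first savepoint is allowed immediately.
func nextOkTime(rawTime string, addedSeconds int64) time.Time {
	last := time.Time{}
	if rawTime != "" {
		last, _ = time.Parse(time.RFC3339, rawTime) // assumed timestamp format
	}
	return last.Add(time.Duration(addedSeconds) * time.Second)
}

func main() {
	const savepointTimeoutSec = 900 // 15 min, as in this PR
	lastTrigger := time.Now().Add(-5 * time.Minute).Format(time.RFC3339)
	// Triggered 5 minutes ago: still inside the 15-minute window, so
	// shouldTakeSavepoint would return false for now.
	fmt.Println("can trigger again:", time.Now().After(nextOkTime(lastTrigger, savepointTimeoutSec)))
}
```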
@@ -819,6 +830,14 @@ func (reconciler *ClusterReconciler) takeSavepoint(
return err
}

func (reconciler *ClusterReconciler) updateSavepointTriggerTimeStatus() error {
var cluster = v1beta1.FlinkCluster{}
reconciler.observed.cluster.DeepCopyInto(&cluster)
var jobStatus = cluster.Status.Components.Job
setTimestamp(&jobStatus.LastSavepointTriggerTime)
return reconciler.k8sClient.Status().Update(reconciler.context, &cluster)
}

func (reconciler *ClusterReconciler) updateSavepointStatus(
savepointStatus flinkclient.SavepointStatus) error {
var cluster = v1beta1.FlinkCluster{}
2 changes: 1 addition & 1 deletion controllers/flinkcluster_util.go
@@ -42,7 +42,7 @@ const (
ControlRetries = "retries"
ControlMaxRetries = "3"

SavepointTimeoutSec = 60
SavepointTimeoutSec = 900 // 15 mins
Collaborator: Can we make it configurable as a field in the job spec?

Contributor Author: This might not be the best fix; I increased it for the moment, but I think we need to check the jobmanager's API to see the savepoint status in the next savepoint PR. Do you think it's bad that we increased it to 15 mins? Is there a case where someone will want to take a savepoint more often than every 15 mins?

Collaborator: This constant is actually the "minimal interval for triggering 2 savepoints", right? The name "Timeout" could be confusing; it might be misinterpreted as "the timeout for taking a savepoint (before considering it a failure)".

It is hard to determine the value. For example, I just took a savepoint 10 mins ago, but now I want to update my job, and I don't want to lose the state for the recent 10 mins, so I want it to take another savepoint before the update. Why do we need to introduce an arbitrary limit here?

Contributor (@elanv, Jan 20, 2021): There are three variables related to triggering savepoints.

  • SavepointAgeForJobUpdateSec: savepoint age limit required for update progress
  • SavepointRequestRetryIntervalSec: retry interval for savepoint failure on update
  • SavepointTimeoutSec: savepoint timeout

In some cases, the savepoint may no longer proceed due to some errors, but the job manager may still return the status normally. In that case, SavepointTimeoutSec is used to handle the timeout. For jobs that require a long time to create savepoints, it would be better to make this variable user-configurable and set its default value large enough.

SavepointTimeoutSec = 60
RevisionNameLabel = "flinkoperator.k8s.io/revision-name"
// TODO: need to be user configurable
SavepointAgeForJobUpdateSec = 300
SavepointRequestRetryIntervalSec = 10

Contributor (@elanv, Jan 23, 2021): I found that it is possible to set the checkpoint timeout with the Flink configuration. In my opinion, it would be better to remove the Flink operator's savepoint timeout routine to resolve the second issue and document the related Flink configuration.

note: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#execution-checkpointing-timeout

Contributor Author: I think this might be better solved in another PR; this one provides a mitigation for clusters where a savepoint takes more than a few seconds.
I think we should discuss how we want to solve this in another issue and consider getting the info about the savepoint (is there an active one, has it timed out, etc.) from the jobmanager itself.
I can make this a part of the CRD for now and later delete it when it is no longer needed (in another PR).
WDYT @elanv @functicons

Collaborator: SGTM, let's address it in another PR.

Contributor (@elanv, Feb 1, 2021): Sorry for the late response. When a checkpoint timeout occurs in the Flink jobmanager, the savepoint state falls to "failed", so I don't think the first savepoint needs to be identified. The second issue is occurring because the default Flink checkpoint timeout is 10 minutes, but SavepointTimeoutSec is less than that. I think it's okay to handle that part in another PR.

Contributor: Which are the next PRs/issues? This one? #420

It seems to work as the "minimal interval for triggering 2 savepoints", but some docs show autoSavepointSeconds: 300 as an example value and I actually specify that value. Is this limitation a temporary workaround?

Contributor (@elanv, Mar 9, 2021): SavepointTimeoutSec is just the savepoint timeout and autoSavepointSeconds is the savepoint trigger interval, as you mentioned. And #420 is the PR to improve savepoint routines, including this issue.
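
As a concrete reading of the discussion above, here is a hedged sketch of the timeout check SavepointTimeoutSec is meant to back: a savepoint that was triggered, never completed, and exceeded the window is treated as failed. Only the 900-second value comes from this PR; the function and field names are illustrative, not the operator's actual API.

```go
package main

import (
	"fmt"
	"time"
)

const savepointTimeoutSec = 900 // 15 mins, as set in this PR

// savepointTimedOut reports whether a pending savepoint should be considered
// failed: it was triggered, never completed, and the timeout window elapsed.
func savepointTimedOut(triggerTime time.Time, completed bool, now time.Time) bool {
	return !completed && now.Sub(triggerTime) > savepointTimeoutSec*time.Second
}

func main() {
	trigger := time.Now().Add(-20 * time.Minute)
	// Triggered 20 minutes ago with no result: past the 15-minute window.
	fmt.Println("timed out:", savepointTimedOut(trigger, false, time.Now()))
}
```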


RevisionNameLabel = "flinkoperator.k8s.io/revision-name"

65 changes: 35 additions & 30 deletions controllers/flinkcluster_util_test.go
@@ -230,9 +230,10 @@ func TestShouldUpdateJob(t *testing.T) {
cluster: &v1beta1.FlinkCluster{
Status: v1beta1.FlinkClusterStatus{
Components: v1beta1.FlinkClusterComponentsStatus{Job: &v1beta1.JobStatus{
State: v1beta1.JobStateRunning,
LastSavepointTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
State: v1beta1.JobStateRunning,
LastSavepointTime: tc.ToString(savepointTime),
LastSavepointTriggerTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
}},
CurrentRevision: "1", NextRevision: "2",
},
@@ -264,9 +265,10 @@ func TestShouldUpdateJob(t *testing.T) {
cluster: &v1beta1.FlinkCluster{
Status: v1beta1.FlinkClusterStatus{
Components: v1beta1.FlinkClusterComponentsStatus{Job: &v1beta1.JobStatus{
State: v1beta1.JobStateRunning,
LastSavepointTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
State: v1beta1.JobStateRunning,
LastSavepointTime: tc.ToString(savepointTime),
LastSavepointTriggerTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
}},
CurrentRevision: "1", NextRevision: "2",
},
@@ -325,9 +327,10 @@ func TestIsSavepointUpToDate(t *testing.T) {
var savepointTime = time.Now()
var observeTime = savepointTime.Add(time.Second * 100)
var jobStatus = v1beta1.JobStatus{
State: v1beta1.JobStateFailed,
LastSavepointTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
State: v1beta1.JobStateFailed,
LastSavepointTime: tc.ToString(savepointTime),
LastSavepointTriggerTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
}
var update = isSavepointUpToDate(observeTime, jobStatus)
assert.Equal(t, update, true)
@@ -336,9 +339,10 @@ func TestIsSavepointUpToDate(t *testing.T) {
savepointTime = time.Now()
observeTime = savepointTime.Add(time.Second * 500)
jobStatus = v1beta1.JobStatus{
State: v1beta1.JobStateFailed,
LastSavepointTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
State: v1beta1.JobStateFailed,
LastSavepointTime: tc.ToString(savepointTime),
LastSavepointTriggerTime: tc.ToString(savepointTime),
SavepointLocation: "gs://my-bucket/savepoint-123",
}
update = isSavepointUpToDate(observeTime, jobStatus)
assert.Equal(t, update, false)
@@ -347,8 +351,9 @@ func TestIsSavepointUpToDate(t *testing.T) {
savepointTime = time.Now()
observeTime = savepointTime.Add(time.Second * 500)
jobStatus = v1beta1.JobStatus{
State: v1beta1.JobStateFailed,
LastSavepointTime: tc.ToString(savepointTime),
State: v1beta1.JobStateFailed,
LastSavepointTime: tc.ToString(savepointTime),
LastSavepointTriggerTime: tc.ToString(savepointTime),
}
update = isSavepointUpToDate(observeTime, jobStatus)
assert.Equal(t, update, false)
@@ -408,8 +413,8 @@ func TestIsFlinkAPIReady(t *testing.T) {
Status: v1beta1.FlinkClusterStatus{NextRevision: "cluster-85dc8f749-2"},
},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
flinkJobStatus: FlinkJobStatus{flinkJobList: &flinkclient.JobStatusList{}},
}
@@ -425,10 +430,10 @@ func TestIsFlinkAPIReady(t *testing.T) {
},
Status: v1beta1.FlinkClusterStatus{NextRevision: "cluster-85dc8f749-2"},
},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
}
ready = isFlinkAPIReady(observed)
assert.Equal(t, ready, false)
@@ -442,9 +447,9 @@ func TestIsFlinkAPIReady(t *testing.T) {
},
Status: v1beta1.FlinkClusterStatus{NextRevision: "cluster-85dc8f749-2"},
},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
}
ready = isFlinkAPIReady(observed)
assert.Equal(t, ready, false)
@@ -458,10 +463,10 @@ func TestIsFlinkAPIReady(t *testing.T) {
},
Status: v1beta1.FlinkClusterStatus{NextRevision: "cluster-85dc8f749-2"},
},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
}
ready = isFlinkAPIReady(observed)
assert.Equal(t, ready, false)
@@ -478,11 +483,11 @@ func TestGetUpdateState(t *testing.T) {
Components: v1beta1.FlinkClusterComponentsStatus{Job: &v1beta1.JobStatus{State: v1beta1.JobStateRunning}},
CurrentRevision: "cluster-85dc8f749-2", NextRevision: "cluster-aa5e3a87z-3"},
},
job: &batchv1.Job{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
job: &batchv1.Job{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
}
var state = getUpdateState(observed)
assert.Equal(t, state, UpdateStatePreparing)
@@ -497,7 +502,7 @@ func TestGetUpdateState(t *testing.T) {
},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-85dc8f749"}}},
}
state = getUpdateState(observed)
assert.Equal(t, state, UpdateStateInProgress)
@@ -510,12 +515,12 @@ func TestGetUpdateState(t *testing.T) {
},
Status: v1beta1.FlinkClusterStatus{CurrentRevision: "cluster-85dc8f749-2", NextRevision: "cluster-aa5e3a87z-3"},
},
job: &batchv1.Job{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
job: &batchv1.Job{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
configMap: &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
jmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
tmStatefulSet: &appsv1.StatefulSet{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
jmIngress: &extensionsv1beta1.Ingress{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
jmService: &corev1.Service{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
jmIngress: &extensionsv1beta1.Ingress{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{RevisionNameLabel: "cluster-aa5e3a87z"}}},
}
state = getUpdateState(observed)
assert.Equal(t, state, UpdateStateFinished)
2 changes: 2 additions & 0 deletions docs/crd.md
@@ -69,6 +69,7 @@ FlinkCluster
|__ args
|__ fromSavepoint
|__ allowNonRestoredState
|__ takeSavepointOnUpgrade
|__ autoSavepointSeconds
|__ savepointsDir
|__ savepointGeneration
@@ -261,6 +262,7 @@ FlinkCluster
* **autoSavepointSeconds** (optional): Automatically take a savepoint to the `savepointsDir` every n seconds.
* **savepointsDir** (optional): Savepoints dir where to store automatically taken savepoints.
* **allowNonRestoredState** (optional): Allow non-restored state, default: false.
* **takeSavepointOnUpgrade** (optional): Should take savepoint before upgrading the job, default: false.
* **savepointGeneration** (optional): Update this field to `jobStatus.savepointGeneration + 1` for a running job
cluster to trigger a new savepoint to `savepointsDir` on demand.
* **parallelism** (optional): Parallelism of the job, default: 1.
2 changes: 1 addition & 1 deletion helm-chart/flink-operator/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v1
name: flink-operator
appVersion: "1.0"
description: A Helm chart for flink on Kubernetes operator
version: "0.2.0"
version: "0.2.1"
keywords:
- flink
home: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
4 changes: 4 additions & 0 deletions helm-chart/flink-operator/templates/flink-cluster-crd.yaml
@@ -164,6 +164,8 @@ spec:
type: integer
cancelRequested:
type: boolean
takeSavepointOnUpgrade:
type: boolean
className:
type: string
cleanupPolicy:
@@ -4976,6 +4978,8 @@ spec:
type: string
lastSavepointTime:
type: string
lastSavepointTriggerTime:
type: string
lastSavepointTriggerID:
type: string
name:
Expand Down