Skip to content

[Bug] After rollback-directly cancellation, canaryStatus retains stale "step paused" state #339

@LittleChimera

Description

@LittleChimera

Bug Report

When a canary release is cancelled because the workload is rolled back directly (image reverted to the stable revision), the controller drains the BatchRelease and finalises traffic routing correctly, but it does not clear the canary sub-status. The rollout ends up in phase: Healthy with Progressing: False / Completed, yet canaryStatus still holds the step-machine fields from the moment the rollback was triggered (currentStepIndex, currentStepState: StepPaused, podTemplateHash of the failed revision, canaryReplicas, canaryReadyReplicas, …).

Two visible symptoms:

  1. The rollout looks "stuck at step N paused" forever in tooling that gates on canaryStatus.currentStepState.
  2. kubectl kruise rollout approve is a no-op — the dispatch in reconcileRollout only runs reconcileRolloutProgressing while phase == Progressing, so patching NextStepIndex has no effect.

Reproduce

  1. Canary rollout with at least 2 steps, e.g. step 1 = 1 replica + traffic split, step 2 = 100%.
  2. Deploy a bad version → batch 1 brings up 1 canary pod that fails its health check (or any external bake mechanism).
  3. Revert the workload's pod template to the stable revision (rollback-directly path).
  4. Wait for the canary RS to scale to 0 and the deployment to settle on the stable RS.
  5. Inspect: kubectl get rollout.rollouts.kruise.io <name> -o yaml.

Observed:

status:
  phase: Healthy
  canaryStatus:
    currentStepIndex: 1
    currentStepState: StepPaused
    podTemplateHash: <hash-of-failed-revision>
    canaryReplicas: 1
    canaryReadyReplicas: 1
  conditions:
  - type: Progressing
    status: "False"
    reason: Completed
    message: Rollout progressing has been cancelled

kubectl kruise rollout approve rollout.rollouts.kruise.io/<name> does nothing.

Root cause

reconcileRolloutProgressing (pkg/controller/rollout/rollout_progressing.go) handles the cancelling case by calling doFinalising and, when it returns done, transitioning the Progressing condition to Completed and the Succeeded condition to False. It never clears the sub-status.

case v1alpha1.ProgressingReasonCancelling:
    rolloutContext.FinalizeReason = v1beta1.FinaliseReasonRollback
    done, err = r.doFinalising(rolloutContext)
    if err != nil {
        return nil, err
    } else if done {
        progressingStateTransition(newStatus, corev1.ConditionFalse, v1alpha1.ProgressingReasonCompleted, "Rollout progressing has been canceled")
        setRolloutSucceededCondition(newStatus, corev1.ConditionFalse)
    }

Compare to handleContinuousRelease, which calls c.NewStatus.Clear() after the same kind of reset — that path leaves a clean status, the rollback path does not.

Expected

After cancellation finalising completes, canaryStatus/blueGreenStatus should be cleared (same treatment as continuous release), so subsequent inspection and downstream tooling see the rollout as idle, not "step paused".

Fix

Call newStatus.Clear() in the cancelling → completed branch. PR coming.

Environment

  • openkruise/rollouts master (v0.6.2 / commit b2600e9), Kubernetes 1.x, Deployment workload.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions