- Parent: BEP-1049: Zero-Downtime Deployment Strategy Architecture
- Related: Blue-Green Deployment Strategy
Rolling Update is a deployment strategy that gradually replaces existing routes (Old Revision) with new routes (New Revision). Two parameters — max_surge and max_unavailable — control the replacement pace, and the deployment remains in DEPLOYING state across multiple cycles until the full replacement is complete.
RollingUpdateSpec:
    max_surge: int = 1        # Max additional routes to create simultaneously beyond desired_replicas
    max_unavailable: int = 0  # Max unavailable routes to allow relative to desired_replicas
The endpoints table has two columns for revision management:
- deploying_revision: The revision currently being deployed (NULL when no deployment is in progress)
- current_revision: The revision currently serving traffic
The DeploymentStrategyEvaluator periodically evaluates each Rolling Update deployment. Each invocation follows this FSM:
┌──────────────────────────────────────┐
│ Any New routes PROVISIONING? │──Yes──→ provisioning (wait)
└──────────────────┬───────────────────┘
                  No
                   ▼
┌──────────────────────────────────────────────────┐
│ No Old and New healthy >= desired_replicas? │──Yes──→ completed (replacement done)
└──────────────────┬───────────────────────────────┘
                  No
                   ▼
┌──────────────────────────────────────┐
│ Calculate max_surge/max_unavailable │
│ │
│ to_create = min(can_create, │
│ need_create) │
│ to_terminate = min(can_terminate, │
│ old_active) │
│ │
│ → Create New routes (ACTIVE) │
│ → Terminate Old routes (TERMINATING)│
└──────────────────────────────────────┘
│
▼
progressing
Rollback is not decided by the FSM itself. If all new routes fail, the FSM will keep attempting to create new routes via the surge/unavailable calculation. Eventually the DEPLOYING timeout (30 min) is exceeded and the coordinator transitions the deployment to ROLLING_BACK via the expired path.
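The decision order of the FSM can be sketched as a small pure function. This is a minimal sketch, not the evaluator's actual code: the names FsmOutcome, RouteCounts, and next_outcome are hypothetical, and only the three branch conditions are taken from the diagram.

```python
from dataclasses import dataclass
from enum import Enum


class FsmOutcome(Enum):
    PROVISIONING = "provisioning"  # wait for New routes to come up
    COMPLETED = "completed"        # replacement done
    PROGRESSING = "progressing"    # run the surge/unavailable calculation


@dataclass(frozen=True)
class RouteCounts:
    old_active: int
    new_provisioning: int
    new_healthy: int


def next_outcome(counts: RouteCounts, desired_replicas: int) -> FsmOutcome:
    # 1. Any New route still PROVISIONING: wait, change nothing this cycle.
    if counts.new_provisioning > 0:
        return FsmOutcome.PROVISIONING
    # 2. No Old routes left and enough healthy New routes: done.
    if counts.old_active == 0 and counts.new_healthy >= desired_replicas:
        return FsmOutcome.COMPLETED
    # 3. Otherwise create/terminate routes within the configured limits.
    return FsmOutcome.PROGRESSING
```

There is no failure branch in this sketch, which mirrors the text: a deployment whose New routes all fail keeps landing in the third branch until the coordinator's timeout fires.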
Routes are classified by revision and status:
| Category | Condition | Description |
|---|---|---|
| old_active | revision != deploying_revision, is_active() | Old routes currently serving traffic |
| new_provisioning | revision == deploying_revision, PROVISIONING | New routes being created |
| new_healthy | revision == deploying_revision, HEALTHY | New routes ready to serve |
| new_unhealthy | revision == deploying_revision, UNHEALTHY/DEGRADED | New routes with issues |
| new_failed | revision == deploying_revision, FAILED/TERMINATED | New routes that failed |
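The classification table can be expressed as a single function over a route's revision and status. The sketch below is illustrative only: Route, RouteStatus, and classify are hypothetical names, and the set of statuses covered by is_active() is an assumption, since this section does not define it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED = "failed"
    TERMINATING = "terminating"
    TERMINATED = "terminated"


# Assumption: is_active() excludes failed and terminating/terminated routes.
_INACTIVE = {RouteStatus.FAILED, RouteStatus.TERMINATING, RouteStatus.TERMINATED}


@dataclass(frozen=True)
class Route:
    revision: str
    status: RouteStatus

    def is_active(self) -> bool:
        return self.status not in _INACTIVE


def classify(route: Route, deploying_revision: str) -> Optional[str]:
    """Return the category name from the table, or None if no row matches."""
    if route.revision != deploying_revision:
        return "old_active" if route.is_active() else None
    if route.status is RouteStatus.PROVISIONING:
        return "new_provisioning"
    if route.status is RouteStatus.HEALTHY:
        return "new_healthy"
    if route.status in (RouteStatus.UNHEALTHY, RouteStatus.DEGRADED):
        return "new_unhealthy"
    if route.status in (RouteStatus.FAILED, RouteStatus.TERMINATED):
        return "new_failed"
    return None  # e.g. a New route that is already TERMINATING
```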
All DEPLOYING deployments are handled by DeployingProvisioningHandler, which stays in the PROVISIONING sub-step throughout the entire deployment lifecycle. The handler runs the strategy evaluator each cycle:
| Result | Condition | Handler Action |
|---|---|---|
| success | Evaluator returns COMPLETED (no Old routes, New healthy >= desired) | Coordinator transitions to READY |
| need_retry | Route mutations executed (create/drain) | Stays in DEPLOYING/PROVISIONING, history recorded |
| skipped | No changes — routes still provisioning or waiting | No transition; coordinator checks for timeout |
| expired | Skipped deployment exceeds DEPLOYING timeout (30 min) | Coordinator transitions to DEPLOYING/ROLLING_BACK |
When a deployment transitions to ROLLING_BACK, the DeployingRollingBackHandler clears deploying_revision and transitions directly to READY.
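The handler results and the resulting coordinator transitions can be summarized as a lookup table. This is a sketch for illustration: next_state and the tuple encoding of (status, sub_step) are hypothetical, and only the transitions themselves come from the table and the rollback description.

```python
from typing import Dict, Optional, Tuple

State = Tuple[str, Optional[str]]  # (status, sub_step)

# Keyed by (current sub_step, handler result).
_TRANSITIONS: Dict[Tuple[str, str], State] = {
    ("PROVISIONING", "success"): ("READY", None),
    ("PROVISIONING", "need_retry"): ("DEPLOYING", "PROVISIONING"),
    ("PROVISIONING", "expired"): ("DEPLOYING", "ROLLING_BACK"),
    # DeployingRollingBackHandler clears deploying_revision, then goes to READY.
    ("ROLLING_BACK", "success"): ("READY", None),
}


def next_state(sub_step: str, result: str) -> Optional[State]:
    """Return the next (status, sub_step), or None when no transition happens."""
    return _TRANSITIONS.get((sub_step, result))
```

A skipped result has no entry, so next_state returns None and the deployment stays where it is until either a later cycle makes progress or the timeout check marks it expired.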
- Zero-downtime protection: When max_unavailable < desired_replicas, the evaluator never terminates all Old routes until at least one New route is healthy
- Deadlock prevention: The RollingUpdateSpec validator ensures at least one of max_surge or max_unavailable is positive
- Timeout-based rollback: The FSM does not detect failure; the coordinator's timeout mechanism handles it. If the deployment cannot complete within the DEPLOYING timeout (30 min), the coordinator transitions to ROLLING_BACK via the expired path
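The deadlock-prevention rule can be sketched as a validating dataclass. This is an assumption-laden illustration rather than the project's validator: the use of a dataclass with __post_init__ and the non-negativity check are choices made here, while the field names, defaults, and the "at least one positive" rule come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpec:
    max_surge: int = 1        # extra routes allowed beyond desired_replicas
    max_unavailable: int = 0  # routes allowed to be unavailable

    def __post_init__(self) -> None:
        # Assumption: negative values are rejected (not stated in the text).
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        # Deadlock prevention: with both at 0, no route could ever be
        # created or terminated, so the deployment would never progress.
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError(
                "at least one of max_surge or max_unavailable must be positive"
            )
```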
Example with desired_replicas = 3, max_surge = 1, max_unavailable = 1:
Constraints:
┌──────────────────────────────────────────────────────────┐
│ max_total = desired_replicas + max_surge = 4 │
│ → Total active routes cannot exceed 4 │
│ │
│ min_available = desired_replicas - max_unavailable = 2 │
│ → Healthy routes must not drop below 2 │
└──────────────────────────────────────────────────────────┘
Creation calculation:
┌────────────────────────────────────────────────────────────────────┐
│ can_create = max(0, max_total - total_active) │
│ need_create = max(0, desired_replicas - new_healthy - new_prov) │
│ to_create = min(can_create, need_create) │
└────────────────────────────────────────────────────────────────────┘
Termination calculation:
┌──────────────────────────────────────────────────────────┐
│ healthy_count = new_healthy + old_active │
│ can_terminate = max(0, healthy_count - min_available) │
│ to_terminate = min(can_terminate, old_active) │
└──────────────────────────────────────────────────────────┘
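The three calculations above combine into one function per cycle. The sketch below uses the formulas as written; the names RollingPlan and plan_cycle are hypothetical, and the definition of total_active (Old active plus New healthy plus New provisioning) is an assumption, since the text only shows its value.

```python
from typing import NamedTuple


class RollingPlan(NamedTuple):
    to_create: int
    to_terminate: int


def plan_cycle(
    desired_replicas: int,
    max_surge: int,
    max_unavailable: int,
    old_active: int,
    new_healthy: int,
    new_provisioning: int = 0,
) -> RollingPlan:
    # Constraints
    max_total = desired_replicas + max_surge
    min_available = desired_replicas - max_unavailable

    # Creation (assumption: total_active counts all three groups)
    total_active = old_active + new_healthy + new_provisioning
    can_create = max(0, max_total - total_active)
    need_create = max(0, desired_replicas - new_healthy - new_provisioning)

    # Termination
    healthy_count = new_healthy + old_active
    can_terminate = max(0, healthy_count - min_available)

    return RollingPlan(
        to_create=min(can_create, need_create),
        to_terminate=min(can_terminate, old_active),
    )
```

With desired_replicas = 3, max_surge = 1, max_unavailable = 1 and three Old routes, plan_cycle(3, 1, 1, old_active=3, new_healthy=0) yields one creation and one termination. With max_unavailable = 0 the same state yields one creation and no termination, because min_available equals the current healthy count.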
desired_replicas=3, max_surge=1, max_unavailable=1:
Cycle 0 (initial state)
┌─────────────────────────────────────────────────────┐
│ Old: [■ ■ ■] (3 healthy) │
│ New: [] │
│ │
│ total_active=3, max_total=4 → can_create=1 │
│ need_create=3 → to_create=1 │
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
│
▼
Cycle 1 (New provisioning)
┌─────────────────────────────────────────────────────┐
│ Old: [■ ■] (2 healthy) │
│ New: [◇] (1 provisioning) │
│ │
│ → PROVISIONING exists → skipped (wait) │
└─────────────────────────────────────────────────────┘
│
▼
Cycle 2 (1 New healthy)
┌─────────────────────────────────────────────────────┐
│ Old: [■ ■] (2 healthy) │
│ New: [■] (1 healthy) │
│ │
│ total_active=3, max_total=4 → can_create=1 │
│ need_create=2 → to_create=1 │
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
│
▼
Cycle 3 (New provisioning)
┌─────────────────────────────────────────────────────┐
│ Old: [■] (1 healthy) │
│ New: [■ ◇] (1 healthy, 1 provisioning) │
│ │
│ → PROVISIONING exists → skipped (wait) │
└─────────────────────────────────────────────────────┘
│
▼
Cycle 4 (2 New healthy)
┌─────────────────────────────────────────────────────┐
│ Old: [■] (1 healthy) │
│ New: [■ ■] (2 healthy) │
│ │
│ total_active=3, max_total=4 → can_create=1 │
│ need_create=1 → to_create=1 │
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
│
▼
Cycle 5 (waiting for provisioning)
┌─────────────────────────────────────────────────────┐
│ Old: [] │
│ New: [■ ■ ◇] (2 healthy, 1 provisioning) │
│ │
│ → PROVISIONING exists → skipped (wait) │
└─────────────────────────────────────────────────────┘
│
▼
Cycle 6 (completed)
┌─────────────────────────────────────────────────────┐
│ Old: [] │
│ New: [■ ■ ■] (3 healthy) │
│ │
│ No Old and New >= desired_replicas → completed │
│ → deploying_revision → current_revision swap │
│ → success → coordinator transitions to READY │
└─────────────────────────────────────────────────────┘
Legend: ■ = healthy, ◇ = provisioning
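The cycle trace can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not part of the design: every provisioning route becomes healthy exactly one cycle after it is observed, no route ever fails, and the function name simulate is hypothetical.

```python
from typing import List


def simulate(
    desired: int, max_surge: int, max_unavailable: int, max_cycles: int = 50
) -> List[str]:
    """Return the handler result of each cycle until the deployment completes."""
    old, new_healthy, new_prov = desired, 0, 0
    results: List[str] = []
    for _ in range(max_cycles):
        if new_prov > 0:
            # PROVISIONING exists: wait this cycle.
            results.append("skipped")
            # Assumption: provisioning finishes before the next cycle.
            new_healthy += new_prov
            new_prov = 0
            continue
        if old == 0 and new_healthy >= desired:
            results.append("success")
            return results
        total_active = old + new_healthy
        can_create = max(0, desired + max_surge - total_active)
        need_create = max(0, desired - new_healthy)
        can_terminate = max(0, old + new_healthy - (desired - max_unavailable))
        to_create = min(can_create, need_create)
        to_terminate = min(can_terminate, old)
        new_prov += to_create
        old -= to_terminate
        results.append("need_retry" if (to_create or to_terminate) else "skipped")
    return results


print(simulate(desired=3, max_surge=1, max_unavailable=1))
```

For the example parameters the result alternates need_retry and skipped for cycles 0 through 5 and ends with success at cycle 6, matching the diagram.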
Deploying timeout is handled through the coordinator's generic expired transition mechanism:
- DeployingProvisioningHandler declares expired → DEPLOYING/ROLLING_BACK in status_transitions()
- Each cycle, the coordinator checks result.skipped deployments against the DEPLOYING timeout (30 min)
- Timeout is measured using phase_started_at from DeploymentWithHistory: the created_at of the first scheduling history record for this handler phase
- phase_started_at is stable across retries: history records with the same phase/error_code/to_status are merged (only attempts incremented, created_at unchanged)
- Timed-out deployments transition to DEPLOYING/ROLLING_BACK
- DeployingRollingBackHandler clears deploying_revision and transitions to READY
No separate timeout handler or periodic task is needed — timeout checking is built into the coordinator's standard transition handling.
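Why phase_started_at stays stable can be seen in a short sketch of history merging plus the timeout check. The names HistoryRecord, record_history, and is_expired are hypothetical, and merging only with the most recent record is an assumption; the merge key, the unchanged created_at, and the 30-minute limit come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

DEPLOYING_TIMEOUT = timedelta(minutes=30)


@dataclass
class HistoryRecord:
    phase: str
    error_code: Optional[str]
    to_status: Optional[str]
    created_at: datetime
    attempts: int = 1


def record_history(
    history: List[HistoryRecord],
    phase: str,
    error_code: Optional[str],
    to_status: Optional[str],
    now: datetime,
) -> HistoryRecord:
    last = history[-1] if history else None
    key = (phase, error_code, to_status)
    if last is not None and (last.phase, last.error_code, last.to_status) == key:
        last.attempts += 1  # merged: created_at is left untouched
        return last
    rec = HistoryRecord(phase, error_code, to_status, created_at=now)
    history.append(rec)
    return rec


def is_expired(
    history: List[HistoryRecord],
    phase: str,
    now: datetime,
    timeout: timedelta = DEPLOYING_TIMEOUT,
) -> bool:
    started = [r.created_at for r in history if r.phase == phase]
    if not started:
        return False
    phase_started_at = min(started)  # first record of this handler phase
    return now - phase_started_at > timeout
```

Because repeated results only bump attempts, the clock measured by is_expired starts at the first record of the phase and is not reset by retries.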
┌──────────────────────────────────────────────────────────────┐
│ DeploymentStrategyEvaluator │
│ (evaluator — strategy FSM + route changes) │
│ │
│ evaluate(deployments) → EvaluationResult │
│ 1. Load policy_map, route_map │
│ 2. For each deployment: │
│ policy = policy_map[deployment.id] │
│ strategy = policy.strategy │
│ 3. Dispatch by strategy: │
│ ROLLING → rolling_update_evaluate(...) │
│ 4. Aggregate route changes + group by sub_step │
│ Coordinator applies route changes after evaluation │
└──────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ rolling_update_evaluate(deployment, routes, spec) │
│ (evaluator internal strategy function) │
│ │
│ Route classification: │
│ ┌────────────────────────────────────────────────────┐ │
│ │ old_routes: revision != deploying_revision │ │
│ │ new_routes: revision == deploying_revision │ │
│ │ │ │
│ │ new_provisioning: new + PROVISIONING │ │
│ │ new_healthy: new + HEALTHY │ │
│ │ old_active: old + is_active() │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Route changes returned (applied by applier): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ rollout_specs: RouteCreatorSpec( │ │
│ │ revision_id = deploying_revision, │ │
│ │ traffic_status = ACTIVE ← differs from BG │ │
│ │ ) │ │
│ │ │ │
│ │ drain_route_ids: old route IDs │ │
│ │ → status = TERMINATING │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ DeployingProvisioningHandler │
│ (single handler for entire DEPLOYING lifecycle) │
│ │
│ completed → success → coordinator transitions to READY │
│ route mutations → need_retry → stays in PROVISIONING │
│ no changes → skipped → coordinator checks timeout │
│ evaluation errors → errors → classified by coordinator │
│ │
│ DeployingRollingBackHandler │
│ (cleanup on timeout) │
│ │
│ clear deploying_revision → success → READY │
└──────────────────────────────────────────────────────────────┘
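The route changes returned by rolling_update_evaluate can be sketched as a plain data structure. Only the field names rollout_specs, drain_route_ids, revision_id, and traffic_status come from the diagram; RouteChanges, build_route_changes, and the choice to drain Old routes in list order are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass(frozen=True)
class RouteCreatorSpec:
    revision_id: str
    # New routes are created ACTIVE (the diagram notes this differs from Blue-Green).
    traffic_status: str = "ACTIVE"


@dataclass
class RouteChanges:
    rollout_specs: List[RouteCreatorSpec] = field(default_factory=list)
    drain_route_ids: List[str] = field(default_factory=list)  # set to TERMINATING


def build_route_changes(
    deploying_revision: str,
    old_active_ids: Sequence[str],
    to_create: int,
    to_terminate: int,
) -> RouteChanges:
    return RouteChanges(
        rollout_specs=[
            RouteCreatorSpec(revision_id=deploying_revision)
            for _ in range(to_create)
        ],
        # Assumption: drain in list order; the selection policy is not
        # specified in this section.
        drain_route_ids=list(old_active_ids[:to_terminate]),
    )
```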
When all Old routes have been removed and at least desired_replicas New routes are healthy, completion proceeds as follows:
completed determination (evaluator)
│
▼
StrategyResultApplier.apply()
→ Atomic transaction:
1. complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
2. Returns completed_ids in StrategyApplyResult
│
▼
DeployingProvisioningHandler
→ completed_ids → successes
→ coordinator transitions DEPLOYING → READY
→ History recording
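The revision swap in step 1 can be done with a single UPDATE inside a transaction. The following is a sketch using the standard-library sqlite3 module purely for illustration: the actual storage layer is not described here, the id column is an assumption, and this function body is not the project's implementation of complete_deployment_revision_swap.

```python
import sqlite3
from typing import List, Sequence


def complete_deployment_revision_swap(
    conn: sqlite3.Connection, endpoint_ids: Sequence[str]
) -> List[str]:
    """Promote deploying_revision to current_revision for the given endpoints."""
    if not endpoint_ids:
        return []
    placeholders = ",".join("?" for _ in endpoint_ids)
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        # Both assignments read the row's pre-update values, so
        # current_revision receives the old deploying_revision.
        conn.execute(
            "UPDATE endpoints "
            "SET current_revision = deploying_revision, deploying_revision = NULL "
            f"WHERE id IN ({placeholders})",
            list(endpoint_ids),
        )
    return list(endpoint_ids)
```

Doing both assignments in one statement avoids any intermediate state in which an endpoint has neither revision set.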