feat(BA-3435): Implement Rolling Update deployment strategy #9997

Merged: HyeockJinKim merged 15 commits into main from BA-3435_3 on Mar 23, 2026

@jopemachine (Member) commented on Mar 12, 2026

resolves #7384 (BA-3435)

Overview

Implements the Rolling Update deployment strategy FSM (BEP-1049) — a pure-function evaluator that gradually replaces old-revision routes with new-revision routes, respecting surge and unavailability budgets.

Also refactors DEPLOYING timeout handling: removes the separate CHECK_DEPLOYING_TIMEOUT lifecycle type and DeployingTimeoutHandler, replacing them with the standard expired transition mechanism on DeployingProvisioningHandler. Timeout is now checked via phase_started_at from scheduling history (which is not reset on retries due to history merge), eliminating the need for the deploying_started_at column.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│  rolling_update.evaluate_cycle()          (strategy/rolling_update.py)      │
│                                                                             │
│  Pure function: (DeploymentInfo, routes, RollingUpdateSpec) → CycleResult   │
│                                                                             │
│  FSM:                                                                       │
│    1. Classify routes by revision_id:                                       │
│         old_active:       revision != deploying_revision, is_active()       │
│         new_provisioning: revision == deploying_revision, PROVISIONING      │
│         new_healthy:      revision == deploying_revision, HEALTHY           │
│         new_unhealthy:    revision == deploying_revision, UNHEALTHY/DEGRADED│
│         new_failed:       revision == deploying_revision, FAILED/TERMINATED │
│                                                                             │
│    2. new_provisioning? ───────────────────→ PROVISIONING (wait)            │
│    3. no old + new_healthy >= desired? ────→ COMPLETED                      │
│    4. all new failed/unhealthy? ──────────→ ROLLED_BACK                     │
│    5. Compute surge/unavailable budget:                                     │
│         max_total     = desired + max_surge                                 │
│         min_available = desired - max_unavailable                           │
│         to_create     = min(max_total - current, desired - new_live)        │
│         to_terminate  = min(available - min_available, old_active)          │
│       ─────────────────────────────────────→ PROGRESSING                    │
│                                              + RouteChanges(rollout_specs,  │
│                                                             drain_route_ids)│
└─────────────────────────────────────────────────────────────────────────────┘
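The budget arithmetic in step 5 can be sketched as a small pure function. This is only an illustration of the formulas in the diagram, not the code in strategy/rolling_update.py: the names `Budget` and `compute_budget` are hypothetical, and it assumes that `current` means old active plus live new routes, that `new_live` means healthy plus unhealthy new routes, and that `available` means healthy new routes plus old active routes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical container for the two per-cycle decisions."""

    to_create: int
    to_terminate: int


def compute_budget(
    desired: int,
    max_surge: int,
    max_unavailable: int,
    *,
    old_active: int,
    new_healthy: int,
    new_unhealthy: int = 0,
) -> Budget:
    # Assumption: "new_live" counts new-revision routes that still occupy a slot.
    new_live = new_healthy + new_unhealthy
    # Assumption: "current" is every route that counts against the surge ceiling.
    current = old_active + new_live
    max_total = desired + max_surge
    min_available = desired - max_unavailable
    # Assumption: "available" follows the diagram's formula literally.
    available = new_healthy + old_active
    # Clamp at zero so an exhausted budget never yields a negative count.
    to_create = max(0, min(max_total - current, desired - new_live))
    to_terminate = max(0, min(available - min_available, old_active))
    return Budget(to_create=to_create, to_terminate=to_terminate)


# Cycle 0 of the example in this description: desired=3, max_surge=1, max_unavailable=1.
print(compute_budget(3, 1, 1, old_active=3, new_healthy=0))
```

With `max_unavailable=0` the same inputs give `to_terminate == 0`, so old routes are only drained after a surged new route becomes healthy.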

Cycle-by-Cycle Example (desired=3, max_surge=1, max_unavailable=1)

Cycle 0 (initial)          Cycle 1 (provisioning)     Cycle 2 (1 new healthy)
Old: [■ ■ ■]               Old: [■ ■]                 Old: [■ ■]
New: []                     New: [◇]                   New: [■]
→ create 1, terminate 1    → wait (PROVISIONING)      → create 1, terminate 1
        │                          │                          │
        ▼                          ▼                          ▼
Cycle 3 (provisioning)     Cycle 4 (2 new healthy)    Cycle 5 (provisioning)
Old: [■]                    Old: [■]                   Old: []
New: [■ ◇]                 New: [■ ■]                 New: [■ ■ ◇]
→ wait (PROVISIONING)      → create 1, terminate 1    → wait (PROVISIONING)
                                   │                          │
                                   ▼                          ▼
                            Cycle 6 (completed)
                            Old: []
                            New: [■ ■ ■]
                            → COMPLETED — revision swap + DEPLOYING → READY

Legend: ■ = healthy, ◇ = provisioning
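The seven cycles above can be reproduced with a toy simulation. It is a sketch under simplifying assumptions rather than the real evaluator: routes are reduced to counters, every provisioning route becomes healthy before the next cycle, nothing ever fails, and `simulate` is a hypothetical helper name.

```python
def simulate(desired: int, max_surge: int, max_unavailable: int) -> list[str]:
    """Replay rolling-update cycles with counters instead of real routes."""
    old, new_healthy, new_provisioning = desired, 0, 0
    log: list[str] = []
    for _ in range(100):  # safety bound so a stalled configuration cannot loop forever
        if new_provisioning > 0:
            # Step 2 of the FSM: wait while anything is still provisioning.
            log.append("wait")
            # Assumption: provisioning always succeeds before the next cycle.
            new_healthy += new_provisioning
            new_provisioning = 0
            continue
        if old == 0 and new_healthy >= desired:
            # Step 3 of the FSM: no old routes left and enough healthy new ones.
            log.append("completed")
            return log
        current = old + new_healthy
        available = old + new_healthy
        to_create = max(0, min(desired + max_surge - current, desired - new_healthy))
        to_terminate = max(0, min(available - (desired - max_unavailable), old))
        log.append(f"create {to_create}, terminate {to_terminate}")
        new_provisioning += to_create
        old -= to_terminate
    raise RuntimeError("rollout did not converge within 100 cycles")


for cycle, action in enumerate(simulate(desired=3, max_surge=1, max_unavailable=1)):
    print(f"Cycle {cycle}: {action}")
```

Calling `simulate(3, 0, 0)` raises `RuntimeError` in this sketch, because neither budget ever allows progress.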

Safety Guards

  • Zero-downtime protection: When max_unavailable < desired, never terminates ALL old routes until at least one new route is healthy
  • Deadlock prevention: RollingUpdateSpec validator ensures at least one of max_surge or max_unavailable is positive
  • Rollback detection: If all new routes are FAILED_TO_START or UNHEALTHY (none healthy, none provisioning), the FSM returns ROLLED_BACK
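The deadlock-prevention guard could look roughly like the following. The real RollingUpdateSpec lives in models/deployment_policy/row.py and its validator is not shown in this description, so this is a stand-in built on a plain dataclass; the class name `RollingUpdateSpecSketch` and the error messages are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpecSketch:
    """Hypothetical stand-in for RollingUpdateSpec, for illustration only."""

    max_surge: int
    max_unavailable: int

    def __post_init__(self) -> None:
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        # With both budgets at zero the FSM can neither create nor terminate
        # a route, so the rollout would stall forever.
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError("at least one of max_surge or max_unavailable must be positive")


spec = RollingUpdateSpecSketch(max_surge=1, max_unavailable=0)  # accepted
```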

Deploying Timeout Refactor

Previously, deploying timeout was handled by a separate DeployingTimeoutHandler registered under CHECK_DEPLOYING_TIMEOUT lifecycle type, running as an independent periodic task. This has been unified with the standard expired transition mechanism:

  • DeployingProvisioningHandler now declares an expired transition (→ DEPLOYING/ROLLING_BACK)
  • The coordinator checks skipped deployments for timeout using phase_started_at from scheduling history
  • phase_started_at is stable across retries (history records are merged via should_merge_with, incrementing attempts without changing created_at)
  • The deploying_started_at column and its migration have been removed entirely
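Why a merged history record gives a stable timeout anchor can be shown with a small sketch. `HistoryRecord`, `merged_with_retry`, and `is_expired` are hypothetical names, and the `>=` comparison is an assumption; only the idea (retries bump `attempts` and leave `created_at` alone) comes from the description above.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HistoryRecord:
    """Hypothetical scheduling-history row for one lifecycle phase."""

    phase: str
    created_at: datetime
    attempts: int = 1

    def merged_with_retry(self) -> "HistoryRecord":
        # A retry of the same phase is merged into the existing record:
        # attempts goes up, created_at is left untouched.
        return replace(self, attempts=self.attempts + 1)


def is_expired(record: HistoryRecord, timeout: timedelta, now: datetime) -> bool:
    # phase_started_at is taken to be the record's created_at (assumption).
    return now - record.created_at >= timeout


start = datetime(2026, 3, 12, 0, 0, tzinfo=timezone.utc)
record = HistoryRecord("DEPLOYING_PROVISIONING", created_at=start)
record = record.merged_with_retry().merged_with_retry()

print(record.attempts, record.created_at == start)
print(is_expired(record, timedelta(minutes=30), start + timedelta(minutes=31)))
```

Because the start time survives retries, a dedicated column such as deploying_started_at carries no extra information in this model.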

Key Types

| Type | Location | Purpose |
|---|---|---|
| StrategyCycleResult | strategy/types.py | Single deployment FSM result: sub_step + route_changes |
| RouteChanges | strategy/types.py | Route mutations: rollout_specs (Creator) + drain_route_ids |
| RollingUpdateSpec | models/deployment_policy/row.py | Config: max_surge, max_unavailable |
| AbstractDeploymentStrategy | strategy/types.py | Strategy interface that RollingUpdateStrategy implements |

Changed Files

| File | Change |
|---|---|
| strategy/rolling_update.py | Rolling update FSM implementation (stub → full) |
| handlers/deploying.py | Remove DeployingTimeoutHandler, add expired transition to provisioning handler |
| coordinator.py | Remove CHECK_DEPLOYING_TIMEOUT lifecycle, add skipped-timeout check, use phase_started_at uniformly |
| types.py | Remove CHECK_DEPLOYING_TIMEOUT enum member |
| data/deployment/types.py | Remove deploying_started_at field |
| models/endpoint/row.py | Remove deploying_started_at column |
| test_rolling_update.py | 54 unit tests across 18 test classes covering all FSM branches |

Test Coverage

| Test Class | Scenarios |
|---|---|
| TestBasicFSMStates | PROVISIONING, COMPLETED, ROLLED_BACK, PROGRESSING |
| TestMaxSurge | Surge budget limits, surge=0 |
| TestMaxUnavailable | Unavailability budget, unavailable=0 |
| TestCombinedSurgeAndUnavailable | Both parameters active |
| TestMultiCycleProgression | Multi-step rollout sequences |
| TestMixedRouteStatuses | UNHEALTHY + HEALTHY mixed states |
| TestTerminationPriority | Old route termination ordering |
| TestEdgeCases | Empty routes, desired=0, no deploying revision |
| TestRouteCreatorSpecs | Creator spec correctness |
| TestRealisticScenario | Full 3-replica rolling update simulation |
| TestDeadlockAndStall | surge=0/unavailable=0 deadlock prevention |
| TestDesiredReplicaCount | Various replica counts (1, 5, 10) |
| TestScaleDownDuringRollingUpdate | Scale-down during active rollout |
| TestConcurrentOperations | Multiple revision edge cases |

Related

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--9997.org.readthedocs.build/en/9997/


📚 Documentation preview 📚: https://sorna-ko--9997.org.readthedocs.build/ko/9997/

github-actions bot added the size:XL (500~ LoC) and comp:manager (Related to Manager component) labels on Mar 12, 2026
jopemachine added this to the 26.3 milestone on Mar 12, 2026
github-actions bot added the comp:common (Related to Common component) and require:db-migration (Automatically set when alembic migrations are added or updated) labels on Mar 16, 2026
jopemachine modified the milestones: 26.3, 26.4 on Mar 17, 2026
jopemachine force-pushed the BA-3435_3 branch 2 times, most recently from d7a7761 to 10ddb6b, on March 18, 2026 05:21
jopemachine removed the require:db-migration (Automatically set when alembic migrations are added or updated) label on Mar 18, 2026
jopemachine force-pushed the BA-3435_3 branch 3 times, most recently from 5939e01 to d296163, on March 19, 2026 09:16
jopemachine marked this pull request as ready for review on March 20, 2026 06:02
jopemachine requested review from a team, HyeockJinKim and Copilot on March 20, 2026 06:02

Copilot AI left a comment

Pull request overview

This PR adds a Rolling Update deployment strategy evaluator (pure-function FSM) and refactors DEPLOYING timeout handling to use the coordinator’s standard expired transition mechanism, while also renaming/standardizing deployment sub-step typing and exposing the sub-step via the deployment API.

Changes:

  • Implement RollingUpdateStrategy.evaluate_cycle() with surge/unavailability budgeting and route mutation outputs.
  • Replace/standardize deployment sub-step handling with DeploymentLifecycleSubStep across coordinator/handlers/repos and add skipped-timeout checks that drive expired → DEPLOYING/ROLLING_BACK.
  • Add fallback revision-spec loading from endpoint-level fields and surface sub_step in deployment DTO/API.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| tests/unit/manager/sokovan/deployment/strategy/test_rolling_update.py | New unit tests covering rolling update FSM outcomes and budgeting. |
| tests/unit/manager/sokovan/deployment/strategy/test_applier.py | Update tests for new sub-step enum and completed detection; adjust fixtures. |
| tests/unit/manager/sokovan/deployment/executor/conftest.py | Extend repo mock to support new get_revision_spec_from_endpoint() path. |
| src/ai/backend/manager/sokovan/deployment/strategy/types.py | Update strategy result types to use DeploymentLifecycleSubStep. |
| src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py | Implement rolling update route classification + create/drain decisions. |
| src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py | Adjust bulk route fetching conditions (incl. terminated new-revision routes). |
| src/ai/backend/manager/sokovan/deployment/strategy/applier.py | Update completed detection to DEPLOYING_COMPLETED; simplify applier surface. |
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Add expired transition for provisioning; refactor rolling-back cleanup to repo. |
| src/ai/backend/manager/sokovan/deployment/handlers/base.py | Docstring alignment to new sub-step naming. |
| src/ai/backend/manager/sokovan/deployment/executor.py | Use endpoint-level revision spec fallback when no current revision exists. |
| src/ai/backend/manager/sokovan/deployment/deployment_controller.py | Update controller API to accept DeploymentLifecycleSubStep. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Wire sub-step filtering, add skipped-timeout expiration handling, update task specs. |
| src/ai/backend/manager/services/deployment/service.py | Include sub_step in deployment data conversion; update lifecycle marking callsites. |
| src/ai/backend/manager/repositories/deployment/repository.py | Add sub_steps filtering to handler fetch; add get_revision_spec_from_endpoint(). |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Implement sub-step filtering + endpoint-based revision spec builder query. |
| src/ai/backend/manager/repositories/deployment/creators/deployment.py | Update lifecycle batch updater spec to use DeploymentLifecycleSubStep. |
| src/ai/backend/manager/models/endpoint/row.py | Switch sub-step column type + add build_revision_spec_from_endpoint() helper. |
| src/ai/backend/manager/event_dispatcher/handlers/schedule.py | Decode deployment sub-step using new enum type. |
| src/ai/backend/manager/data/deployment/types.py | Introduce DeploymentLifecycleSubStep; add RouteStatus.is_provisioning(). |
| src/ai/backend/manager/api/rest/deployment/adapter.py | Map sub_step into REST DTO conversion. |
| src/ai/backend/common/dto/manager/deployment/response.py | Add sub_step field to deployment response DTO. |
| proposals/BEP-1049/rolling-update.md | Update BEP doc to match new handler/timeout flow and FSM semantics. |
| proposals/BEP-1049-deployment-strategy-handler.md | Update design doc to reflect 2 DEPLOYING handlers + skipped-timeout expiry behavior. |
| changes/9997.feature.md | Changelog entry for rolling update strategy. |


new_healthy=2, old=3 → can_terminate=2, old=3 → 2 (budget-limited)
new_healthy=4, old=1 → can_terminate=2, old=1 → 1 (old-count-limited)
"""
available_count = classified.new_healthy_count + len(classified.old_active)

Copilot AI Mar 20, 2026

_compute_routes_to_terminate() treats every old_active route as "available" by using len(classified.old_active), but old_active includes PROVISIONING/UNHEALTHY/DEGRADED routes because RouteStatus.is_active() returns true for them. This can overestimate available_count and allow terminating additional old routes beyond the min_available budget (potential downtime). Consider tracking/counting healthy old routes separately (or filtering old_active by status == HEALTHY) when computing available_count and can_terminate, while still keeping the full old_active list for termination ordering.

Suggested change
    -available_count = classified.new_healthy_count + len(classified.old_active)
    +old_healthy_count = sum(
    +    1 for route in classified.old_active if route.status == RouteStatus.HEALTHY
    +)
    +available_count = classified.new_healthy_count + old_healthy_count
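A small worked case shows how the two counting rules diverge. This is an illustrative sketch only: `Status` and `can_terminate` are hypothetical stand-ins, the numbers are made up, and the formula is the `min(available - min_available, old_active)` expression from the PR description.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical subset of RouteStatus, for illustration only."""

    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


def can_terminate(available: int, desired: int, max_unavailable: int, old_active: int) -> int:
    min_available = desired - max_unavailable
    return max(0, min(available - min_available, old_active))


desired, max_unavailable = 3, 1
new_healthy = 1
# Three old routes are "active", but only one of them is actually healthy.
old_statuses = [Status.HEALTHY, Status.UNHEALTHY, Status.UNHEALTHY]

# Current behavior: every old active route counts as available.
naive_available = new_healthy + len(old_statuses)
# Suggested behavior: only healthy old routes count as available.
strict_available = new_healthy + sum(1 for s in old_statuses if s is Status.HEALTHY)

naive = can_terminate(naive_available, desired, max_unavailable, len(old_statuses))
strict = can_terminate(strict_available, desired, max_unavailable, len(old_statuses))
print(naive, strict)  # 2 0
```

In this case the naive count permits two terminations even though only two routes are truly healthy, which is already the minimum allowed (`desired - max_unavailable = 2`). If the healthy old route is among those drained, availability falls below the budget.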

Comment on lines +89 to +107
    # Fetch non-terminated routes + terminated routes belonging to a
    # deploying revision. The FSM needs terminated new-revision routes
    # to count accumulated failures for rollback detection, but old
    # terminated routes are irrelevant and would bloat the result set.
    deploying_revision_ids = {
        deployment.deploying_revision_id
        for deployment in deployments
        if deployment.deploying_revision_id is not None
    }
    route_conditions: list[QueryCondition] = [
        RouteConditions.by_endpoint_ids(endpoint_ids),
    ]
    if deploying_revision_ids:
        route_conditions.append(
            combine_conditions_or([
                RouteConditions.exclude_statuses([RouteStatus.TERMINATED]),
                RouteConditions.by_revision_ids(deploying_revision_ids),
            ])
        )

Copilot AI Mar 20, 2026

The comment and query logic here say terminated new-revision routes are needed for "rollback detection", but RollingUpdateStrategy currently never uses new_failed_count for any decision (it only logs it) and the tests/docs state rollback is handled by coordinator timeout. If rollback detection is no longer part of the FSM, consider simplifying the route query back to excluding TERMINATED routes (or updating the comment to reflect the real reason for including terminated routes) to avoid extra result-set bloat and confusion.
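The effect of the OR condition under review can be mimicked with an in-memory filter. This sketch does not use the repository's QueryCondition machinery; `Route` and `keep_route` are hypothetical, and only the branch where `deploying_revision_ids` is non-empty is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical minimal route record, for illustration only."""

    revision_id: str
    status: str


def keep_route(route: Route, deploying_revision_ids: set[str]) -> bool:
    # Mirrors: exclude_statuses([TERMINATED]) OR by_revision_ids(deploying_revision_ids)
    return route.status != "TERMINATED" or route.revision_id in deploying_revision_ids


routes = [
    Route("old", "HEALTHY"),
    Route("old", "TERMINATED"),  # dropped: terminated and not a deploying revision
    Route("new", "TERMINATED"),  # kept: terminated, but belongs to the deploying revision
    Route("new", "HEALTHY"),
]
kept = [r for r in routes if keep_route(r, {"new"})]
print(kept)
```

Reverting to a plain "exclude TERMINATED" filter, as the comment suggests, would drop the third route as well.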

Comment on lines +554 to +566
    def test_only_failed_new_no_old_rolls_back(self) -> None:
        """Only failed new routes, no old → PROVISIONING (retries creation)."""
        deployment = make_deployment(desired=2)
        spec = RollingUpdateSpec(max_surge=1, max_unavailable=0)
        routes = [
            make_route(revision_id=NEW_REV, status=RouteStatus.FAILED_TO_START),
            make_route(revision_id=NEW_REV, status=RouteStatus.FAILED_TO_START),
        ]

        result = RollingUpdateStrategy(spec).evaluate_cycle(deployment, routes)

        assert result.sub_step == DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING

Copilot AI Mar 20, 2026

The test name test_only_failed_new_no_old_rolls_back is misleading: the assertion expects DEPLOYING_PROVISIONING, and the docstring also says it stays in PROVISIONING. Consider renaming the test (and/or updating the docstring) to reflect the actual behavior (retry/wait rather than rollback).

Comment on lines 90 to +113
     @pytest.fixture
    -def mixed_summary() -> tuple[StrategyEvaluationSummary, UUID, UUID]:
    +def rolled_back_summary() -> tuple[StrategyEvaluationSummary, set[UUID]]:
    +    ep_id = uuid4()
    +    summary = _build_summary({ep_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING})
    +    return summary, {ep_id}
    +
    +
    +@pytest.fixture
    +def mixed_summary() -> tuple[StrategyEvaluationSummary, UUID, UUID, UUID]:
         provisioning_id = uuid4()
         completed_id = uuid4()
    +    rolled_back_id = uuid4()
         summary = _build_summary(
             {
    -            provisioning_id: DeploymentSubStep.PROVISIONING,
    -            completed_id: DeploymentSubStep.COMPLETED,
    +            provisioning_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING,
    +            completed_id: DeploymentLifecycleSubStep.DEPLOYING_COMPLETED,
    +            rolled_back_id: DeploymentLifecycleSubStep.DEPLOYING_PROVISIONING,
             },
             route_changes=RouteChanges(
                 rollout_specs=[MagicMock()],
                 drain_route_ids=[uuid4()],
             ),
         )
    -    return summary, provisioning_id, completed_id
    +    return summary, provisioning_id, completed_id, rolled_back_id

Copilot AI Mar 20, 2026

rolled_back_summary fixture is both unused in this test module and misleadingly named (it assigns DEPLOYING_PROVISIONING). Consider removing it (and the extra rolled_back_id in mixed_summary if it isn't needed) or renaming it to match the actual sub_step being tested to keep the applier tests focused and clear.

jopemachine changed the base branch from main to refactor/flatten-deployment-lifecycle-sub-step on March 20, 2026 07:09
jopemachine force-pushed the BA-3435_3 branch 7 times, most recently from 0d5fc37 to b94732d, on March 20, 2026 09:38
Base automatically changed from refactor/flatten-deployment-lifecycle-sub-step to main on March 23, 2026 01:52
Comment thread on src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py (outdated)
jopemachine force-pushed the BA-3435_3 branch 2 times, most recently from de3fb63 to 89391b8, on March 23, 2026 05:56
github-actions bot added the area:docs (Documentations) label on Mar 23, 2026
jopemachine and others added 15 commits March 23, 2026 16:39
…rolling update PR

Move non-rolling-update-evaluator changes to the base refactoring PR:
- Coordinator: sub_step filtering, expired transition for skipped deployments
- Deploying handlers: expired transition, rolling_back post_process
- Executor: route creation refactoring
- Repository/DB source: sub_steps filter parameter
- Strategy applier: remove clear_deploying_revision (moved to repo)
- Strategy types: docstring updates
- BEP-1049 proposal updates
- Test fixtures updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sses/failures/skipped)

Replace the 4-field result (successes, errors, skipped, need_retry) with
the 3-field pattern used by session coordinator: successes, failures, skipped.

Handlers now report all non-success outcomes as failures (DeploymentExecutionError).
The coordinator classifies failures into need_retry/expired/give_up based on
retry count and timeout policy, matching the session side's approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HyeockJinKim enabled auto-merge (squash) on March 23, 2026 07:41
HyeockJinKim merged commit a8dfe30 into main on Mar 23, 2026 (32 of 33 checks passed)
HyeockJinKim deleted the BA-3435_3 branch on March 23, 2026 07:50

Labels

area:docs Documentations comp:common Related to Common component comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Rolling Update deployment strategy

3 participants