Merged
Changes from 5 commits
1 change: 1 addition & 0 deletions changes/10355.enhance.md
@@ -0,0 +1 @@
Flatten deployment sub-step enum and prepare deploying infrastructure
156 changes: 60 additions & 96 deletions proposals/BEP-1049-deployment-strategy-handler.md

Large diffs are not rendered by default.

92 changes: 70 additions & 22 deletions proposals/BEP-1049/rolling-update.md
@@ -28,7 +28,7 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep

```
┌──────────────────────────────────────┐
│ Any New routes PROVISIONING? │──Yes──→ provisioning
│ Any New routes PROVISIONING? │──Yes──→ provisioning (wait)
└──────────────────┬───────────────────┘
No
@@ -53,15 +53,38 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
progressing
```

### Sub-Step Variants
Rollback is **not** decided by the FSM itself. If all new routes fail, the FSM will keep attempting to create new routes via the surge/unavailable calculation. Eventually the DEPLOYING timeout (30 min) is exceeded and the coordinator transitions the deployment to ROLLING_BACK via the `expired` path.

Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)` — the coordinator handles revision swap and READY transition directly.
### Route Classification

| Sub-Step | Condition | Handler Action |
|----------|-----------|----------------|
| **provisioning** | New routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | Calculated surge/unavailable, created/terminated routes | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** (`completed=True`) | No Old routes and New healthy >= desired_replicas | Coordinator → atomic revision swap + DEPLOYING→READY |
Routes are classified by revision and status:

| Category | Condition | Description |
|----------|-----------|-------------|
| `old_active` | revision != deploying_revision, is_active() | Old routes currently serving traffic |
| `new_provisioning` | revision == deploying_revision, PROVISIONING | New routes being created |
| `new_healthy` | revision == deploying_revision, HEALTHY | New routes ready to serve |
| `new_unhealthy` | revision == deploying_revision, UNHEALTHY/DEGRADED | New routes with issues |
| `new_failed` | revision == deploying_revision, FAILED/TERMINATED | New routes that failed |

### Handler Flow

All DEPLOYING deployments are handled by `DeployingProvisioningHandler`, which stays in the PROVISIONING sub-step throughout the entire deployment lifecycle. The handler runs the strategy evaluator each cycle:

| Result | Condition | Handler Action |
|--------|-----------|----------------|
| **success** | Evaluator returns COMPLETED (no Old routes, New healthy >= desired) | Coordinator transitions to READY |
| **need_retry** | Route mutations executed (create/drain) | Stays in DEPLOYING/PROVISIONING, history recorded |
| **skipped** | No changes — routes still provisioning or waiting | No transition; coordinator checks for timeout |
| **expired** | Skipped deployment exceeds DEPLOYING timeout (30 min) | Coordinator transitions to DEPLOYING/ROLLING_BACK |

When a deployment transitions to ROLLING_BACK, the `DeployingRollingBackHandler` clears `deploying_revision` and transitions directly to READY.
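
As a rough illustration of the result table above, the four outcomes can be derived from three inputs. The sketch below is an assumption-laden simplification: `CycleOutcome` and `resolve_outcome` are hypothetical names, and it folds the handler's decision (success / need_retry / skipped) and the coordinator's timeout check (expired) into one function, whereas the proposal keeps them in separate components.

```python
from enum import Enum

DEPLOYING_TIMEOUT_SEC = 30 * 60  # 30 min, as stated in the table above


class CycleOutcome(Enum):
    SUCCESS = "success"
    NEED_RETRY = "need_retry"
    SKIPPED = "skipped"
    EXPIRED = "expired"


def resolve_outcome(completed: bool, mutated: bool, elapsed_sec: float) -> CycleOutcome:
    """Map one evaluation cycle to an outcome (hypothetical helper)."""
    if completed:
        # No old routes remain and enough new routes are healthy.
        return CycleOutcome.SUCCESS
    if mutated:
        # Routes were created or drained in this cycle.
        return CycleOutcome.NEED_RETRY
    if elapsed_sec > DEPLOYING_TIMEOUT_SEC:
        # Only skipped deployments are checked against the timeout.
        return CycleOutcome.EXPIRED
    return CycleOutcome.SKIPPED
```

Note that in this sketch a cycle that mutates routes is never reported as expired, which mirrors the table: the timeout applies to skipped deployments.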

### Safety Guards

- **Zero-downtime protection**: When `max_unavailable < desired`, never terminates ALL old routes until at least one new route is healthy
- **Deadlock prevention**: `RollingUpdateSpec` validator ensures at least one of `max_surge` or `max_unavailable` is positive
- **Timeout-based rollback**: The FSM does not detect failure — the coordinator's timeout mechanism handles it. If the deployment cannot complete within the DEPLOYING timeout (30 min), the coordinator transitions to ROLLING_BACK via the `expired` path
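
The first two guards can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `RollingUpdateSpecSketch` and `guard_termination` are hypothetical names, a plain dataclass stands in for the real `RollingUpdateSpec` validator, and "never terminates ALL old routes" is interpreted as "keep at least one old route".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpecSketch:
    """Hypothetical stand-in for RollingUpdateSpec (deadlock prevention)."""

    max_surge: int = 1
    max_unavailable: int = 0

    def __post_init__(self) -> None:
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        if self.max_surge == 0 and self.max_unavailable == 0:
            # With both at zero, nothing could be created or terminated.
            raise ValueError("at least one of max_surge / max_unavailable must be positive")


def guard_termination(
    planned: int, old_active: int, new_healthy: int, desired: int, max_unavailable: int
) -> int:
    """Cap the number of old routes to terminate (zero-downtime protection)."""
    if max_unavailable < desired and new_healthy == 0:
        # Assumed interpretation: leave at least one old route in place.
        return min(planned, max(old_active - 1, 0))
    return planned
```

For example, with three old routes and no healthy new route, a plan to terminate all three is reduced to two; once one new route is healthy the plan passes through unchanged.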

## max_surge / max_unavailable Calculation

@@ -107,6 +130,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
@@ -115,7 +139,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ Old: [■ ■] (2 healthy) │
│ New: [◇] (1 provisioning) │
│ │
│ → PROVISIONING exists → wait │
│ → PROVISIONING exists → skipped (wait) │
└─────────────────────────────────────────────────────┘
@@ -129,6 +153,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
@@ -137,7 +162,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ Old: [■] (1 healthy) │
│ New: [■ ◇] (1 healthy, 1 provisioning) │
│ │
│ → PROVISIONING exists → wait │
│ → PROVISIONING exists → skipped (wait) │
└─────────────────────────────────────────────────────┘
@@ -151,6 +176,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
@@ -159,7 +185,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ Old: [] │
│ New: [■ ■ ◇] (2 healthy, 1 provisioning) │
│ │
│ → PROVISIONING exists → wait │
│ → PROVISIONING exists → skipped (wait) │
└─────────────────────────────────────────────────────┘
@@ -170,12 +196,25 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ │
│ No Old and New >= desired_replicas → completed │
│ → deploying_revision → current_revision swap │
│ → DEPLOYING → READY state transition │
│ → success → coordinator transitions to READY │
└─────────────────────────────────────────────────────┘

Legend: ■ = healthy, ◇ = provisioning
```
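
The per-cycle arithmetic in the walkthrough above can be reproduced with a short sketch. The real calculation lives in a collapsed part of this diff, so the formula below is reconstructed only from the worked example (`healthy=3, min_available=2 → can_terminate=1`); `plan_cycle` and its parameters are hypothetical names.

```python
def plan_cycle(
    desired: int,
    max_surge: int,
    max_unavailable: int,
    old_healthy: int,
    new_healthy: int,
    new_provisioning: int,
) -> tuple[int, int]:
    """Return (routes_to_create, old_routes_to_terminate) for one cycle."""
    if new_provisioning > 0:
        # New routes are still coming up: wait, change nothing.
        return (0, 0)

    total = old_healthy + new_healthy
    max_total = desired + max_surge
    # Create up to the surge limit, but never more new routes than desired.
    to_create = max(0, min(max_total - total, desired - new_healthy))

    min_available = max(0, desired - max_unavailable)
    # Terminate only what keeps healthy routes at or above min_available.
    can_terminate = max(0, min(total - min_available, old_healthy))
    return (to_create, can_terminate)


# Replaying the example (desired=3, max_surge=1, max_unavailable=1):
print(plan_cycle(3, 1, 1, old_healthy=3, new_healthy=0, new_provisioning=0))
print(plan_cycle(3, 1, 1, old_healthy=2, new_healthy=0, new_provisioning=1))
```

Under these assumptions the first call yields one creation and one termination, and the second yields no changes because a new route is still provisioning.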

## Timeout and Rollback

Deploying timeout is handled through the coordinator's generic `expired` transition mechanism:

1. `DeployingProvisioningHandler` declares `expired → DEPLOYING/ROLLING_BACK` in `status_transitions()`
2. Each cycle, the coordinator checks `result.skipped` deployments against the DEPLOYING timeout (30 min)
3. Timeout is measured using `phase_started_at` from `DeploymentWithHistory` — the `created_at` of the first scheduling history record for this handler phase
4. `phase_started_at` is stable across retries: history records with the same phase/error_code/to_status are merged (only `attempts` is incremented; `created_at` is unchanged)
5. Timed-out deployments transition to DEPLOYING/ROLLING_BACK
6. `DeployingRollingBackHandler` clears `deploying_revision` and transitions to READY

No separate timeout handler or periodic task is needed — timeout checking is built into the coordinator's standard transition handling.

## Component Structure

```
@@ -209,7 +248,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ │ old_active: old + is_active() │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Route changes returned (applied by coordinator): │
│ Route changes returned (applied by applier): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ rollout_specs: RouteCreatorSpec( │ │
│ │ revision_id = deploying_revision, │ │
@@ -223,14 +262,18 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
┌──────────────────────────────────────────────────────────────┐
│ Per-Sub-Step Handlers (coordinator generic path) │
│ DeployingProvisioningHandler │
│ (single handler for entire DEPLOYING lifecycle) │
│ │
│ PROVISIONING → DeployingProvisioningHandler │
│ next_status: DEPLOYING → coordinator records history │
│ completed → success → coordinator transitions to READY │
│ route mutations → need_retry → stays in PROVISIONING │
│ no changes → skipped → coordinator checks timeout │
│ evaluation errors → errors → classified by coordinator │
│ │
│ PROGRESSING → DeployingProgressingHandler │
│ next_status: DEPLOYING → coordinator records history │
│ completed=True → coordinator atomic revision swap + READY │
│ DeployingRollingBackHandler │
│ (cleanup on timeout) │
│ │
│ clear deploying_revision → success → READY │
└──────────────────────────────────────────────────────────────┘
```

@@ -242,11 +285,16 @@ When all Old routes are removed and New routes reach desired_replicas or above a
completed determination (evaluator)
Coordinator._transition_completed_deployments()
StrategyResultApplier.apply()
→ Atomic transaction:
1. complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
2. DEPLOYING → READY lifecycle transition
3. History recording
2. Returns completed_ids in StrategyApplyResult
DeployingProvisioningHandler
→ completed_ids → successes
→ coordinator transitions DEPLOYING → READY
→ History recording
```
3 changes: 3 additions & 0 deletions src/ai/backend/common/dto/manager/deployment/response.py
@@ -134,6 +134,9 @@ class DeploymentDTO(BaseModel):
    deployment_policy: DeploymentPolicyDTO | None = Field(
        default=None, description="Deployment rollout policy"
    )
    sub_step: str | None = Field(
        default=None, description="Current deployment sub-step (e.g. provisioning, rolling_back)"
    )


class CreateDeploymentResponse(BaseResponseModel):
1 change: 1 addition & 0 deletions src/ai/backend/manager/api/rest/deployment/adapter.py
@@ -141,6 +141,7 @@ def convert_to_dto(self, data: ModelDeploymentData) -> DeploymentDTO:
            default_deployment_strategy=data.default_deployment_strategy,
            current_revision=current_revision,
            deployment_policy=deployment_policy,
            sub_step=data.sub_step,
        )

    def build_querier(self, request: SearchDeploymentsRequest) -> BatchQuerier:
48 changes: 25 additions & 23 deletions src/ai/backend/manager/data/deployment/types.py
@@ -121,6 +121,10 @@ def is_active(self) -> bool:
    def is_inactive(self) -> bool:
        return self in self.inactive_route_statuses()

    def is_provisioning(self) -> bool:
        """PROVISIONING or DEGRADED (still warming up, health checks not yet passing)."""
        return self in (RouteStatus.PROVISIONING, RouteStatus.DEGRADED)

    def termination_priority(self) -> int:
        priority_map = {
            RouteStatus.UNHEALTHY: 1,
@@ -148,45 +152,42 @@ class RouteTrafficStatus(enum.StrEnum):
# ========== Status Transition Types (BEP-1030) ==========


class DeploymentSubStatus(enum.StrEnum):
    """Base class for deployment lifecycle sub-statuses.
class DeploymentLifecycleSubStep(enum.StrEnum):
    """Sub-steps within deployment lifecycle phases.

    Each lifecycle type can define its own sub-status enum by
    inheriting from this class. For example, DEPLOYING handlers
    use ``DeploymentSubStep`` (provisioning, rolling_back, …).
    Member names are prefixed with the lifecycle phase they belong to
    (e.g. ``DEPLOYING_``). String values are stored in the database as-is.
    """

    # -- DEPLOYING phase --
    DEPLOYING_PROVISIONING = "provisioning"
    """New revision routes are being provisioned and old routes are being drained."""
    DEPLOYING_ROLLING_BACK = "rolling_back"
    """Clearing deploying_revision and transitioning to READY."""
    DEPLOYING_COMPLETED = "completed"
    """All strategy conditions satisfied; triggers revision swap."""

class DeploymentSubStep(DeploymentSubStatus):
    """Sub-steps for the DEPLOYING lifecycle phase.

    - PROVISIONING: New revision routes are being provisioned and old routes
      are being drained. The main handler for rolling updates.
    - ROLLING_BACK: Clearing deploying_revision and transitioning to READY.
    - COMPLETED: All strategy conditions satisfied; triggers revision swap.
    """

    PROVISIONING = "provisioning"
    ROLLING_BACK = "rolling_back"
    COMPLETED = "completed"
    @classmethod
    def deploying_handler_sub_steps(cls) -> tuple[DeploymentLifecycleSubStep, ...]:
        """Sub-steps that have their own deploying handler (excludes COMPLETED, which is an evaluator outcome)."""
        return (cls.DEPLOYING_PROVISIONING, cls.DEPLOYING_ROLLING_BACK)


@dataclass(frozen=True)
class DeploymentLifecycleStatus:
    """Target lifecycle state for a deployment status transition.

    Pairs an EndpointLifecycle with an optional sub-status to provide
    Pairs an EndpointLifecycle with an optional sub-step to provide
    context about which sub-step led to this transition.

    Attributes:
        lifecycle: The target endpoint lifecycle state
        sub_status: Optional sub-status indicating what determined this
            transition. Concrete values come from DeploymentSubStatus
            subclasses (e.g. DeploymentSubStep for DEPLOYING handlers).
        sub_step: Optional sub-step indicating what determined this
            transition (e.g. DEPLOYING_* members for DEPLOYING handlers).
    """

    lifecycle: EndpointLifecycle
    sub_status: DeploymentSubStatus | None = None
    sub_step: DeploymentLifecycleSubStep | None = None


@dataclass(frozen=True)
@@ -376,7 +377,7 @@ class DeploymentInfo:
    current_revision_id: UUID | None = None
    policy: DeploymentPolicyData | None = None
    deploying_revision_id: UUID | None = None
    sub_step: DeploymentSubStep | None = None
    sub_step: DeploymentLifecycleSubStep | None = None

    def resolve_revision_spec(self, revision_id: UUID) -> ModelRevisionSpec | None:
        """Find a ModelRevisionSpec by revision_id from model_revisions."""
@@ -569,6 +570,7 @@ class ModelDeploymentData:
    created_user_id: UUID
    policy: DeploymentPolicyData | None = None
    access_token_ids: list[UUID] | None = None
    sub_step: DeploymentLifecycleSubStep | None = None


class DeploymentOrderField(enum.StrEnum):
Original file line number Diff line number Diff line change
@@ -17,7 +17,7 @@
from ai.backend.common.events.hub.hub import EventHub
from ai.backend.common.types import AgentId
from ai.backend.logging.utils import BraceStyleAdapter
from ai.backend.manager.data.deployment.types import DeploymentSubStep
from ai.backend.manager.data.deployment.types import DeploymentLifecycleSubStep
from ai.backend.manager.scheduler.types import ScheduleType
from ai.backend.manager.sokovan.deployment.coordinator import DeploymentCoordinator
from ai.backend.manager.sokovan.deployment.route.coordinator import RouteCoordinator
@@ -93,15 +93,15 @@ async def handle_do_deployment_lifecycle_if_needed(
    ) -> None:
        """Handle deployment lifecycle if needed event (checks marks)."""
        lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
        sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
        sub_step = DeploymentLifecycleSubStep(ev.sub_step) if ev.sub_step is not None else None
        await self._deployment_coordinator.process_if_needed(lifecycle_type, sub_step)

    async def handle_do_deployment_lifecycle(
        self, _context: None, _agent_id: str, ev: DoDeploymentLifecycleEvent
    ) -> None:
        """Handle deployment lifecycle event (unconditional)."""
        lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
        sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
        sub_step = DeploymentLifecycleSubStep(ev.sub_step) if ev.sub_step is not None else None
        await self._deployment_coordinator.process_deployment_lifecycle(lifecycle_type, sub_step)

    async def handle_do_route_lifecycle_if_needed(