1 change: 1 addition & 0 deletions changes/10355.enhance.md
@@ -0,0 +1 @@
Flatten deployment sub-step enum and prepare deploying infrastructure
156 changes: 60 additions & 96 deletions proposals/BEP-1049-deployment-strategy-handler.md

Large diffs are not rendered by default.

92 changes: 70 additions & 22 deletions proposals/BEP-1049/rolling-update.md
@@ -28,7 +28,7 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep

```
┌──────────────────────────────────────┐
│ Any New routes PROVISIONING? │──Yes──→ provisioning
│ Any New routes PROVISIONING? │──Yes──→ provisioning (wait)
└──────────────────┬───────────────────┘
No
@@ -53,15 +53,38 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep
progressing
```

### Sub-Step Variants
Rollback is **not** decided by the FSM itself. If all new routes fail, the FSM will keep attempting to create new routes via the surge/unavailable calculation. Eventually the DEPLOYING timeout (30 min) is exceeded and the coordinator transitions the deployment to ROLLING_BACK via the `expired` path.

Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)` — the coordinator handles revision swap and READY transition directly.
### Route Classification

| Sub-Step | Condition | Handler Action |
|----------|-----------|----------------|
| **provisioning** | New routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | Calculated surge/unavailable, created/terminated routes | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** (`completed=True`) | No Old routes and New healthy >= desired_replicas | Coordinator → atomic revision swap + DEPLOYING→READY |
Routes are classified by revision and status:

| Category | Condition | Description |
|----------|-----------|-------------|
| `old_active` | revision != deploying_revision, is_active() | Old routes currently serving traffic |
| `new_provisioning` | revision == deploying_revision, PROVISIONING | New routes being created |
| `new_healthy` | revision == deploying_revision, HEALTHY | New routes ready to serve |
| `new_unhealthy` | revision == deploying_revision, UNHEALTHY/DEGRADED | New routes with issues |
| `new_failed` | revision == deploying_revision, FAILED/TERMINATED | New routes that failed |
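
The classification above can be sketched as a single pass over the routes. This is a minimal illustration, not the evaluator's code: `Route`, `classify_routes`, and the local `RouteStatus` enum are hypothetical stand-ins, and the set of statuses treated as active for old routes is an assumption, since the table only refers to `is_active()`.

```python
import enum
from dataclasses import dataclass
from typing import Dict, List


class RouteStatus(enum.Enum):
    """Hypothetical stand-in for the project's RouteStatus enum."""

    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED = "failed"
    TERMINATED = "terminated"


# Assumption: statuses that count as "active" for old routes.
ASSUMED_ACTIVE = frozenset({
    RouteStatus.PROVISIONING,
    RouteStatus.HEALTHY,
    RouteStatus.UNHEALTHY,
    RouteStatus.DEGRADED,
})

# New-revision routes map to a category by status alone, following the table.
NEW_CATEGORY = {
    RouteStatus.PROVISIONING: "new_provisioning",
    RouteStatus.HEALTHY: "new_healthy",
    RouteStatus.UNHEALTHY: "new_unhealthy",
    RouteStatus.DEGRADED: "new_unhealthy",
    RouteStatus.FAILED: "new_failed",
    RouteStatus.TERMINATED: "new_failed",
}


@dataclass(frozen=True)
class Route:
    """Hypothetical route record: a revision id and a status."""

    revision: str
    status: RouteStatus


def classify_routes(routes: List[Route], deploying_revision: str) -> Dict[str, List[Route]]:
    """Bucket routes into the five categories from the table."""
    buckets: Dict[str, List[Route]] = {
        "old_active": [],
        "new_provisioning": [],
        "new_healthy": [],
        "new_unhealthy": [],
        "new_failed": [],
    }
    for route in routes:
        if route.revision == deploying_revision:
            buckets[NEW_CATEGORY[route.status]].append(route)
        elif route.status in ASSUMED_ACTIVE:
            buckets["old_active"].append(route)
        # Inactive old routes belong to no category and are ignored.
    return buckets
```

Keeping the status-to-category mapping in one dictionary makes it easy to compare against the table row by row.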

### Handler Flow

All DEPLOYING deployments are handled by `DeployingProvisioningHandler`, which stays in the PROVISIONING sub-step throughout the entire deployment lifecycle. The handler runs the strategy evaluator each cycle:

| Result | Condition | Handler Action |
|--------|-----------|----------------|
| **success** | Evaluator returns COMPLETED (no Old routes, New healthy >= desired) | Coordinator transitions to READY |
| **need_retry** | Route mutations executed (create/drain) | Stays in DEPLOYING/PROVISIONING, history recorded |
| **skipped** | No changes — routes still provisioning or waiting | No transition; coordinator checks for timeout |
| **expired** | Skipped deployment exceeds DEPLOYING timeout (30 min) | Coordinator transitions to DEPLOYING/ROLLING_BACK |

When a deployment transitions to ROLLING_BACK, the `DeployingRollingBackHandler` clears `deploying_revision` and transitions directly to READY.
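
The result table above can be read as a two-step decision: the handler reports what happened in the cycle, and the coordinator then decides whether a skipped deployment has expired. A rough sketch, assuming hypothetical names (`classify_cycle`, `TRANSITIONS`) and plain strings in place of the project's result and status types:

```python
DEPLOYING_TIMEOUT_SECONDS = 30 * 60  # 30 min, as stated in the table

# Result -> (lifecycle, sub_step); None means no transition is made.
TRANSITIONS = {
    "success": ("READY", None),
    "need_retry": ("DEPLOYING", "PROVISIONING"),
    "skipped": None,
    "expired": ("DEPLOYING", "ROLLING_BACK"),
}


def classify_cycle(completed: bool, mutated: bool, elapsed_seconds: float) -> str:
    """Map one evaluation cycle to a result name from the table.

    Only cycles without completion or route mutations are checked against
    the timeout, mirroring "skipped deployment exceeds DEPLOYING timeout".
    """
    if completed:
        return "success"
    if mutated:
        return "need_retry"
    if elapsed_seconds > DEPLOYING_TIMEOUT_SECONDS:
        return "expired"
    return "skipped"
```

In this reading a deployment that keeps mutating routes is never expired in the same cycle; it can only time out on a cycle where nothing changed.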

### Safety Guards

- **Zero-downtime protection**: When `max_unavailable < desired`, never terminates ALL old routes until at least one new route is healthy
- **Deadlock prevention**: `RollingUpdateSpec` validator ensures at least one of `max_surge` or `max_unavailable` is positive
- **Timeout-based rollback**: The FSM does not detect failure — the coordinator's timeout mechanism handles it. If the deployment cannot complete within the DEPLOYING timeout (30 min), the coordinator transitions to ROLLING_BACK via the `expired` path
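
The first two guards can be sketched as follows. `RollingUpdateSpecSketch` and `guard_terminations` are hypothetical names used only for illustration; the real `RollingUpdateSpec` validator and the evaluator's termination logic may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollingUpdateSpecSketch:
    """Hypothetical stand-in for RollingUpdateSpec with the deadlock check."""

    max_surge: int
    max_unavailable: int

    def __post_init__(self) -> None:
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        if self.max_surge == 0 and self.max_unavailable == 0:
            # With both at zero, no route could ever be created or terminated.
            raise ValueError("at least one of max_surge or max_unavailable must be positive")


def guard_terminations(
    requested: int,
    old_active: int,
    new_healthy: int,
    max_unavailable: int,
    desired: int,
) -> int:
    """Cap terminations so old routes are not all removed before a new one is healthy."""
    if max_unavailable < desired and new_healthy == 0:
        # Keep at least one old route serving traffic.
        return min(requested, max(old_active - 1, 0))
    return min(requested, old_active)
```

For example, with three old routes, no healthy new route, and `max_unavailable = 1`, a request to terminate three routes is capped at two.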

## max_surge / max_unavailable Calculation

@@ -107,6 +130,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
@@ -115,7 +139,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ Old: [■ ■] (2 healthy) │
│ New: [◇] (1 provisioning) │
│ │
│ → PROVISIONING exists → wait
│ → PROVISIONING exists → skipped (wait)
└─────────────────────────────────────────────────────┘
@@ -129,6 +153,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
@@ -137,7 +162,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ Old: [■] (1 healthy) │
│ New: [■ ◇] (1 healthy, 1 provisioning) │
│ │
│ → PROVISIONING exists → wait
│ → PROVISIONING exists → skipped (wait)
└─────────────────────────────────────────────────────┘
@@ -151,6 +176,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ healthy=3, min_available=2 → can_terminate=1 │
│ │
│ → Create 1 New, Terminate 1 Old │
│ → need_retry (route mutations executed) │
└─────────────────────────────────────────────────────┘
@@ -159,7 +185,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ Old: [] │
│ New: [■ ■ ◇] (2 healthy, 1 provisioning) │
│ │
│ → PROVISIONING exists → wait
│ → PROVISIONING exists → skipped (wait)
└─────────────────────────────────────────────────────┘
@@ -170,12 +196,25 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ │
│ No Old and New >= desired_replicas → completed │
│ → deploying_revision → current_revision swap │
│ → DEPLOYING → READY state transition
│ → success → coordinator transitions to READY
└─────────────────────────────────────────────────────┘

Legend: ■ = healthy, ◇ = provisioning
```
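
The walkthrough above can be reproduced with a small planning function. The formula below is inferred from the numbers shown in the example (`healthy=3, min_available=2 → can_terminate=1`) and is an assumption, not the evaluator's actual code. `plan_cycle` is a hypothetical name, and the model is simplified so that every non-provisioning route is healthy.

```python
from typing import Tuple


def plan_cycle(
    old_healthy: int,
    new_healthy: int,
    new_provisioning: int,
    desired: int,
    max_surge: int,
    max_unavailable: int,
) -> Tuple[str, int, int]:
    """Return (result, routes_to_create, old_routes_to_terminate) for one cycle."""
    if new_provisioning > 0:
        return ("skipped", 0, 0)  # wait for provisioning routes to settle
    if old_healthy == 0 and new_healthy >= desired:
        return ("success", 0, 0)  # no old routes left and enough new ones

    total = old_healthy + new_healthy
    # Surge: never exceed desired + max_surge routes, nor desired new routes.
    can_create = max(0, min(desired + max_surge - total, desired - new_healthy))
    # Availability: keep at least desired - max_unavailable healthy routes.
    min_available = desired - max_unavailable
    can_terminate = max(0, min(old_healthy, total - min_available))

    if can_create == 0 and can_terminate == 0:
        return ("skipped", 0, 0)
    return ("need_retry", can_create, can_terminate)
```

With `desired=3, max_surge=1, max_unavailable=1`, `plan_cycle(3, 0, 0, 3, 1, 1)` yields `("need_retry", 1, 1)`, which matches the first cycle of the example.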

## Timeout and Rollback

Deploying timeout is handled through the coordinator's generic `expired` transition mechanism:

1. `DeployingProvisioningHandler` declares `expired → DEPLOYING/ROLLING_BACK` in `status_transitions()`
2. Each cycle, the coordinator checks `result.skipped` deployments against the DEPLOYING timeout (30 min)
3. Timeout is measured using `phase_started_at` from `DeploymentWithHistory` — the `created_at` of the first scheduling history record for this handler phase
4. `phase_started_at` is stable across retries: history records with same phase/error_code/to_status are merged (only `attempts` incremented, `created_at` unchanged)
5. Timed-out deployments transition to DEPLOYING/ROLLING_BACK
6. `DeployingRollingBackHandler` clears `deploying_revision` and transitions to READY

No separate timeout handler or periodic task is needed — timeout checking is built into the coordinator's standard transition handling.
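
A minimal sketch of why `phase_started_at` stays stable across retries, assuming a hypothetical `HistoryRecordSketch` type and assuming that merging only considers the most recent record (the merge rule in the real repository may differ):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

DEPLOYING_TIMEOUT = timedelta(minutes=30)


@dataclass
class HistoryRecordSketch:
    """Hypothetical scheduling history record."""

    phase: str
    error_code: Optional[str]
    to_status: str
    created_at: datetime
    attempts: int = 1


def record_attempt(
    history: List[HistoryRecordSketch],
    phase: str,
    error_code: Optional[str],
    to_status: str,
    now: datetime,
) -> None:
    """Merge into the last record when phase/error_code/to_status match."""
    last = history[-1] if history else None
    if last is not None and (last.phase, last.error_code, last.to_status) == (
        phase,
        error_code,
        to_status,
    ):
        last.attempts += 1  # created_at is deliberately left unchanged
        return
    history.append(HistoryRecordSketch(phase, error_code, to_status, created_at=now))


def phase_started_at(history: List[HistoryRecordSketch], phase: str) -> Optional[datetime]:
    """created_at of the first record for the given phase, if any."""
    for record in history:
        if record.phase == phase:
            return record.created_at
    return None


def is_expired(history: List[HistoryRecordSketch], phase: str, now: datetime) -> bool:
    started = phase_started_at(history, phase)
    return started is not None and now - started > DEPLOYING_TIMEOUT
```

Because repeated attempts only bump `attempts`, the timeout clock starts at the first attempt of the phase and is not reset by later retries.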

## Component Structure

```
@@ -209,7 +248,7 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ │ old_active: old + is_active() │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Route changes returned (applied by coordinator):
│ Route changes returned (applied by applier):
│ ┌────────────────────────────────────────────────────┐ │
│ │ rollout_specs: RouteCreatorSpec( │ │
│ │ revision_id = deploying_revision, │ │
@@ -223,14 +262,18 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
┌──────────────────────────────────────────────────────────────┐
│ Per-Sub-Step Handlers (coordinator generic path) │
│ DeployingProvisioningHandler │
│ (single handler for entire DEPLOYING lifecycle) │
│ │
│ PROVISIONING → DeployingProvisioningHandler │
│ next_status: DEPLOYING → coordinator records history │
│ completed → success → coordinator transitions to READY │
│ route mutations → need_retry → stays in PROVISIONING │
│ no changes → skipped → coordinator checks timeout │
│ evaluation errors → errors → classified by coordinator │
│ │
│ PROGRESSING → DeployingProgressingHandler │
│ next_status: DEPLOYING → coordinator records history │
│ completed=True → coordinator atomic revision swap + READY │
│ DeployingRollingBackHandler │
│ (cleanup on timeout) │
│ │
│ clear deploying_revision → success → READY │
└──────────────────────────────────────────────────────────────┘
```

@@ -242,11 +285,16 @@ When all Old routes are removed and New routes reach desired_replicas or above a
completed determination (evaluator)
Coordinator._transition_completed_deployments()
StrategyResultApplier.apply()
→ Atomic transaction:
1. complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
2. DEPLOYING → READY lifecycle transition
3. History recording
2. Returns completed_ids in StrategyApplyResult
DeployingProvisioningHandler
→ completed_ids → successes
→ coordinator transitions DEPLOYING → READY
→ History recording
```
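
The separation between the swap and the lifecycle transition can be sketched in memory. This illustration uses hypothetical names (`DeploymentRowSketch`, `complete_revision_swap`, `transition_to_ready`) and plain dictionaries; it does not model the database transaction that the real swap would presumably run in.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class DeploymentRowSketch:
    """Hypothetical deployment row with only the fields relevant here."""

    current_revision: Optional[str]
    deploying_revision: Optional[str]
    lifecycle: str = "DEPLOYING"


def complete_revision_swap(
    rows: Dict[str, DeploymentRowSketch], ids: Iterable[str]
) -> List[str]:
    """Promote deploying_revision to current_revision; return the swapped ids."""
    completed: List[str] = []
    for deployment_id in ids:
        row = rows[deployment_id]
        if row.deploying_revision is None:
            continue  # nothing to promote
        row.current_revision = row.deploying_revision
        row.deploying_revision = None
        completed.append(deployment_id)
    return completed


def transition_to_ready(
    rows: Dict[str, DeploymentRowSketch], completed_ids: Iterable[str]
) -> None:
    """Coordinator step: only ids reported as completed move to READY."""
    for deployment_id in completed_ids:
        rows[deployment_id].lifecycle = "READY"
```

The swap step only reports which deployments completed; the lifecycle change is a separate step driven by that list.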
3 changes: 3 additions & 0 deletions src/ai/backend/common/dto/manager/deployment/response.py
@@ -134,6 +134,9 @@ class DeploymentDTO(BaseModel):
deployment_policy: DeploymentPolicyDTO | None = Field(
default=None, description="Deployment rollout policy"
)
sub_step: str | None = Field(
default=None, description="Current deployment sub-step (e.g. deploying_provisioning, deploying_rolling_back)"
)


class CreateDeploymentResponse(BaseResponseModel):
1 change: 1 addition & 0 deletions src/ai/backend/manager/api/rest/deployment/adapter.py
@@ -141,6 +141,7 @@ def convert_to_dto(self, data: ModelDeploymentData) -> DeploymentDTO:
default_deployment_strategy=data.default_deployment_strategy,
current_revision=current_revision,
deployment_policy=deployment_policy,
sub_step=data.sub_step,
)

def build_querier(self, request: SearchDeploymentsRequest) -> BatchQuerier:
67 changes: 38 additions & 29 deletions src/ai/backend/manager/data/deployment/types.py
@@ -21,6 +21,7 @@
ModelDeploymentStatus,
ReadinessStatus,
)
from ai.backend.manager.errors.deployment import DeploymentRevisionNotFound

if TYPE_CHECKING:
from ai.backend.manager.data.session.types import SchedulingResult, SubStepResult
@@ -121,6 +122,10 @@ def is_active(self) -> bool:
def is_inactive(self) -> bool:
return self in self.inactive_route_statuses()

def is_provisioning(self) -> bool:
"""PROVISIONING or DEGRADED (still warming up, health checks not yet passing)."""
return self in (RouteStatus.PROVISIONING, RouteStatus.DEGRADED)

def termination_priority(self) -> int:
priority_map = {
RouteStatus.UNHEALTHY: 1,
@@ -148,45 +153,42 @@ class RouteTrafficStatus(enum.StrEnum):
# ========== Status Transition Types (BEP-1030) ==========


class DeploymentSubStatus(enum.StrEnum):
"""Base class for deployment lifecycle sub-statuses.
class DeploymentLifecycleSubStep(enum.StrEnum):
"""Sub-steps within deployment lifecycle phases.

Each lifecycle type can define its own sub-status enum by
inheriting from this class. For example, DEPLOYING handlers
use ``DeploymentSubStep`` (provisioning, rolling_back, …).
Member names are prefixed with the lifecycle phase they belong to
(e.g. ``DEPLOYING_``). String values are stored in the database as-is.
"""

# -- DEPLOYING phase --
DEPLOYING_PROVISIONING = "deploying_provisioning"
"""New revision routes are being provisioned and old routes are being drained."""
DEPLOYING_ROLLING_BACK = "deploying_rolling_back"
"""Clearing deploying_revision and transitioning to READY."""
DEPLOYING_COMPLETED = "deploying_completed"
"""All strategy conditions satisfied; triggers revision swap."""

class DeploymentSubStep(DeploymentSubStatus):
"""Sub-steps for the DEPLOYING lifecycle phase.

- PROVISIONING: New revision routes are being provisioned and old routes
are being drained. The main handler for rolling updates.
- ROLLING_BACK: Clearing deploying_revision and transitioning to READY.
- COMPLETED: All strategy conditions satisfied; triggers revision swap.
"""

PROVISIONING = "provisioning"
ROLLING_BACK = "rolling_back"
COMPLETED = "completed"
@classmethod
def deploying_handler_sub_steps(cls) -> tuple[DeploymentLifecycleSubStep, ...]:
"""Sub-steps that have their own deploying handler (excludes COMPLETED, which is an evaluator outcome)."""
return (cls.DEPLOYING_PROVISIONING, cls.DEPLOYING_ROLLING_BACK)


@dataclass(frozen=True)
class DeploymentLifecycleStatus:
"""Target lifecycle state for a deployment status transition.

Pairs an EndpointLifecycle with an optional sub-status to provide
Pairs an EndpointLifecycle with an optional sub-step to provide
context about which sub-step led to this transition.

Attributes:
lifecycle: The target endpoint lifecycle state
sub_status: Optional sub-status indicating what determined this
transition. Concrete values come from DeploymentSubStatus
subclasses (e.g. DeploymentSubStep for DEPLOYING handlers).
sub_step: Optional sub-step indicating what determined this
transition (e.g. DEPLOYING_* members for DEPLOYING handlers).
"""

lifecycle: EndpointLifecycle
sub_status: DeploymentSubStatus | None = None
sub_step: DeploymentLifecycleSubStep | None = None


@dataclass(frozen=True)
@@ -376,13 +378,19 @@ class DeploymentInfo:
current_revision_id: UUID | None = None
policy: DeploymentPolicyData | None = None
deploying_revision_id: UUID | None = None
sub_step: DeploymentSubStep | None = None

def resolve_revision_spec(self, revision_id: UUID) -> ModelRevisionSpec | None:
"""Find a ModelRevisionSpec by revision_id from model_revisions."""
return next(
(r for r in self.model_revisions if r.revision_id == revision_id),
None,
sub_step: DeploymentLifecycleSubStep | None = None

def resolve_revision_spec(self, revision_id: UUID) -> ModelRevisionSpec:
"""Find a ModelRevisionSpec by revision_id from model_revisions.

Raises:
DeploymentRevisionNotFound: If the revision is not found.
"""
for revision in self.model_revisions:
if revision.revision_id == revision_id:
return revision
raise DeploymentRevisionNotFound(
f"Revision {revision_id} not found in model_revisions of deployment {self.id}"
)
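
The change from returning `None` to raising affects callers, which now handle an exception instead of checking for `None`. A self-contained sketch of the new contract, using hypothetical stand-in types (`RevisionSpecSketch`, `DeploymentInfoSketch`, string ids instead of `UUID`) and a local exception class in place of the imported `DeploymentRevisionNotFound`:

```python
from dataclasses import dataclass, field
from typing import List


class DeploymentRevisionNotFound(Exception):
    """Local stand-in for the project's DeploymentRevisionNotFound error."""


@dataclass(frozen=True)
class RevisionSpecSketch:
    revision_id: str


@dataclass
class DeploymentInfoSketch:
    id: str
    model_revisions: List[RevisionSpecSketch] = field(default_factory=list)

    def resolve_revision_spec(self, revision_id: str) -> RevisionSpecSketch:
        """Return the matching spec or raise; never returns None."""
        for revision in self.model_revisions:
            if revision.revision_id == revision_id:
                return revision
        raise DeploymentRevisionNotFound(
            f"Revision {revision_id} not found in model_revisions of deployment {self.id}"
        )


def describe(info: DeploymentInfoSketch, revision_id: str) -> str:
    """Example caller: the missing case is handled in an except branch."""
    try:
        spec = info.resolve_revision_spec(revision_id)
    except DeploymentRevisionNotFound as exc:
        return f"missing: {exc}"
    return f"found: {spec.revision_id}"
```

A caller that forgets the missing case now fails loudly at the lookup rather than later with an attribute error on `None`.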


@@ -569,6 +577,7 @@ class ModelDeploymentData:
created_user_id: UUID
policy: DeploymentPolicyData | None = None
access_token_ids: list[UUID] | None = None
sub_step: DeploymentLifecycleSubStep | None = None


class DeploymentOrderField(enum.StrEnum):
Original file line number Diff line number Diff line change
@@ -17,7 +17,7 @@
from ai.backend.common.events.hub.hub import EventHub
from ai.backend.common.types import AgentId
from ai.backend.logging.utils import BraceStyleAdapter
from ai.backend.manager.data.deployment.types import DeploymentSubStep
from ai.backend.manager.data.deployment.types import DeploymentLifecycleSubStep
from ai.backend.manager.scheduler.types import ScheduleType
from ai.backend.manager.sokovan.deployment.coordinator import DeploymentCoordinator
from ai.backend.manager.sokovan.deployment.route.coordinator import RouteCoordinator
@@ -93,15 +93,15 @@ async def handle_do_deployment_lifecycle_if_needed(
) -> None:
"""Handle deployment lifecycle if needed event (checks marks)."""
lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
sub_step = DeploymentLifecycleSubStep(ev.sub_step) if ev.sub_step is not None else None
await self._deployment_coordinator.process_if_needed(lifecycle_type, sub_step)

async def handle_do_deployment_lifecycle(
self, _context: None, _agent_id: str, ev: DoDeploymentLifecycleEvent
) -> None:
"""Handle deployment lifecycle event (unconditional)."""
lifecycle_type = DeploymentLifecycleType(ev.lifecycle_type)
sub_step = DeploymentSubStep(ev.sub_step) if ev.sub_step else None
sub_step = DeploymentLifecycleSubStep(ev.sub_step) if ev.sub_step is not None else None
await self._deployment_coordinator.process_deployment_lifecycle(lifecycle_type, sub_step)

async def handle_do_route_lifecycle_if_needed(
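
The switch from a truthiness check to `is not None` changes how an empty string is treated. The sketch below uses a hypothetical `SubStepSketch` enum with two of the values from the diff; whether an empty `sub_step` can actually arrive on these events is not stated in the diff.

```python
import enum
from typing import Optional


class SubStepSketch(enum.Enum):
    """Hypothetical stand-in for DeploymentLifecycleSubStep."""

    DEPLOYING_PROVISIONING = "deploying_provisioning"
    DEPLOYING_ROLLING_BACK = "deploying_rolling_back"


def parse_truthy(raw: Optional[str]) -> Optional[SubStepSketch]:
    """Old check: any falsy value, including "", is treated as absent."""
    return SubStepSketch(raw) if raw else None


def parse_strict(raw: Optional[str]) -> Optional[SubStepSketch]:
    """New check: only None is treated as absent; "" is rejected by the enum."""
    return SubStepSketch(raw) if raw is not None else None
```

Under the stricter check, a malformed empty value raises `ValueError` from the enum lookup instead of being silently mapped to `None`.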