Author	Jeongseok Kang (jskang@lablup.com)
Status	Draft
Created	2026-04-27
Created-Version	26.5.0
Target-Version
Implemented-Version

Native Session Retry

Related Issues

GitHub Epic: #11320
GitHub: #11321

Motivation

Backend.AI core has no session-level retry. A BATCH session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in ERROR, and the user must manually re-create it.

The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (account_manager/models/utils.py), kernel restart on the agent (agent/agent.py:restarting_kernels), and tenacity-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec."

This pushes the retry concern out to every higher-level orchestrator on top of Backend.AI. Each one re-implements the same logic, with inconsistent semantics. Pushing retry into core gives:

A single source of truth for retry semantics — backoff, jitter, eligibility — shared by every caller.
Resilience for plain batch workloads without requiring an external orchestrator.
Reduced duplication; orchestrators above Backend.AI can thin out their retry layers.

Current Design

Session statuses are defined in src/ai/backend/manager/data/session/types.py:30-51:

PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED

Terminal statuses with no further transitions: ERROR, TERMINATED, CANCELLED. SessionStatus.retriable_statuses() (line 118) classifies which startup states are scheduling-retriable, but there is no notion of re-creating a terminal ERROR session.

Session creation flows through API handler → SessionService.create_from_params() → repository → SessionRow. SessionRow.creation_id already exists as an idempotency key. There are no fields for parent_session_id, retry_count, max_retries, or a retry policy.

The termination event handler (event_dispatcher/handlers/session.py) listens to session.terminated / session.error but has no retry decision hook.

No prior BEP covers session retry or fault tolerance.

Proposed Design

Mental model

max_retries > 0 means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to retries. The classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes.

`RetryPolicy` schema

A Pydantic DTO accepted at session creation, modeled on Airflow's parameter surface:

class BackoffStrategy(StrEnum):
    FIXED = "fixed"
    EXPONENTIAL = "exponential"

class JitterMode(StrEnum):
    NONE = "none"
    DETERMINISTIC = "deterministic"
    RANDOM = "random"

class RetryEligibleCause(StrEnum):
    AGENT_TRANSIENT = "agent_transient"
    SCHEDULER_TIMEOUT = "scheduler_timeout"
    IMAGE_PULL_FAILURE = "image_pull_failure"
    KERNEL_NONZERO_EXIT = "kernel_nonzero_exit"
    OOM_KILLED = "oom_killed"
    UNKNOWN = "unknown"

    @classmethod
    def defaults(cls) -> frozenset["RetryEligibleCause"]:
        return frozenset({
            cls.AGENT_TRANSIENT, cls.SCHEDULER_TIMEOUT,
            cls.IMAGE_PULL_FAILURE, cls.KERNEL_NONZERO_EXIT,
            cls.OOM_KILLED, cls.UNKNOWN,
        })

class RetryPolicy(BaseModel):
    max_retries: NonNegativeInt = 0
    retry_delay: PositiveFloat = 60.0
    backoff: BackoffStrategy = BackoffStrategy.FIXED
    backoff_multiplier: PositiveFloat = 2.0
    max_retry_delay: PositiveFloat | None = 3600.0
    jitter: JitterMode = JitterMode.DETERMINISTIC
    jitter_ratio: confloat(ge=0, le=1) = 0.25
    eligible_causes: frozenset[RetryEligibleCause] = Field(
        default_factory=RetryEligibleCause.defaults
    )
    emit_retry_events: bool = True

Mapping to Airflow:

Airflow	`RetryPolicy`
`retries`	`max_retries` (count, total attempts = `1 + max_retries`)
`retry_delay`	`retry_delay` (seconds)
`retry_exponential_backoff` (multiplier)	`backoff: fixed\|exponential` + `backoff_multiplier`
`max_retry_delay` (with 24 h hard ceiling)	`max_retry_delay` (24 h hard ceiling preserved)
SHA1-deterministic jitter	`jitter` (selectable: none / deterministic / random), `jitter_ratio`
Exception-typed eligibility	Structural enum `RetryEligibleCause`
`on_retry_callback`	`session.retry_scheduled` / `session.retry_exhausted` events
`default_args` precedence	Per-session > project/domain default > etcd cluster default
`email_on_retry`	Subsumed by event subscription via webhook plugin

Deviations from Airflow and their reasons:

No callback parameter. Keeps the policy serializable and the server's behavior auditable. Backend.AI is event-driven; downstream consumers subscribe to session.retry_* events.
Structural cause enum, not exception types. Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process.
max_retries is a count. Total attempts = 1 + max_retries, matching Backend.AI conventions and the existing pipeline orchestrator.

Failure classification

A central classify_failure(session, status_data) → RetryEligibleCause. Hardcoded non-retriable causes outside the enum: USER_CANCELLED, VALIDATION_ERROR, QUOTA_EXCEEDED. Users cannot opt these into retry.

Cause	Default eligible	Notes
`AGENT_TRANSIENT`	yes	Lost heartbeat, agent restart mid-run.
`SCHEDULER_TIMEOUT`	yes	Kernel-creation timeout under cluster pressure.
`IMAGE_PULL_FAILURE`	yes	Typo wastes a few seconds with backoff; registry blip is real.
`KERNEL_NONZERO_EXIT`	yes	The most common reason batch users want retry.
`OOM_KILLED`	yes	Retry without resource bump usually fails again, but exhausting `max_retries` is cheap.
`UNKNOWN`	yes	Conservative for unclassified failures.
`USER_CANCELLED`	hardcoded never	Permanent.
`VALIDATION_ERROR` / `QUOTA_EXCEEDED`	hardcoded never	Permanent.

Backoff formula

base = retry_delay                                                    if backoff == FIXED
       min(retry_delay * backoff_multiplier ** retry_count,            otherwise
           max_retry_delay or MAX_RETRY_DELAY)
delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio,
                     seed=(session_id, retry_count))
delay = min(delay, max_retry_delay or MAX_RETRY_DELAY)

MAX_RETRY_DELAY is a hard 24 h ceiling. Deterministic jitter takes SHA1(session_id || retry_count) mod (base * jitter_ratio), yielding reproducible delays — useful for tests. Random jitter samples uniformly in [base, base * (1 + jitter_ratio)).

Defaults precedence

Three layers, matching Airflow's default_args propagation:

Per-session policy in the create request.
Project / domain default (new optional field, admin-managed).
Cluster default in etcd: config/manager/retry_policy_default. Ship default: max_retries=0 → no behavior change.

Effective policy = deep-merge top-down; per-session wins.

Data model

One Alembic migration adds to sessions:

parent_session_id : UUID NULL  (self-FK)
retry_count       : INT  NOT NULL DEFAULT 0
max_retries       : INT  NOT NULL DEFAULT 0
retry_policy      : JSONB NULL
retry_cause       : TEXT NULL

Rationale: parent_session_id, retry_count, max_retries are first-class columns because they are queried for filters and joins. The rest live in JSONB. No new history table — the chain is a linked list of real SessionRows, each with its own status, kernels, logs, and status_data. Cheaper than a separate history table and consistent with Backend.AI's existing model.

Decision and dispatch

A new handler at event_dispatcher/handlers/session_retry.py subscribes to session.terminated / session.error:

Load session. If retry_count >= max_retries → emit session.retry_exhausted and return.
Classify failure. If cause not in eligible_causes (or in hardcoded never-retry set) → return.
Acquire row lock with select_for_update(). If a child with deterministic creation_id = parent.creation_id + ":retry:" + (retry_count + 1) already exists → return (idempotency).
Compute delay per the formula above.
Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep.

The retry path calls SessionService.create_from_params() with a CreateFromParamsAction derived from the parent (image, mounts, resource_slots, env, cluster spec, batch entrypoint). The child inherits retry_policy, sets parent_session_id to the parent, and retry_count = parent.retry_count + 1.

No new RETRYING status. The parent goes to ERROR as today; the child starts in PENDING as today. A computed retry_state field on the API tells clients "attempt N of M" / "this session has a pending child." This avoids touching the scheduler state machine.

API surface

REST v2 (api/rest/v2/sessions/):

POST /sessions — accept optional retry_policy in the request body.
GET /sessions/{id} — return parent_session_id, retry_count, max_retries, retry_policy, retry_cause, plus computed retry_chain (oldest → newest IDs).
GET /sessions/{id}/attempts — return the chain with status of each attempt.

GraphQL v2: mirror in api/gql/session/types.py — parentSession, retryCount, maxRetries, retryPolicy, retryCause, resolver retryChain.

Client SDK v2 + CLI v2: expose new fields; ./bai session info shows attempt N of M and links to the parent.

No retry mutation in v1. Manual retry is deferred until the auto path is stable.

Observability

Counters: bai_session_retry_scheduled_total{cause}, bai_session_retry_exhausted_total{cause}, bai_session_retry_succeeded_total.
Events: session.retry_scheduled, session.retry_exhausted — consumable by the webhook plugin. Replace the role of Airflow's on_retry_callback for downstream consumers.
Audit log entry per retry dispatch (auto, cause, attempt N of M).

Migration / Compatibility

Backward compatibility

Default max_retries=0 ⇒ zero behavior change for existing callers.
All new columns are nullable or default to safe zero values.
Existing GraphQL and REST clients continue to work; new fields are additive.

Migration steps

Apply Alembic migration adding the five columns. Migration is idempotent and backportable per src/ai/backend/manager/models/alembic/README.md.
Deploy manager with retry handler and surface, default off via etcd.
Operators opt in by setting cluster default or per-session policy.
External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental.

Breaking changes

None.

Implementation Plan

Six PRs, each tracked by its own sub-issue under #11320:

BEP draft (this document) — #11321.
Foundation: RetryPolicy DTO, classify_failure module, backoff utility (with deterministic jitter). Pure functions, no I/O, unit-test heavy.
Schema: Alembic migration, SessionRow field expansion, repository read/write for retry chain. Backportable.
Retry engine: event handler, SessionService.create_from_params extension, defaults precedence (project/domain/etcd), counters/events/audit.
API surface: REST v2 and GraphQL v2 fields, attempts endpoint.
Client: SDK v2, CLI v2 (./bai session info retry view), user docs.

Tests live with the code under test. Cross-cutting integration tests (transient → retry → success; exhaustion path; concurrent dispatch idempotency; jitter determinism) ship with the retry-engine PR.

Estimated effort: three to four weeks for one engineer.

Decision Log

Date	Decision	Rationale
2026-04-27	Batch sessions only in v1	Interactive sessions are user-driven and do not fit auto-retry semantics.
2026-04-27	Each retry is a fresh session, linked via `parent_session_id`	Matches existing pipeline orchestrator semantics; avoids reusing kernels/scratch and the complexity that would entail.
2026-04-27	No new `RETRYING` status	Parent goes to `ERROR`, child starts `PENDING` — avoids touching the scheduler state machine. Computed `retry_state` on the API is enough for clients.
2026-04-27	Linked-list chain, not a separate history table	The chain is already a list of real `SessionRow`s; no need to duplicate.
2026-04-27	Structural `RetryEligibleCause` enum, not exception-typed	Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process.
2026-04-27	`KERNEL_NONZERO_EXIT` is in the default eligible set	`max_retries > 0` should be the only knob a typical user touches; matches Airflow's "retry on failure, period" model.
2026-04-27	`USER_CANCELLED` / `VALIDATION_ERROR` / `QUOTA_EXCEEDED` are hardcoded non-retriable	These are permanent by definition; users cannot opt them into retry.
2026-04-27	No retry mutation in v1	Auto path stabilizes first; manual retry's interaction with `max_retries` is itself a design decision.
2026-04-27	Idempotency via deterministic child `creation_id`	Reuses an existing field; no new uniqueness constraint required.
2026-04-27	Deterministic jitter seed = `(session_id, retry_count)`	Reproducible for tests; trade-off vs. unpredictability is acceptable for a server-side retry.

Open Questions

Quota accounting: do retries count against concurrent-session limits? Likely yes, but needs a product call.
Retry-storm kill switch: should the etcd default be a single boolean toggle, a rate limit, or both? Leaning toward a boolean for v1 with a rate limit deferred.
Manual retry in v2: counts toward max_retries or independent? Decide before exposing.
Default for max_retry_delay: 1 h is conservative for long-running batch jobs that might benefit from a longer cooldown after repeated failures. Revisit after telemetry.
Project/domain defaults table location: extend an existing table or add a small new project_retry_defaults table?

References

Working draft: docs/investigation/native-session-retry-plan.md
Apache Airflow retry implementation: airflow-core/src/airflow/models/taskinstance.py:1109-1159
Existing scheduler state-machine BEP: BEP-1030
Alembic backport strategy: src/ai/backend/manager/models/alembic/README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Native Session Retry

Related Issues

Motivation

Current Design

Proposed Design

Mental model

`RetryPolicy` schema

Failure classification

Backoff formula

Defaults precedence

Data model

Decision and dispatch

API surface

Observability

Migration / Compatibility

Backward compatibility

Migration steps

Breaking changes

Implementation Plan

Decision Log

Open Questions

References

FilesExpand file tree

BEP-1053-native-session-retry.md

Latest commit

History

BEP-1053-native-session-retry.md

File metadata and controls

Native Session Retry

Related Issues

Motivation

Current Design

Proposed Design

Mental model

RetryPolicy schema

Failure classification

Backoff formula

Defaults precedence

Data model

Decision and dispatch

API surface

Observability

Migration / Compatibility

Backward compatibility

Migration steps

Breaking changes

Implementation Plan

Decision Log

Open Questions

References

`RetryPolicy` schema