| Author | Jeongseok Kang (jskang@lablup.com) |
|---|---|
| Status | Draft |
| Created | 2026-04-27 |
| Created-Version | 26.5.0 |
| Target-Version | |
| Implemented-Version | |
- GitHub Epic: #11320
- GitHub: #11321
Backend.AI core has no session-level retry. A BATCH session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in ERROR, and the user must manually re-create it.
The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (account_manager/models/utils.py), kernel restart on the agent (agent/agent.py:restarting_kernels), and tenacity-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec."
This pushes the retry concern out to every higher-level orchestrator on top of Backend.AI. Each one re-implements the same logic, with inconsistent semantics. Pushing retry into core gives:
- A single source of truth for retry semantics — backoff, jitter, eligibility — shared by every caller.
- Resilience for plain batch workloads without requiring an external orchestrator.
- Reduced duplication; orchestrators above Backend.AI can thin out their retry layers.
Session statuses are defined in src/ai/backend/manager/data/session/types.py:30-51:
PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED
Terminal statuses with no further transitions: ERROR, TERMINATED, CANCELLED. SessionStatus.retriable_statuses() (line 118) classifies which startup states are scheduling-retriable, but there is no notion of re-creating a terminal ERROR session.
Session creation flows through API handler → SessionService.create_from_params() → repository → SessionRow. SessionRow.creation_id already exists as an idempotency key. There are no fields for parent_session_id, retry_count, max_retries, or a retry policy.
The termination event handler (event_dispatcher/handlers/session.py) listens to session.terminated / session.error but has no retry decision hook.
No prior BEP covers session retry or fault tolerance.
`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance: any non-fatal exception triggers a retry, up to the task's `retries` limit. The classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes.
A Pydantic DTO accepted at session creation, modeled on Airflow's parameter surface:
```python
from enum import StrEnum

from pydantic import BaseModel, Field, NonNegativeInt, PositiveFloat, confloat


class BackoffStrategy(StrEnum):
    FIXED = "fixed"
    EXPONENTIAL = "exponential"


class JitterMode(StrEnum):
    NONE = "none"
    DETERMINISTIC = "deterministic"
    RANDOM = "random"


class RetryEligibleCause(StrEnum):
    AGENT_TRANSIENT = "agent_transient"
    SCHEDULER_TIMEOUT = "scheduler_timeout"
    IMAGE_PULL_FAILURE = "image_pull_failure"
    KERNEL_NONZERO_EXIT = "kernel_nonzero_exit"
    OOM_KILLED = "oom_killed"
    UNKNOWN = "unknown"

    @classmethod
    def defaults(cls) -> frozenset["RetryEligibleCause"]:
        # Every cause in this enum is retry-eligible by default.
        return frozenset({
            cls.AGENT_TRANSIENT, cls.SCHEDULER_TIMEOUT,
            cls.IMAGE_PULL_FAILURE, cls.KERNEL_NONZERO_EXIT,
            cls.OOM_KILLED, cls.UNKNOWN,
        })


class RetryPolicy(BaseModel):
    max_retries: NonNegativeInt = 0  # total attempts = 1 + max_retries
    retry_delay: PositiveFloat = 60.0  # seconds
    backoff: BackoffStrategy = BackoffStrategy.FIXED
    backoff_multiplier: PositiveFloat = 2.0
    max_retry_delay: PositiveFloat | None = 3600.0  # None falls back to the 24 h ceiling
    jitter: JitterMode = JitterMode.DETERMINISTIC
    jitter_ratio: confloat(ge=0, le=1) = 0.25
    eligible_causes: frozenset[RetryEligibleCause] = Field(
        default_factory=RetryEligibleCause.defaults
    )
    emit_retry_events: bool = True
```

Mapping to Airflow:
| Airflow | RetryPolicy |
|---|---|
| `retries` | `max_retries` (count, total attempts = 1 + max_retries) |
| `retry_delay` | `retry_delay` (seconds) |
| `retry_exponential_backoff` (multiplier) | `backoff: fixed\|exponential` + `backoff_multiplier` |
| `max_retry_delay` (with 24 h hard ceiling) | `max_retry_delay` (24 h hard ceiling preserved) |
| SHA1-deterministic jitter | `jitter` (selectable: none / deterministic / random), `jitter_ratio` |
| Exception-typed eligibility | Structural enum `RetryEligibleCause` |
| `on_retry_callback` | `session.retry_scheduled` / `session.retry_exhausted` events |
| `default_args` precedence | Per-session > project/domain default > etcd cluster default |
| `email_on_retry` | Subsumed by event subscription via webhook plugin |
Deviations from Airflow and their reasons:
- No callback parameter. Keeps the policy serializable and the server's behavior auditable. Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events.
- Structural cause enum, not exception types. Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process.
- `max_retries` is a count. Total attempts = `1 + max_retries`, matching Backend.AI conventions and the existing pipeline orchestrator.
A central `classify_failure(session, status_data) → RetryEligibleCause` function assigns each failed session a cause. The hardcoded non-retriable causes sit outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry.
| Cause | Default eligible | Notes |
|---|---|---|
| `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. |
| `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. |
| `IMAGE_PULL_FAILURE` | yes | Typo wastes a few seconds with backoff; registry blip is real. |
| `KERNEL_NONZERO_EXIT` | yes | The most common reason batch users want retry. |
| `OOM_KILLED` | yes | Retry without resource bump usually fails again, but exhausting `max_retries` is cheap. |
| `UNKNOWN` | yes | Conservative for unclassified failures. |
| `USER_CANCELLED` | hardcoded never | Permanent. |
| `VALIDATION_ERROR` / `QUOTA_EXCEEDED` | hardcoded never | Permanent. |
```
base  = retry_delay                                   if backoff == FIXED
        min(retry_delay * backoff_multiplier ** retry_count,
            max_retry_delay or MAX_RETRY_DELAY)       otherwise
delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio,
                     seed=(session_id, retry_count))
delay = min(delay, max_retry_delay or MAX_RETRY_DELAY)
```
MAX_RETRY_DELAY is a hard 24 h ceiling. Deterministic jitter takes SHA1(session_id || retry_count) mod (base * jitter_ratio), yielding reproducible delays — useful for tests. Random jitter samples uniformly in [base, base * (1 + jitter_ratio)).
Three layers, matching Airflow's default_args propagation:
- Per-session policy in the create request.
- Project / domain default (new optional field, admin-managed).
- Cluster default in etcd: `config/manager/retry_policy_default`. Ship default: `max_retries=0` → no behavior change.
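Effective policy = deep merge of the three layers, applied from the cluster default up to the per-session policy; on any conflicting key the per-session value wins, then the project/domain default, then the cluster default.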
One Alembic migration adds to sessions:
```
parent_session_id : UUID NULL (self-FK)
retry_count       : INT  NOT NULL DEFAULT 0
max_retries       : INT  NOT NULL DEFAULT 0
retry_policy      : JSONB NULL
retry_cause       : TEXT NULL
```
Rationale: `parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they are queried for filters and joins. The full policy lives in JSONB (`retry_policy`), and `retry_cause` is stored as plain text. No new history table is needed: the chain is a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. This is cheaper than a separate history table and consistent with Backend.AI's existing model.
A new handler at event_dispatcher/handlers/session_retry.py subscribes to session.terminated / session.error:
- Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return.
- Classify failure. If cause not in `eligible_causes` (or in hardcoded never-retry set) → return.
- Acquire row lock with `select_for_update()`. If a child with deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists → return (idempotency).
- Compute `delay` per the formula above.
- Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep.
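
The decision part of the handler steps above can be sketched as a pure function. The names `ParentSession`, `child_creation_id`, and `decide`, and the string return values, are hypothetical; the row lock and the actual scheduling call are not modeled, and causes are passed as plain strings for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentSession:
    """Hypothetical subset of the fields the handler reads from the parent."""

    creation_id: str
    retry_count: int
    max_retries: int


def child_creation_id(parent: ParentSession) -> str:
    # Deterministic, so two concurrent dispatches compute the same key.
    return f"{parent.creation_id}:retry:{parent.retry_count + 1}"


def decide(
    parent: ParentSession,
    cause: str,
    eligible_causes: frozenset[str],
    existing_creation_ids: set[str],
) -> str:
    """Return "exhausted", "ineligible", "duplicate", or "schedule"."""
    if parent.retry_count >= parent.max_retries:
        return "exhausted"  # caller emits session.retry_exhausted
    if cause not in eligible_causes:
        return "ineligible"
    # In the real handler this lookup happens while holding the row lock.
    if child_creation_id(parent) in existing_creation_ids:
        return "duplicate"
    return "schedule"
```

For a parent with `creation_id="abc"` and `retry_count=0`, the child key is `abc:retry:1`; a second dispatch that finds this key already present returns `"duplicate"` instead of creating another child.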
The retry path calls SessionService.create_from_params() with a CreateFromParamsAction derived from the parent (image, mounts, resource_slots, env, cluster spec, batch entrypoint). The child inherits retry_policy, sets parent_session_id to the parent, and retry_count = parent.retry_count + 1.
No new RETRYING status. The parent goes to ERROR as today; the child starts in PENDING as today. A computed retry_state field on the API tells clients "attempt N of M" / "this session has a pending child." This avoids touching the scheduler state machine.
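
One possible shape for the computed `retry_state` field is sketched below. The key names and the `has_pending_child` input are assumptions made for illustration; the proposal only states that the field conveys "attempt N of M" and whether a pending child exists.

```python
def retry_state(retry_count: int, max_retries: int, has_pending_child: bool) -> dict[str, object]:
    attempt = retry_count + 1  # attempts are 1-based for display
    total = max_retries + 1    # total attempts = 1 + max_retries
    return {
        "attempt": attempt,
        "total_attempts": total,
        "has_pending_child": has_pending_child,
        "label": f"attempt {attempt} of {total}",
    }
```

Since the value is derived from stored columns at read time, no session status or scheduler transition has to change to support it.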
REST v2 (api/rest/v2/sessions/):
- `POST /sessions`: accept optional `retry_policy` in the request body.
- `GET /sessions/{id}`: return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs).
- `GET /sessions/{id}/attempts`: return the chain with status of each attempt.
GraphQL v2: mirror in api/gql/session/types.py — parentSession, retryCount, maxRetries, retryPolicy, retryCause, resolver retryChain.
Client SDK v2 + CLI v2: expose new fields; ./bai session info shows attempt N of M and links to the parent.
No retry mutation in v1. Manual retry is deferred until the auto path is stable.
- Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`.
- Events: `session.retry_scheduled`, `session.retry_exhausted`, both consumable by the webhook plugin. They replace the role of Airflow's `on_retry_callback` for downstream consumers.
- Audit log entry per retry dispatch (auto, cause, attempt N of M).
- Default `max_retries=0` ⇒ zero behavior change for existing callers.
- All new columns are nullable or default to safe zero values.
- Existing GraphQL and REST clients continue to work; new fields are additive.
- Apply Alembic migration adding the five columns. Migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`.
- Deploy manager with retry handler and surface, default off via etcd.
- Operators opt in by setting cluster default or per-session policy.
- External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental.
None.
Six PRs, each tracked by its own sub-issue under #11320:
- BEP draft (this document): #11321.
- Foundation: `RetryPolicy` DTO, `classify_failure` module, backoff utility (with deterministic jitter). Pure functions, no I/O, unit-test heavy.
- Schema: Alembic migration, `SessionRow` field expansion, repository read/write for retry chain. Backportable.
- Retry engine: event handler, `SessionService.create_from_params` extension, defaults precedence (project/domain/etcd), counters/events/audit.
- API surface: REST v2 and GraphQL v2 fields, `attempts` endpoint.
- Client: SDK v2, CLI v2 (`./bai session info` retry view), user docs.
Tests live with the code under test. Cross-cutting integration tests (transient → retry → success; exhaustion path; concurrent dispatch idempotency; jitter determinism) ship with the retry-engine PR.
Estimated effort: three to four weeks for one engineer.
| Date | Decision | Rationale |
|---|---|---|
| 2026-04-27 | Batch sessions only in v1 | Interactive sessions are user-driven and do not fit auto-retry semantics. |
| 2026-04-27 | Each retry is a fresh session, linked via `parent_session_id` | Matches existing pipeline orchestrator semantics; avoids reusing kernels/scratch and the complexity that would entail. |
| 2026-04-27 | No new `RETRYING` status | Parent goes to `ERROR`, child starts `PENDING`; this avoids touching the scheduler state machine. Computed `retry_state` on the API is enough for clients. |
| 2026-04-27 | Linked-list chain, not a separate history table | The chain is already a list of real `SessionRow`s; no need to duplicate. |
| 2026-04-27 | Structural `RetryEligibleCause` enum, not exception-typed | Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. |
| 2026-04-27 | `KERNEL_NONZERO_EXIT` is in the default eligible set | `max_retries > 0` should be the only knob a typical user touches; matches Airflow's "retry on failure, period" model. |
| 2026-04-27 | `USER_CANCELLED` / `VALIDATION_ERROR` / `QUOTA_EXCEEDED` are hardcoded non-retriable | These are permanent by definition; users cannot opt them into retry. |
| 2026-04-27 | No retry mutation in v1 | Auto path stabilizes first; manual retry's interaction with `max_retries` is itself a design decision. |
| 2026-04-27 | Idempotency via deterministic child `creation_id` | Reuses an existing field; no new uniqueness constraint required. |
| 2026-04-27 | Deterministic jitter seed = `(session_id, retry_count)` | Reproducible for tests; trade-off vs. unpredictability is acceptable for a server-side retry. |
- Quota accounting: do retries count against concurrent-session limits? Likely yes, but needs a product call.
- Retry-storm kill switch: should the etcd default be a single boolean toggle, a rate limit, or both? Leaning toward a boolean for v1 with a rate limit deferred.
- Manual retry in v2: counts toward `max_retries` or independent? Decide before exposing.
- Default for `max_retry_delay`: 1 h is conservative for long-running batch jobs that might benefit from a longer cooldown after repeated failures. Revisit after telemetry.
- Project/domain defaults table location: extend an existing table or add a small new `project_retry_defaults` table?
- Working draft: `docs/investigation/native-session-retry-plan.md`
- Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159`
- Existing scheduler state-machine BEP: BEP-1030
- Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md`