Sub-issue of #11320 (Epic: Native session retry support for batch sessions).
Goal
Author the design BEPs for batch resilience, so the rest of the implementation work has accepted reference documents to point at.
Scope
Two BEPs, both shipping in PR #11322:
- BEP-1053: Agent-level Batch Retry. Per-session batch_retries / batch_retry_delay knobs; the agent re-runs the entrypoint inside the same kernel on non-zero exit. No manager-side state.
- BEP-1054: Session Rescheduling on Terminal Failure. A new sokovan SessionLifecycleHandler that reschedules terminal-failed batch sessions to a different node when the failure is node-level. Reuses phase_attempts, makes SERVICE_MAX_RETRIES configurable, and classifies failures via etcd pattern config (extensible, not a closed enum).
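For orientation only, the BEP-1053 behaviour could look roughly like the sketch below. It is not the agent's actual code: the function name run_with_retries is hypothetical, and it assumes batch_retries counts additional attempts after the first run and that batch_retry_delay is in seconds. The BEP itself is the authority on both points.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, batch_retries=0, batch_retry_delay=0.0, sleep=time.sleep):
    """Run cmd and re-run it on non-zero exit (hypothetical helper).

    Assumption: batch_retries is the number of extra attempts after the
    first run, and batch_retry_delay is a pause in seconds between attempts.
    Returns (last_exit_code, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.run(cmd).returncode
        # Stop on success, or once the retry budget is used up.
        if exit_code == 0 or attempts > batch_retries:
            return exit_code, attempts
        sleep(batch_retry_delay)


if __name__ == "__main__":
    # An entrypoint that always fails with exit code 3.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_with_retries(failing, batch_retries=2, batch_retry_delay=0.1))
```

All retry state lives in local variables of the loop, which is the sense in which this approach needs no manager-side state.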
Pivot rationale captured at docs/investigation/bep-1053-design-pivot.md.
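As a rough illustration of the BEP-1054 idea of pattern-based classification, the sketch below treats the pattern list as plain data, so new failure classes can be added without a code change. Everything in it is an assumption made for illustration: the pattern strings, the function names, and the idea that patterns are regular expressions matched against a failure message. The real schema and storage in etcd are defined by the BEP.

```python
import re

# Hypothetical node-level failure patterns. In the design these would be
# loaded from etcd; the strings here are illustrative only.
NODE_LEVEL_PATTERNS = [
    r"No space left on device",
    r"CUDA error",
]


def is_node_level_failure(message, patterns=NODE_LEVEL_PATTERNS):
    """Return True if the failure message matches any configured pattern."""
    return any(re.search(p, message) for p in patterns)


def should_reschedule(message, attempts, max_retries, patterns=NODE_LEVEL_PATTERNS):
    """Reschedule only node-level failures that still have retry budget.

    max_retries stands in for a configurable SERVICE_MAX_RETRIES value.
    """
    return attempts < max_retries and is_node_level_failure(message, patterns)
```

Because the patterns are an open list rather than a closed enum, an operator could extend the classification by appending an entry to the configured list.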
Out of scope
- Code implementation. Each BEP's Implementation Plan section enumerates the follow-up PRs.
Acceptance
- Both BEPs merged to main with status Draft.
Target version
26.5