Author BEP for native session retry #11321

@rapsealk

Sub-issue of #11320 (Epic: Native session retry support for batch sessions).

Goal

Author the design BEPs for batch resilience, so the rest of the implementation work has accepted reference documents to point at.

Scope

Two BEPs, both shipping in PR #11322:

  • BEP-1053 — Agent-level Batch Retry. Per-session batch_retries / batch_retry_delay knobs; agent re-runs the entrypoint inside the same kernel on non-zero exit. No manager-side state.
  • BEP-1054 — Session Rescheduling on Terminal Failure. New sokovan SessionLifecycleHandler that reschedules terminal-failed batch sessions to a different node when the failure is node-level. Reuses phase_attempts, makes SERVICE_MAX_RETRIES configurable, classification via etcd pattern config (extensible, not a closed enum).
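For illustration only, the two mechanisms could be sketched roughly as follows. This is not the BEP's implementation. The function names, the `NODE_LEVEL_PATTERNS` table and its entries are hypothetical, and the sketch assumes `batch_retries` counts re-runs after the first attempt and `batch_retry_delay` is in seconds. It runs the entrypoint as a subprocess rather than inside a kernel, and reads patterns from a local list rather than etcd.

```python
import re
import subprocess
import sys
import time
from typing import Sequence


def run_with_retries(
    cmd: Sequence[str],
    batch_retries: int = 0,
    batch_retry_delay: float = 0.0,
) -> tuple[int, int]:
    """Run cmd, re-running it on non-zero exit (BEP-1053 idea, simplified).

    Assumption: batch_retries is the number of extra attempts after the
    first one. Returns (final_exit_code, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.run(list(cmd)).returncode
        # Stop on success, or once the retry budget is used up.
        if exit_code == 0 or attempts > batch_retries:
            return exit_code, attempts
        time.sleep(batch_retry_delay)


# Hypothetical pattern table. Per BEP-1054 the real patterns would come
# from etcd config and stay extensible; these entries are made up.
NODE_LEVEL_PATTERNS = [
    r"CUDA error",
    r"No space left on device",
]


def is_node_level_failure(
    message: str,
    patterns: Sequence[str] = NODE_LEVEL_PATTERNS,
) -> bool:
    """Return True if the failure message matches any node-level pattern."""
    return any(re.search(p, message) for p in patterns)
```

Under these assumptions, a command that always exits non-zero with `batch_retries=2` is attempted three times before the failure is reported. Only then would a manager-side handler consult something like `is_node_level_failure` to decide whether rescheduling to a different node is worthwhile.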

Pivot rationale captured at docs/investigation/bep-1053-design-pivot.md.

Out of scope

  • Code implementation. Each BEP's Implementation Plan section enumerates the follow-up PRs.

Acceptance

Target version

26.5
