[Feature]: Support agent graceful shutdown #907

@uuuyuqi

Is your feature request related to a problem? Please describe.

When an agent application receives a SIGTERM signal (e.g., during Kubernetes rolling updates, cloud platform auto-scaling, or manual kill commands), the JVM shuts down immediately. This causes several problems:

  1. In-flight LLM calls are wasted: if an agent is mid-reasoning or mid-tool-execution, the ongoing LLM API call (which costs tokens and money) is terminated abruptly and its output is lost.
  2. Agent state is lost: any intermediate reasoning steps, tool results, or accumulated memory that have not been persisted are discarded, forcing a full restart from scratch on the next request.
  3. HTTP connections leak: HttpTransport resources may not be closed properly because the existing shutdown hook does not coordinate with the agent lifecycle, so transports can be closed while agents are still using them.
  4. No recovery path: when a client retries after a shutdown-interrupted request, there is no mechanism to resume from where the agent left off, so the duplicate user prompt leads to redundant processing.

Describe the solution you'd like

A framework-level graceful shutdown mechanism with the following parts:

  1. Three-phase shutdown lifecycle (RUNNING → SHUTTING_DOWN → TERMINATED):

    • RUNNING: Normal operation, all requests accepted.
    • SHUTTING_DOWN: New requests are rejected with AgentShuttingDownException; in-flight requests are allowed to reach safe checkpoints before being interrupted.
    • TERMINATED: All requests have completed or timed out; HTTP transports are then closed.
  2. Safe checkpoint interruption via Hook system:

    • A system-level GracefulShutdownHook is registered automatically.
    • During SHUTTING_DOWN, agents are interrupted only at safe checkpoints — after a reasoning phase completes (PostReasoningEvent), after tool execution completes (PostActingEvent), or after summary generation completes (PostSummaryEvent).
    • This ensures LLM output tokens are not wasted: the current phase finishes before the interrupt is issued.
  3. Configurable shutdown timeout with force interrupt:

    • GracefulShutdownConfig allows setting a shutdownTimeout (or null for infinite wait) and a partialReasoningPolicy (SAVE or DISCARD).
    • A background monitor thread checks elapsed time every second; when the timeout is reached, all remaining active requests are force-interrupted and their state is persisted.
  4. Session-based state persistence and recovery:

    • When an agent is interrupted during shutdown, its current state (memory, reasoning progress) is saved to the Session along with a ShutdownInterruptedState flag.
    • On the next request, the GracefulShutdownHook detects the flag at PreCallEvent, clears it, and discards the duplicate user input — allowing the agent to seamlessly resume from saved memory context.
  5. Active request tracking:

    • GracefulShutdownManager (singleton) tracks all in-flight agent requests via registerRequest / unregisterRequest, integrated into AgentBase.call() using Mono.using to guarantee cleanup on success, error, or cancellation.
  6. Tool execution shutdown guard:

    • ToolExecutor races each tool execution against a shutdown timeout signal (Mono.firstWithSignal), so long-running tools are cancelled promptly when the shutdown timeout is reached.
  7. Ordered resource cleanup:

    • A unified AgentScopeJvmShutdownHook replaces the previous per-transport shutdown hooks, ensuring the shutdown order is: (1) graceful shutdown of agents → (2) await termination → (3) close HTTP transports.
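
To make the lifecycle and request-tracking items above concrete, here is a minimal sketch in plain JDK Java. It is not the proposed implementation: `ShutdownTrackerSketch`, `beginShutdown`, `terminateIfDrained`, and the synchronous `call` helper are hypothetical names for illustration, and a `try`/`finally` block stands in for the `Mono.using` cleanup guarantee mentioned in item 5. It assumes request IDs are unique.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

enum ShutdownState { RUNNING, SHUTTING_DOWN, TERMINATED }

class AgentShuttingDownException extends RuntimeException {
    AgentShuttingDownException(String message) { super(message); }
}

// Hypothetical stand-in for GracefulShutdownManager; not the project's actual class.
final class ShutdownTrackerSketch {
    private final AtomicReference<ShutdownState> state =
            new AtomicReference<>(ShutdownState.RUNNING);
    private final Set<String> activeRequests = ConcurrentHashMap.newKeySet();

    ShutdownState state() { return state.get(); }

    int activeCount() { return activeRequests.size(); }

    void registerRequest(String requestId) {
        // Add first, then check the state, so a concurrent beginShutdown()
        // cannot observe an empty set while this request is being admitted.
        activeRequests.add(requestId);
        if (state.get() != ShutdownState.RUNNING) {
            unregisterRequest(requestId);
            throw new AgentShuttingDownException("shutting down, rejected " + requestId);
        }
    }

    void unregisterRequest(String requestId) {
        activeRequests.remove(requestId);
        terminateIfDrained();
    }

    // RUNNING -> SHUTTING_DOWN; moves straight on to TERMINATED if nothing is in flight.
    void beginShutdown() {
        state.compareAndSet(ShutdownState.RUNNING, ShutdownState.SHUTTING_DOWN);
        terminateIfDrained();
    }

    private void terminateIfDrained() {
        if (activeRequests.isEmpty()) {
            state.compareAndSet(ShutdownState.SHUTTING_DOWN, ShutdownState.TERMINATED);
        }
    }

    // Synchronous analogue of wrapping the agent call: cleanup runs on success or error.
    <T> T call(String requestId, Supplier<T> work) {
        registerRequest(requestId);
        try {
            return work.get();
        } finally {
            unregisterRequest(requestId);
        }
    }
}
```

In this sketch the last `unregisterRequest` after `beginShutdown` is what moves the state to TERMINATED, which is the condition the transport cleanup in item 7 would wait for. Checkpoint interruption and the timeout monitor are left out.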

Describe alternatives you've considered

  1. Application-level shutdown handling — Requiring each application to implement its own shutdown logic (e.g., Spring's @PreDestroy, custom JVM hooks). This pushes complexity to users and doesn't benefit from framework-level agent lifecycle knowledge (safe checkpoints, session persistence).

  2. Simple timeout-based kill — Just waiting a fixed duration before killing the process, without safe checkpoints. This wastes in-flight LLM tokens since reasoning phases would be interrupted mid-stream.

  3. Kubernetes preStop hook only — Relying solely on Kubernetes preStop hooks to drain requests. This doesn't handle the agent-internal state (memory, reasoning progress) and has no concept of checkpoint-based interruption.

Additional context

Key components introduced:

| Class | Responsibility |
| --- | --- |
| GracefulShutdownManager | Singleton managing shutdown state machine, active request tracking, timeout enforcement |
| GracefulShutdownConfig | Configuration record: shutdownTimeout, partialReasoningPolicy |
| GracefulShutdownHook | System hook for checkpoint interruption and session resume deduplication |
| AgentScopeJvmShutdownHook | JVM shutdown hook with ordered cleanup |
| ActiveRequestContext | Per-request context with session binding and interrupt capability |
| AgentShuttingDownException | Exception for rejected/interrupted requests |
| ShutdownState | Enum: RUNNING, SHUTTING_DOWN, TERMINATED |
| PartialReasoningPolicy | Enum: SAVE or DISCARD incomplete reasoning on shutdown |
| ShutdownInterruptedState | Session-persisted flag for detecting interrupted sessions |
| ShutdownSessionBinding | Binds an agent to its session for shutdown persistence |
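
As an illustration of the configuration component, the record below sketches one possible shape, assuming `shutdownTimeout` is a nullable `java.time.Duration` where `null` means wait indefinitely. The record name `ShutdownConfigSketch`, the validation, and the `isTimedOut` helper are assumptions for this example, not part of the proposal.

```java
import java.time.Duration;
import java.util.Objects;

enum PartialReasoningPolicy { SAVE, DISCARD }

// Hypothetical shape for the configuration record; field names follow the issue text.
record ShutdownConfigSketch(Duration shutdownTimeout,
                            PartialReasoningPolicy partialReasoningPolicy) {

    ShutdownConfigSketch {
        Objects.requireNonNull(partialReasoningPolicy, "partialReasoningPolicy");
        if (shutdownTimeout != null && shutdownTimeout.isNegative()) {
            throw new IllegalArgumentException("shutdownTimeout must not be negative");
        }
    }

    // A null timeout means "wait forever", so it never times out.
    boolean isTimedOut(Duration elapsed) {
        return shutdownTimeout != null && elapsed.compareTo(shutdownTimeout) >= 0;
    }
}
```

A monitor thread like the one described earlier could call `isTimedOut` once per second with the time elapsed since shutdown began.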

Shutdown flow diagram:

SIGTERM received
    │
    ▼
AgentScopeJvmShutdownHook
    │
    ├─► GracefulShutdownManager.performGracefulShutdown()
    │       state: RUNNING → SHUTTING_DOWN
    │       ├── New requests → rejected (AgentShuttingDownException)
    │       └── In-flight requests → continue until safe checkpoint
    │
    ├─► Monitor thread (every 1s)
    │       └── If timeout reached:
    │               ├── Save all active request states to Session
    │               ├── Force interrupt all active agents
    │               └── Emit shutdownTimeoutSignal (cancels tool executions)
    │
    ├─► GracefulShutdownHook (at checkpoints)
    │       └── PostReasoning / PostActing / PostSummary
    │               └── interruptIfShuttingDown(agent)
    │
    ├─► awaitTermination(timeout)
    │       └── Block until state == TERMINATED or timeout
    │
    └─► HttpTransportFactory.shutdown()
            └── Close all HTTP transports (only after agents are done)
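
The ordering in the diagram above (start graceful shutdown, await termination, then close transports) can be sketched with a `CountDownLatch` in plain JDK Java. Everything here is a hypothetical illustration: `OrderedShutdownSketch`, `markAgentsDone`, and the recorded step names are not project APIs, and the real hook would call the manager and `HttpTransportFactory` instead of appending to a list.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of ordered cleanup; steps are recorded instead of executed.
final class OrderedShutdownSketch {
    private final List<String> steps = new CopyOnWriteArrayList<>();
    private final CountDownLatch agentsDone = new CountDownLatch(1);

    // Called when the last in-flight agent request has finished.
    void markAgentsDone() { agentsDone.countDown(); }

    List<String> steps() { return List.copyOf(steps); }

    void runShutdownSequence(long timeoutMillis) {
        // (1) Stop accepting new requests and let in-flight ones reach a checkpoint.
        steps.add("graceful-shutdown-started");

        // (2) Block until agents are done or the timeout elapses.
        boolean drained = false;
        try {
            drained = agentsDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        steps.add(drained ? "agents-terminated" : "await-timed-out");

        // (3) Only now release shared resources such as HTTP transports.
        steps.add("http-transports-closed");
    }

    // Wraps the sequence in a thread suitable for Runtime.addShutdownHook.
    Thread asJvmHook(long timeoutMillis) {
        return new Thread(() -> runShutdownSequence(timeoutMillis), "shutdown-sketch");
    }
}
```

Registering it would look like `Runtime.getRuntime().addShutdownHook(new OrderedShutdownSketch().asJvmHook(30_000));`. Using one hook for the whole sequence matters because the JVM starts separately registered shutdown hooks in an unspecified order and may run them concurrently.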

Session recovery on retry:

Client retries after shutdown interruption
    │
    ▼
Agent.loadIfExists(session, sessionKey)
    └── bindSession() to GracefulShutdownManager
    │
    ▼
GracefulShutdownHook.onEvent(PreCallEvent)
    └── checkAndClearShutdownInterrupted()
            ├── Flag found → discard duplicate input, resume from saved memory
            └── Flag absent → normal processing
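
The retry path above amounts to a check-and-clear on a session flag. The sketch below models the session as an in-memory map and the agent memory as a list of strings; both are simplifying assumptions, as are the names `SessionResumeSketch`, `markInterrupted`, `onPreCall`, and the flag key. Only `checkAndClearShutdownInterrupted` is taken from the diagram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of resume deduplication; the real Session and memory types differ.
final class SessionResumeSketch {
    private static final String INTERRUPTED_FLAG = "shutdownInterrupted"; // assumed key

    private final Map<String, Object> session = new ConcurrentHashMap<>();
    private final List<String> memory = new ArrayList<>();

    // Called when shutdown interrupts the agent: memory is kept and the flag is persisted.
    void markInterrupted() {
        session.put(INTERRUPTED_FLAG, Boolean.TRUE);
    }

    // Atomically reads and removes the flag; returns true only on the first call after an interrupt.
    boolean checkAndClearShutdownInterrupted() {
        return session.remove(INTERRUPTED_FLAG) != null;
    }

    // PreCall step: skip the retried prompt if the previous call was interrupted.
    List<String> onPreCall(String userInput) {
        if (!checkAndClearShutdownInterrupted()) {
            memory.add(userInput);
        }
        return List.copyOf(memory);
    }
}
```

Note that this sketch drops the retried input whenever the flag is set, without comparing it to the original prompt. Whether the real hook should verify that the retry matches the interrupted request is an open design question.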

Metadata

Labels: enhancement
Status: In progress