[Feature]: Support agent graceful shutdown

**<u>AgentScope-Java is an open-source project. To involve a broader community, we recommend asking your questions in English.</u>**


**Is your feature request related to a problem? Please describe.**

When an agent application receives a SIGTERM signal (e.g., during Kubernetes rolling updates, cloud platform auto-scaling, or a manual `kill` command), the JVM begins shutting down immediately and does not wait for in-flight agent work to finish. This causes several problems:

1. **In-flight LLM calls are wasted**: If an agent is mid-reasoning or mid-tool-execution, the ongoing LLM API call, which has already consumed paid tokens, is terminated abruptly and its output is lost.
2. **Agent state is lost**: Any intermediate reasoning steps, tool results, or accumulated memory that have not been persisted are discarded, forcing the next request to restart from scratch.
3. **HTTP connections leak**: The existing shutdown hook does not coordinate with the agent lifecycle, so `HttpTransport` resources may be closed while agents are still using them, or not closed cleanly at all.
4. **No recovery path**: When a client retries a request that was interrupted by shutdown, there is no mechanism to resume from where the agent left off, so the repeated user prompt is processed again from the beginning.

**Describe the solution you'd like**

A framework-level graceful shutdown mechanism that:

1. **Three-state shutdown lifecycle** (`RUNNING → SHUTTING_DOWN → TERMINATED`):
   - `RUNNING`: Normal operation, all requests accepted.
   - `SHUTTING_DOWN`: New requests are rejected with `AgentShuttingDownException`; in-flight requests are allowed to reach safe checkpoints before being interrupted.
   - `TERMINATED`: All requests have completed or timed out; HTTP transports are then closed.

2. **Safe checkpoint interruption via Hook system**:
   - A system-level `GracefulShutdownHook` is registered automatically.
   - During `SHUTTING_DOWN`, agents are interrupted only at safe checkpoints — after a reasoning phase completes (`PostReasoningEvent`), after tool execution completes (`PostActingEvent`), or after summary generation completes (`PostSummaryEvent`).
   - This ensures LLM output tokens are not wasted: the current phase finishes before the interrupt is issued.

3. **Configurable shutdown timeout with force interrupt**:
   - `GracefulShutdownConfig` allows setting a `shutdownTimeout` (or `null` for infinite wait) and a `partialReasoningPolicy` (`SAVE` or `DISCARD`).
   - A background monitor thread checks the elapsed time every second; when the timeout is reached, the state of every remaining active request is saved to the `Session` and those requests are then force-interrupted.

4. **Session-based state persistence and recovery**:
   - When an agent is interrupted during shutdown, its current state (memory, reasoning progress) is saved to the `Session` along with a `ShutdownInterruptedState` flag.
   - On the next request, the `GracefulShutdownHook` detects the flag at `PreCallEvent`, clears it, and discards the duplicate user input — allowing the agent to seamlessly resume from saved memory context.

5. **Active request tracking**:
   - `GracefulShutdownManager` (singleton) tracks all in-flight agent requests via `registerRequest` / `unregisterRequest`, integrated into `AgentBase.call()` using `Mono.using` to guarantee cleanup on success, error, or cancellation.

6. **Tool execution shutdown guard**:
   - `ToolExecutor` races each tool execution against a shutdown timeout signal (`Mono.firstWithSignal`), so long-running tools are cancelled promptly when the shutdown timeout is reached.

7. **Ordered resource cleanup**:
   - A unified `AgentScopeJvmShutdownHook` replaces the previous per-transport shutdown hooks, ensuring the shutdown order is: (1) graceful shutdown of agents → (2) await termination → (3) close HTTP transports.
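
To make the lifecycle and request-tracking items above concrete, here is a minimal plain-Java sketch. It is not the proposed implementation: `ShutdownManagerSketch` and its method bodies are assumptions for illustration, requests are identified by a plain `String`, and the reactive `Mono.using` integration is left out (in blocking code the equivalent guarantee would be a `try`/`finally` around the call).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Names mirror the proposal; the bodies are illustrative assumptions.
enum ShutdownState { RUNNING, SHUTTING_DOWN, TERMINATED }

class AgentShuttingDownException extends RuntimeException {
    AgentShuttingDownException(String message) {
        super(message);
    }
}

final class ShutdownManagerSketch {
    private final AtomicReference<ShutdownState> state =
            new AtomicReference<>(ShutdownState.RUNNING);
    private final Set<String> activeRequests = ConcurrentHashMap.newKeySet();

    /** Tracks a request while RUNNING; rejects it in any other state. */
    void registerRequest(String requestId) {
        // Add first, then re-check the state, so a request that races with
        // the start of shutdown is either tracked or rejected, never lost.
        activeRequests.add(requestId);
        if (state.get() != ShutdownState.RUNNING) {
            activeRequests.remove(requestId);
            terminateIfDrained();
            throw new AgentShuttingDownException("Rejected request " + requestId);
        }
    }

    /** Must run on success, error, and cancellation alike. */
    void unregisterRequest(String requestId) {
        activeRequests.remove(requestId);
        terminateIfDrained();
    }

    /** RUNNING -> SHUTTING_DOWN; goes straight to TERMINATED when idle. */
    void performGracefulShutdown() {
        state.compareAndSet(ShutdownState.RUNNING, ShutdownState.SHUTTING_DOWN);
        terminateIfDrained();
    }

    private void terminateIfDrained() {
        if (activeRequests.isEmpty()) {
            state.compareAndSet(ShutdownState.SHUTTING_DOWN, ShutdownState.TERMINATED);
        }
    }

    ShutdownState state() {
        return state.get();
    }
}
```

The sketch only covers state transitions and bookkeeping; checkpoint interruption, timeout enforcement, and session persistence would sit on top of it.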

**Describe alternatives you've considered**

1. **Application-level shutdown handling** — Requiring each application to implement its own shutdown logic (e.g., Spring's `@PreDestroy`, custom JVM hooks). This pushes complexity to users and doesn't benefit from framework-level agent lifecycle knowledge (safe checkpoints, session persistence).

2. **Simple timeout-based kill** — Just waiting a fixed duration before killing the process, without safe checkpoints. This wastes in-flight LLM tokens since reasoning phases would be interrupted mid-stream.

3. **Kubernetes preStop hook only** — Relying solely on Kubernetes preStop hooks to drain requests. This doesn't handle the agent-internal state (memory, reasoning progress) and has no concept of checkpoint-based interruption.

**Additional context**

Key components introduced:

| Class | Responsibility |
|---|---|
| `GracefulShutdownManager` | Singleton managing shutdown state machine, active request tracking, timeout enforcement |
| `GracefulShutdownConfig` | Configuration record: `shutdownTimeout`, `partialReasoningPolicy` |
| `GracefulShutdownHook` | System hook for checkpoint interruption and session resume deduplication |
| `AgentScopeJvmShutdownHook` | JVM shutdown hook with ordered cleanup |
| `ActiveRequestContext` | Per-request context with session binding and interrupt capability |
| `AgentShuttingDownException` | Exception for rejected/interrupted requests |
| `ShutdownState` | Enum: `RUNNING`, `SHUTTING_DOWN`, `TERMINATED` |
| `PartialReasoningPolicy` | Enum: `SAVE` or `DISCARD` incomplete reasoning on shutdown |
| `ShutdownInterruptedState` | Session-persisted flag for detecting interrupted sessions |
| `ShutdownSessionBinding` | Binds an agent to its session for shutdown persistence |
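
As a rough illustration of the configuration row in the table, the record below shows one way the two fields could be modelled. The field types (`Duration`, a nullable timeout meaning "wait indefinitely"), the validation, and the `isTimedOut` helper are assumptions for this sketch and are not specified by the proposal.

```java
import java.time.Duration;
import java.util.Objects;

// Policy for reasoning output that is incomplete when shutdown interrupts it.
enum PartialReasoningPolicy { SAVE, DISCARD }

// Hypothetical shape of the configuration record.
record GracefulShutdownConfig(
        Duration shutdownTimeout,                 // null means wait indefinitely
        PartialReasoningPolicy partialReasoningPolicy) {

    GracefulShutdownConfig {
        Objects.requireNonNull(partialReasoningPolicy, "partialReasoningPolicy");
        if (shutdownTimeout != null && shutdownTimeout.isNegative()) {
            throw new IllegalArgumentException("shutdownTimeout must not be negative");
        }
    }

    /** Hypothetical helper: true once the elapsed time reaches a finite timeout. */
    boolean isTimedOut(Duration elapsed) {
        return shutdownTimeout != null && elapsed.compareTo(shutdownTimeout) >= 0;
    }
}
```

A monitor thread like the one described in the proposal could call a check such as `isTimedOut` on each tick to decide when to force-interrupt the remaining requests.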

Shutdown flow diagram:

```
SIGTERM received
    │
    ▼
AgentScopeJvmShutdownHook
    │
    ├─► GracefulShutdownManager.performGracefulShutdown()
    │       state: RUNNING → SHUTTING_DOWN
    │       ├── New requests → rejected (AgentShuttingDownException)
    │       └── In-flight requests → continue until safe checkpoint
    │
    ├─► Monitor thread (every 1s)
    │       └── If timeout reached:
    │               ├── Save all active request states to Session
    │               ├── Force interrupt all active agents
    │               └── Emit shutdownTimeoutSignal (cancels tool executions)
    │
    ├─► GracefulShutdownHook (at checkpoints)
    │       └── PostReasoning / PostActing / PostSummary
    │               └── interruptIfShuttingDown(agent)
    │
    ├─► awaitTermination(timeout)
    │       └── Block until state == TERMINATED or timeout
    │
    └─► HttpTransportFactory.shutdown()
            └── Close all HTTP transports (only after agents are done)
```
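
The ordering in the diagram (start graceful shutdown, await termination, then close transports) can be sketched in plain Java as follows. `OrderedShutdownSketch`, its constructor parameters, and the use of a `CountDownLatch` as the "agents terminated" signal are assumptions for illustration; only `Runtime.addShutdownHook` is a real JDK API here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class OrderedShutdownSketch {
    private final Runnable beginGracefulShutdown;   // e.g. flip state to SHUTTING_DOWN
    private final CountDownLatch agentsTerminated;  // counted down once agents are done
    private final Runnable closeHttpTransports;     // e.g. close all HTTP transports

    OrderedShutdownSketch(Runnable beginGracefulShutdown,
                          CountDownLatch agentsTerminated,
                          Runnable closeHttpTransports) {
        this.beginGracefulShutdown = beginGracefulShutdown;
        this.agentsTerminated = agentsTerminated;
        this.closeHttpTransports = closeHttpTransports;
    }

    /** Runs the three steps in order; returns true if agents finished in time. */
    boolean run(long timeoutMillis) {
        beginGracefulShutdown.run();
        boolean drained = false;
        try {
            drained = agentsTerminated.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Transports are closed last, whether or not the wait succeeded.
            closeHttpTransports.run();
        }
        return drained;
    }

    /** Registers the whole sequence as a single JVM shutdown hook. */
    Thread install(long timeoutMillis) {
        Thread hook = new Thread(() -> run(timeoutMillis), "ordered-shutdown-sketch");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Using one hook for the whole sequence matters because the JVM starts separately registered shutdown hooks in an unspecified order and may run them concurrently, so per-transport hooks cannot guarantee that transports outlive the agents.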

Session recovery on retry:

```
Client retries after shutdown interruption
    │
    ▼
Agent.loadIfExists(session, sessionKey)
    └── bindSession() to GracefulShutdownManager
    │
    ▼
GracefulShutdownHook.onEvent(PreCallEvent)
    └── checkAndClearShutdownInterrupted()
            ├── Flag found → discard duplicate input, resume from saved memory
            └── Flag absent → normal processing
```
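
The check-and-clear step in the recovery flow could look roughly like the sketch below. It uses an in-memory map as a stand-in for the `Session`, and `ResumeDedupSketch`, `markInterrupted`, and `onPreCall` are hypothetical names; only `checkAndClearShutdownInterrupted` is taken from the diagram, and its signature here is assumed.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class ResumeDedupSketch {
    // Stand-in for the Session store: sessionKey -> interrupted flag.
    private final Map<String, Boolean> interruptedFlags = new ConcurrentHashMap<>();

    /** Called when an agent is interrupted during shutdown. */
    void markInterrupted(String sessionKey) {
        interruptedFlags.put(sessionKey, Boolean.TRUE);
    }

    /** Reads and clears the flag in one atomic step. */
    boolean checkAndClearShutdownInterrupted(String sessionKey) {
        return interruptedFlags.remove(sessionKey) != null;
    }

    /**
     * Returns the input to process. An empty result means the prompt is a
     * retry of an interrupted request and the agent should resume from its
     * saved memory instead of appending the prompt again.
     */
    Optional<String> onPreCall(String sessionKey, String userInput) {
        return checkAndClearShutdownInterrupted(sessionKey)
                ? Optional.empty()
                : Optional.of(userInput);
    }
}
```

Clearing the flag in the same step that reads it means only the first retry is deduplicated; later requests on the same session are processed normally.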


[Feature]: Support agent graceful shutdown #907
