Description
AgentScope-Java is an open-source project. To involve a broader community, we recommend asking your questions in English.
Is your feature request related to a problem? Please describe.
When an agent application receives a SIGTERM signal (e.g., during Kubernetes rolling updates, cloud platform auto-scaling, or manual kill commands), the JVM starts shutting down immediately, without waiting for in-flight agent work to finish. This causes several problems:
- In-flight LLM calls are wasted: if an agent is mid-reasoning or mid-tool-execution, the ongoing LLM API call (which costs tokens and money) is abruptly terminated and its output is lost.
- Agent state is lost: any intermediate reasoning steps, tool results, or accumulated memory that has not been persisted is discarded, forcing a full restart from scratch on the next request.
- HTTP connections leak: `HttpTransport` resources may not be properly closed because the existing shutdown hook doesn't coordinate with the agent lifecycle, potentially closing transports while agents are still using them.
- No recovery path: when a client retries after a shutdown-interrupted request, there is no mechanism to resume from where the agent left off; the duplicate user prompt leads to redundant processing.
Describe the solution you'd like
A framework-level graceful shutdown mechanism with the following parts:
- Three-phase shutdown lifecycle (`RUNNING → SHUTTING_DOWN → TERMINATED`):
  - `RUNNING`: Normal operation, all requests accepted.
  - `SHUTTING_DOWN`: New requests are rejected with `AgentShuttingDownException`; in-flight requests are allowed to reach safe checkpoints before being interrupted.
  - `TERMINATED`: All requests have completed or timed out; HTTP transports are then closed.
- Safe checkpoint interruption via Hook system:
  - A system-level `GracefulShutdownHook` is registered automatically.
  - During `SHUTTING_DOWN`, agents are interrupted only at safe checkpoints: after a reasoning phase completes (`PostReasoningEvent`), after tool execution completes (`PostActingEvent`), or after summary generation completes (`PostSummaryEvent`).
  - This ensures LLM output tokens are not wasted: the current phase finishes before the interrupt is issued.
- Configurable shutdown timeout with force interrupt:
  - `GracefulShutdownConfig` allows setting a `shutdownTimeout` (or `null` for infinite wait) and a `partialReasoningPolicy` (`SAVE` or `DISCARD`).
  - A background monitor thread checks elapsed time every second; when the timeout is reached, all remaining active requests are force-interrupted and their state is persisted.
- Session-based state persistence and recovery:
  - When an agent is interrupted during shutdown, its current state (memory, reasoning progress) is saved to the `Session` along with a `ShutdownInterruptedState` flag.
  - On the next request, the `GracefulShutdownHook` detects the flag at `PreCallEvent`, clears it, and discards the duplicate user input, allowing the agent to seamlessly resume from saved memory context.
- Active request tracking:
  - `GracefulShutdownManager` (singleton) tracks all in-flight agent requests via `registerRequest` / `unregisterRequest`, integrated into `AgentBase.call()` using `Mono.using` to guarantee cleanup on success, error, or cancellation.
- Tool execution shutdown guard:
  - `ToolExecutor` races each tool execution against a shutdown timeout signal (`Mono.firstWithSignal`), so long-running tools are cancelled promptly when the shutdown timeout is reached.
- Ordered resource cleanup:
  - A unified `AgentScopeJvmShutdownHook` replaces the previous per-transport shutdown hooks, ensuring the shutdown order is: (1) graceful shutdown of agents → (2) await termination → (3) close HTTP transports.
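To make the lifecycle and request tracking concrete, here is a minimal sketch of how the state machine could behave. It is an illustration only: `ShutdownTrackerSketch` and its method shapes are assumptions for this example, it returns `false` where the proposal would throw `AgentShuttingDownException`, and it leaves out the Reactor integration (`Mono.using`) entirely.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-ins; not the actual AgentScope-Java classes.
enum ShutdownState { RUNNING, SHUTTING_DOWN, TERMINATED }

class ShutdownTrackerSketch {
    private final AtomicReference<ShutdownState> state =
            new AtomicReference<>(ShutdownState.RUNNING);
    private final Set<String> activeRequests = ConcurrentHashMap.newKeySet();

    /** Accepts a request only while RUNNING; returns false when it is rejected. */
    boolean registerRequest(String requestId) {
        if (state.get() != ShutdownState.RUNNING) {
            return false;
        }
        activeRequests.add(requestId);
        // Re-check: shutdown may have started between the check and the add.
        if (state.get() != ShutdownState.RUNNING) {
            activeRequests.remove(requestId);
            terminateIfDrained();
            return false;
        }
        return true;
    }

    /** Called when a request ends for any reason (success, error, cancellation). */
    void unregisterRequest(String requestId) {
        activeRequests.remove(requestId);
        terminateIfDrained();
    }

    /** RUNNING -> SHUTTING_DOWN; goes straight to TERMINATED if nothing is in flight. */
    void beginShutdown() {
        state.compareAndSet(ShutdownState.RUNNING, ShutdownState.SHUTTING_DOWN);
        terminateIfDrained();
    }

    /** What a checkpoint hook would ask after a reasoning, acting, or summary phase. */
    boolean shouldInterruptAtCheckpoint() {
        return state.get() == ShutdownState.SHUTTING_DOWN;
    }

    ShutdownState state() {
        return state.get();
    }

    private void terminateIfDrained() {
        if (activeRequests.isEmpty()) {
            state.compareAndSet(ShutdownState.SHUTTING_DOWN, ShutdownState.TERMINATED);
        }
    }
}
```

The second state check in `registerRequest` is a design choice for the sketch: it keeps a request from slipping in after shutdown has begun, at the cost of one extra read per call.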
Describe alternatives you've considered
- Application-level shutdown handling: requiring each application to implement its own shutdown logic (e.g., Spring's `@PreDestroy`, custom JVM hooks). This pushes complexity to users and doesn't benefit from framework-level agent lifecycle knowledge (safe checkpoints, session persistence).
- Simple timeout-based kill: just waiting a fixed duration before killing the process, without safe checkpoints. This wastes in-flight LLM tokens since reasoning phases would be interrupted mid-stream.
- Kubernetes preStop hook only: relying solely on Kubernetes preStop hooks to drain requests. This doesn't handle the agent-internal state (memory, reasoning progress) and has no concept of checkpoint-based interruption.
Additional context
Key components introduced:
| Class | Responsibility |
|---|---|
| `GracefulShutdownManager` | Singleton managing shutdown state machine, active request tracking, timeout enforcement |
| `GracefulShutdownConfig` | Configuration record: `shutdownTimeout`, `partialReasoningPolicy` |
| `GracefulShutdownHook` | System hook for checkpoint interruption and session resume deduplication |
| `AgentScopeJvmShutdownHook` | JVM shutdown hook with ordered cleanup |
| `ActiveRequestContext` | Per-request context with session binding and interrupt capability |
| `AgentShuttingDownException` | Exception for rejected/interrupted requests |
| `ShutdownState` | Enum: `RUNNING`, `SHUTTING_DOWN`, `TERMINATED` |
| `PartialReasoningPolicy` | Enum: `SAVE` or `DISCARD` incomplete reasoning on shutdown |
| `ShutdownInterruptedState` | Session-persisted flag for detecting interrupted sessions |
| `ShutdownSessionBinding` | Binds an agent to its session for shutdown persistence |
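As a rough illustration of the configuration record, the sketch below shows one possible shape. The field names come from the proposal; the type `Duration`, the validation, and the `isTimedOut` helper are assumptions added for this example, and the record is named `GracefulShutdownConfigSketch` to make clear it is not the real class.

```java
import java.time.Duration;
import java.util.Objects;

// Enum values as listed in the proposal.
enum PartialReasoningPolicy { SAVE, DISCARD }

// Hypothetical shape; a null shutdownTimeout means "wait indefinitely".
record GracefulShutdownConfigSketch(Duration shutdownTimeout,
                                    PartialReasoningPolicy partialReasoningPolicy) {

    GracefulShutdownConfigSketch {
        Objects.requireNonNull(partialReasoningPolicy, "partialReasoningPolicy");
        if (shutdownTimeout != null && shutdownTimeout.isNegative()) {
            throw new IllegalArgumentException("shutdownTimeout must not be negative");
        }
    }

    /** Assumed helper: true once the elapsed time has reached a finite timeout. */
    boolean isTimedOut(Duration elapsed) {
        return shutdownTimeout != null && elapsed.compareTo(shutdownTimeout) >= 0;
    }
}
```

A monitor thread could call something like `isTimedOut` on each tick to decide when to force-interrupt the remaining requests.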
Shutdown flow diagram:
SIGTERM received
    │
    ▼
AgentScopeJvmShutdownHook
    │
    ├─► GracefulShutdownManager.performGracefulShutdown()
    │     state: RUNNING → SHUTTING_DOWN
    │     ├── New requests → rejected (AgentShuttingDownException)
    │     └── In-flight requests → continue until safe checkpoint
    │
    ├─► Monitor thread (every 1s)
    │     └── If timeout reached:
    │           ├── Save all active request states to Session
    │           ├── Force interrupt all active agents
    │           └── Emit shutdownTimeoutSignal (cancels tool executions)
    │
    ├─► GracefulShutdownHook (at checkpoints)
    │     └── PostReasoning / PostActing / PostSummary
    │           └── interruptIfShuttingDown(agent)
    │
    ├─► awaitTermination(timeout)
    │     └── Block until state == TERMINATED or timeout
    │
    └─► HttpTransportFactory.shutdown()
          └── Close all HTTP transports (only after agents are done)
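The ordering in the flow above can be sketched with plain JDK primitives. This is a simplified illustration, not the proposed implementation: the step names are placeholders recorded in a list, a `CountDownLatch` stands in for the one-second monitor thread, and `OrderedShutdownSketch` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the ordered cleanup; step names are placeholders.
class OrderedShutdownSketch {
    // Records the steps in execution order so the ordering can be inspected.
    final List<String> log = new ArrayList<>();
    // Released once the last in-flight request has finished.
    private final CountDownLatch drained = new CountDownLatch(1);

    void markAllRequestsFinished() {
        drained.countDown();
    }

    void run(long timeoutMillis) {
        log.add("begin-shutdown"); // RUNNING -> SHUTTING_DOWN, new requests rejected
        boolean finished;
        try {
            // Wait for in-flight requests to reach a checkpoint and finish.
            finished = drained.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            finished = false;
        }
        if (!finished) {
            log.add("save-state");      // persist state of the remaining requests
            log.add("force-interrupt"); // then interrupt them
        }
        log.add("close-transports");    // always last, after agents are done
    }

    /** Registers the sequence with the JVM so it runs on SIGTERM. */
    void installAsJvmHook(long timeoutMillis) {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> run(timeoutMillis), "shutdown-sketch"));
    }
}
```

Keeping the transport close as the final unconditional step is the point of the sketch: whether requests drain on their own or are force-interrupted, nothing closes the HTTP layer while an agent may still be using it.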
Session recovery on retry:
Client retries after shutdown interruption
    │
    ▼
Agent.loadIfExists(session, sessionKey)
    └── bindSession() to GracefulShutdownManager
    │
    ▼
GracefulShutdownHook.onEvent(PreCallEvent)
    └── checkAndClearShutdownInterrupted()
          ├── Flag found → discard duplicate input, resume from saved memory
          └── Flag absent → normal processing
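The check-and-clear step on retry might look roughly like the following. This is a sketch under assumptions: an in-memory set stands in for the `Session` store, and `ResumeDedupSketch`, `markInterrupted`, and `inputToProcess` are hypothetical names; only `checkAndClearShutdownInterrupted` appears in the proposal.

```java
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; a real implementation would persist the flag in the Session.
class ResumeDedupSketch {
    private final Set<String> interruptedSessions = ConcurrentHashMap.newKeySet();

    /** Called when a request is interrupted during shutdown. */
    void markInterrupted(String sessionKey) {
        interruptedSessions.add(sessionKey);
    }

    /** Returns true exactly once per interruption, clearing the flag as it reads it. */
    boolean checkAndClearShutdownInterrupted(String sessionKey) {
        return interruptedSessions.remove(sessionKey);
    }

    /** Empty when the retried prompt is a duplicate and the agent should resume from memory. */
    Optional<String> inputToProcess(String sessionKey, String userInput) {
        return checkAndClearShutdownInterrupted(sessionKey)
                ? Optional.empty()
                : Optional.of(userInput);
    }
}
```

Because the flag is removed in the same operation that reads it, only the first retry after an interruption is treated as a duplicate; later requests on the same session are processed normally.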