Description
Describe the Bug
In core/framework/server/session_manager.py, the _load_worker_core method is responsible for provisioning a worker for a session.
As part of its startup process, it calls _cleanup_stale_active_sessions(agent_path), which iterates over all session directories for that agent looking for state.json files. If it finds any session where status == "active", it immediately overwrites it to status = "cancelled" with the error "Stale session: runtime restarted".
This logic fails to distinguish between orphaned sessions and healthy sessions running concurrently in the same process or a different worker process.
If multiple sessions for the same agent run concurrently, loading a new worker overwrites the state.json of every other active session for that agent, even though those sessions are still executing correctly in memory.
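To make the failure mode concrete, here is a rough reconstruction of the behavior described above. It is written from the description only and is not the actual code in session_manager.py; the function name, the sessions_dir parameter, and the one-directory-per-session layout are assumptions for illustration.

```python
import json
from pathlib import Path


def cleanup_stale_active_sessions_naive(sessions_dir: Path) -> list[str]:
    """Cancel every session whose state.json says "active" (the problematic behavior)."""
    cancelled = []
    for state_file in sorted(sessions_dir.glob("*/state.json")):
        state = json.loads(state_file.read_text())
        if state.get("status") == "active":
            # Nothing here checks whether the session is still running somewhere.
            state["status"] = "cancelled"
            state["error"] = "Stale session: runtime restarted"
            state_file.write_text(json.dumps(state, indent=2))
            cancelled.append(state_file.parent.name)
    return cancelled
```

The only input to the decision is the status field on disk, so a healthy concurrent session is indistinguishable from an orphaned one.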
To Reproduce
Steps to reproduce the behavior:
- Start the server: uv run python -m framework.runner.cli serve
- Create Session A for an agent and start an active execution.
- While Session A is executing, create Session B for the same agent.
- Inspect the state.json file for Session A.
- Observe that it was forcibly mutated to:
{
"status": "cancelled",
"error": "Stale session: runtime restarted"
}
This happens even though the task is still executing in memory.
Expected Behavior
The _cleanup_stale_active_sessions function should only clean up sessions belonging to dead processes, rather than unconditionally cancelling all active sessions for the agent on disk.
Possible solutions include:
- Tracking process IDs (PID) for active sessions.
- Using file locks or advisory locks.
- Maintaining a worker registry for active runtime processes.
Concurrent sessions for the same agent should remain fully isolated and unaffected by the creation of new sessions.
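As one illustration of the advisory-lock option, a minimal sketch follows. It assumes a POSIX host (fcntl is not available on Windows) and a hypothetical lock file named session.lock inside each session directory; the helper name try_acquire_session_lock is also made up for this example.

```python
import fcntl
from pathlib import Path


def try_acquire_session_lock(session_dir: Path):
    """Try to take an exclusive advisory lock on <session_dir>/session.lock.

    Returns the open file object (keep it open to hold the lock), or None
    if some other holder already has the lock.
    """
    lock_file = open(session_dir / "session.lock", "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file
```

Under this scheme a worker would hold the lock for as long as its session runs, and the cleanup would only cancel a session whose lock it can acquire. The operating system releases the lock when the owning process exits, so a crashed worker does not leave a session locked forever.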
Logs
INFO: Marked stale session 'session_....' as cancelled for agent 'my_agent'
Additional Context
This behavior breaks multi-tenant execution guarantees and causes silent state corruption.
The result is a desynchronization between:
- In-memory execution state (worker still running), and
- Persistent session state used by the UI.
This can lead to incorrect UI reporting, broken workflows, and misleading cancellation states for still-running sessions.
Suggested Fix
- In-Memory Protection: Update _cleanup_stale_active_sessions to skip any session IDs currently tracked in the live SessionManager._sessions map.
- PID Validation: Store the os.getpid() in state.json when a session starts. Before cancelling a stale session, verify that the recorded PID is no longer running on the host system.
- Cross-Process Locking (Optional): Implement a file-based lock (e.g., session.lock) within the session directory to prevent multiple worker instances from claiming or cleaning up the same session state simultaneously.
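The first two points could be combined roughly as in the sketch below. This is not a patch against the real method: the standalone function signature, the live_session_ids parameter (standing in for the keys of SessionManager._sessions), and the "pid" key in state.json are assumptions. The liveness probe relies on os.kill(pid, 0), which is only safe on POSIX systems.

```python
import json
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for existence without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def cleanup_stale_active_sessions(sessions_dir: Path, live_session_ids: set[str]) -> list[str]:
    """Cancel only "active" sessions that no live process appears to own."""
    cancelled = []
    for state_file in sorted(sessions_dir.glob("*/state.json")):
        session_id = state_file.parent.name
        if session_id in live_session_ids:
            continue  # still tracked in memory by this process
        state = json.loads(state_file.read_text())
        if state.get("status") != "active":
            continue
        pid = state.get("pid")  # hypothetical field written at session start
        if isinstance(pid, int) and pid_is_alive(pid):
            continue  # owned by another process that is still running
        state["status"] = "cancelled"
        state["error"] = "Stale session: runtime restarted"
        state_file.write_text(json.dumps(state, indent=2))
        cancelled.append(session_id)
    return cancelled
```

Two caveats with this approach: sessions written before the PID field existed have no recorded owner and are treated as stale here, and PIDs can be reused by the operating system, so a stricter version might also record the process start time.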