Skip to content

[Bug][Security]: SessionManager._cleanup_stale_active_sessions indiscriminately cancels healthy concurrent agent sessions #5985

@navyabijoy

Description

@navyabijoy

Describe the Bug

In core/framework/server/session_manager.py, the _load_worker_core method is responsible for provisioning a worker for a session.

As part of its startup process, it calls _cleanup_stale_active_sessions(agent_path), which iterates over all session directories for that agent looking for state.json files. If it finds any session where status == "active", it immediately overwrites it to status = "cancelled" with the error "Stale session: runtime restarted".

This logic fails to distinguish between orphaned sessions and healthy sessions running concurrently in the same process or a different worker process.

If multiple sessions for the same agent run concurrently, loading a new worker aggressively corrupts the state.json of the active worker, even though the session is still executing correctly in memory.

To Reproduce

Steps to reproduce the behavior:

  1. Start the server:

    uv run python -m framework.runner.cli serve

  2. Create Session A for an agent and start an active execution.

  3. While Session A is executing, create Session B for the same agent.

  4. Inspect the state.json file for Session A.

  5. Observe that it was forcibly mutated to:

{
  "status": "cancelled",
  "error": "Stale session: runtime restarted"
}

Even though the task is still actively executing in memory.

Expected Behavior

The _cleanup_stale_active_sessions function should only clean up sessions belonging to dead processes, rather than unconditionally cancelling all active sessions for the agent on disk.

Possible solutions include:

  • Tracking process IDs (PID) for active sessions.
  • Using file locks or advisory locks.
  • Maintaining a worker registry for active runtime processes.

Concurrent sessions for the same agent should remain fully isolated and unaffected by the creation of new sessions.

Logs

INFO: Marked stale session 'session_....' as cancelled for agent 'my_agent'

Additional Context

This behavior breaks multi-tenant execution guarantees and causes silent state corruption.

The result is a desynchronization between:

  • In-memory execution state (worker still running), and
  • Persistent session state used by the UI.

This can lead to incorrect UI reporting, broken workflows, and misleading cancellation states for still-running sessions.

Suggested Fix

  1. In-Memory Protection: Update _cleanup_stale_active_sessions to skip any session IDs currently tracked in the live SessionManager._sessions map.

  2. PID Validation: Store the os.getpid() in state.json when a session starts. Before cancelling a stale session, verify that the recorded PID is no longer running on the host system.

  3. Cross-Process Locking (Optional): Implement a file-based lock (e.g., session.lock) within the session directory to prevent multiple worker instances from claiming or cleaning up the same session state simultaneously.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions