[Bug][Security]: SessionManager._cleanup_stale_active_sessions indiscriminately cancels healthy concurrent agent sessions

## Describe the Bug

In `core/framework/server/session_manager.py`, the `_load_worker_core` method is responsible for provisioning a worker for a session.

As part of its startup process, it calls `_cleanup_stale_active_sessions(agent_path)`, which iterates over all session directories for that agent looking for `state.json` files. For every session where `status == "active"`, it immediately sets `status` to `"cancelled"` and records the error `"Stale session: runtime restarted"`.

This logic fails to distinguish between **orphaned sessions** and **healthy sessions running concurrently** in the same process or a different worker process.

If multiple sessions for the same agent run concurrently, loading a new worker **aggressively corrupts the `state.json` of the active worker**, even though the session is still executing correctly in memory.
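The following is a minimal, self-contained sketch of the behavior described above, written only from this description and not copied from `session_manager.py`. The function name `cleanup_stale_active_sessions`, the `<sessions_dir>/<session_id>/state.json` layout, and the `demo` helper are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def cleanup_stale_active_sessions(sessions_dir: Path) -> list[str]:
    """Hypothetical reconstruction: cancel every session marked "active" on disk."""
    cancelled = []
    for state_file in sorted(sessions_dir.glob("*/state.json")):
        state = json.loads(state_file.read_text())
        if state.get("status") == "active":
            # Nothing here checks whether a live worker still owns the session.
            state["status"] = "cancelled"
            state["error"] = "Stale session: runtime restarted"
            state_file.write_text(json.dumps(state, indent=2))
            cancelled.append(state_file.parent.name)
    return cancelled


def demo() -> dict:
    """Simulate Session A running while a worker for Session B is loaded."""
    with tempfile.TemporaryDirectory() as tmp:
        sessions_dir = Path(tmp)
        session_a = sessions_dir / "session_a"
        session_a.mkdir()
        (session_a / "state.json").write_text(json.dumps({"status": "active"}))
        # Loading the worker for Session B triggers the cleanup pass.
        cleanup_stale_active_sessions(sessions_dir)
        return json.loads((session_a / "state.json").read_text())
```

Calling `demo()` returns Session A's state after the cleanup pass, which shows the `cancelled` status even though nothing stopped the session.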

## To Reproduce

Steps to reproduce the behavior:

1. Start the server:
 
   `uv run python -m framework.runner.cli serve`

2. Create **Session A** for an agent and start an active execution.

3. While **Session A** is executing, create **Session B** for the same agent.

4. Inspect the `state.json` file for **Session A**.

5. Observe that it was forcibly mutated to:

```json
{
  "status": "cancelled",
  "error": "Stale session: runtime restarted"
}
```

This happens even though the task is still executing in memory.

## Expected Behavior

The `_cleanup_stale_active_sessions` function should only clean up **sessions belonging to dead processes**, rather than unconditionally cancelling all active sessions for the agent on disk.

Possible solutions include:

* Tracking **process IDs (PID)** for active sessions.
* Using **file locks or advisory locks**.
* Maintaining a **worker registry** for active runtime processes.

Concurrent sessions for the same agent should remain **fully isolated and unaffected** by the creation of new sessions.
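As a rough illustration of the PID-tracking option, the sketch below decides whether an on-disk session may be cancelled. It assumes a hypothetical `pid` field in `state.json` and a POSIX host; `pid_is_alive` and `should_cancel` are illustrative names, not existing framework functions. On Windows, `os.kill(pid, 0)` does not perform an existence check, so a different mechanism would be needed there.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the checks without delivering a signal
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_cancel(state: dict) -> bool:
    """Cancel only active sessions whose recorded owner process is gone."""
    if state.get("status") != "active":
        return False
    owner_pid = state.get("pid")  # hypothetical field written at session start
    if not isinstance(owner_pid, int):
        # Assumption: state written before PID tracking has no provable owner,
        # so it is treated as stale. A real fix might prefer to skip it instead.
        return True
    return not pid_is_alive(owner_pid)
```

PIDs can be reused by the operating system, so a production version would probably pair the PID with a second value such as the process start time.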

## Logs

```
INFO: Marked stale session 'session_....' as cancelled for agent 'my_agent'
```

## Additional Context

This behavior breaks **multi-tenant execution guarantees** and causes **silent state corruption**.

The result is a desynchronization between:

* **In-memory execution state** (worker still running), and
* **Persistent session state** used by the UI.

This can lead to incorrect UI reporting, broken workflows, and misleading cancellation states for still-running sessions.

## Suggested Fix

1. **In-Memory Protection:** Update `_cleanup_stale_active_sessions` to skip any session IDs currently tracked in the live `SessionManager._sessions` map.

2. **PID Validation:** Store the value of `os.getpid()` in `state.json` when a session starts. Before cancelling a stale session, verify that the recorded PID is no longer running on the host system.

3. **Cross-Process Locking (Optional):** Implement a file-based lock (e.g., `session.lock`) within the session directory to prevent multiple worker instances from claiming or cleaning up the same session state simultaneously.
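One possible shape for items 1 and 3 is sketched below. It is not the framework's implementation: `hold_session_lock`, `cleanup_stale_sessions`, the `live_session_ids` parameter (standing in for the keys of `SessionManager._sessions`), and the directory layout are all assumptions. It also uses `fcntl.flock`, which is POSIX only. Here the owning worker keeps `session.lock` locked for the lifetime of the session, so a cleanup pass that cannot acquire the lock treats the session as still owned.

```python
import fcntl  # POSIX only
import json
import os
from pathlib import Path

STALE_ERROR = "Stale session: runtime restarted"


def hold_session_lock(session_dir: Path) -> int:
    """Owning worker: take an exclusive lock and keep the descriptor open."""
    fd = os.open(session_dir / "session.lock", os.O_CREAT | os.O_RDWR, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd  # the lock is released when fd is closed or the process exits


def cleanup_stale_sessions(sessions_dir: Path, live_session_ids: set[str]) -> list[str]:
    """Cancel active sessions that are neither tracked in memory nor locked."""
    cancelled = []
    for state_file in sorted(sessions_dir.glob("*/state.json")):
        session_dir = state_file.parent
        if session_dir.name in live_session_ids:
            continue  # still tracked by this process
        fd = os.open(session_dir / "session.lock", os.O_CREAT | os.O_RDWR, 0o644)
        try:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                continue  # another worker holds the lock, so the session is owned
            # The lock is held while the state is rewritten, so no worker can
            # claim the session in the middle of the update.
            state = json.loads(state_file.read_text())
            if state.get("status") == "active":
                state.update(status="cancelled", error=STALE_ERROR)
                state_file.write_text(json.dumps(state, indent=2))
                cancelled.append(session_dir.name)
        finally:
            os.close(fd)  # closing the descriptor also releases the lock
    return cancelled
```

Because the kernel drops an `flock` lock when its holder exits, a crashed worker's sessions become eligible for cleanup without any explicit unlock step.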

[Bug][Security]: SessionManager._cleanup_stale_active_sessions indiscriminately cancels healthy concurrent agent sessions #5985
