Skip to content

fix(security): harden approval thread safety (TOCTOU + error handling)#2366

Open
zmanian wants to merge 1 commit intostagingfrom
fix/approval-thread-safety-v2
Open

fix(security): harden approval thread safety (TOCTOU + error handling)#2366
zmanian wants to merge 1 commit intostagingfrom
fix/approval-thread-safety-v2

Conversation

@zmanian
Copy link
Copy Markdown
Collaborator

@zmanian zmanian commented Apr 12, 2026

Summary

Consolidates two security fixes for the approval processing flow in thread_ops.rs:

  • TOCTOU race ([CRITICAL] TOCTOU race condition in approval thread resolution #1486): Hold session lock for the entire take-verify sequence in process_approval() so pending approval cannot be lost if a concurrent operation modifies the thread between take and restore. Previously, the lock was dropped after take_pending_approval() and re-acquired for request_id verification, creating a window where the approval could be permanently lost.

  • Silent error fallback ([HIGH] Incomplete fallback logic for non-existent approval threads #1487): Replace 10 silent if let Some(thread) patterns with explicit match arms. Critical paths (state transitions, deferred approval setup) return errors when threads disappear. Non-critical paths (tool result recording, auth mode, rejection) log debug messages but continue.

Files changed

  • src/agent/thread_ops.rs — both fixes + 2 regression tests

Test plan

  • Regression test: test_approval_request_id_mismatch_restores_pending
  • Regression test: test_approval_on_missing_thread_should_error
  • cargo check clean
  • Zero clippy warnings
  • CI: full test suite (local env too memory-constrained for test binary linking)

Supersedes #1591 (branch had no merge base with current staging after repo restructuring).
Closes #1486, Closes #1487

https://claude.ai/code/session_01X86EZxqXEFiU9VetyhPKjM

@github-actions github-actions bot added scope: agent Agent core (agent loop, router, scheduler) size: L 200-499 changed lines risk: medium Business logic, config, or moderate-risk modules contributor: core 20+ merged PRs labels Apr 12, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses TOCTOU races in the process_approval function by ensuring that pending approvals are taken and verified under a single lock. It also improves error handling across several thread operations by explicitly checking for missing threads and providing appropriate logging or error returns. Regression tests were added to verify the fix for the race condition and the handling of missing threads. Feedback suggests enhancing visibility into unexpected states by adding warning logs when a thread has no turns during tool result recording.

Comment on lines +1364 to 1385
match sess.threads.get_mut(&thread_id) {
Some(thread) => {
if let Some(turn) = thread.last_turn_mut() {
if is_tool_error {
turn.record_tool_error_for(
&pending.tool_call_id,
result_content.clone(),
);
} else {
turn.record_tool_result_for(
&pending.tool_call_id,
serde_json::json!(result_content),
);
}
}
}
None => {
tracing::debug!(
%thread_id,
"Thread disappeared before tool result could be recorded"
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While the change correctly handles the case where the thread is missing, it still silently ignores the case where thread.last_turn_mut() returns None. Given the objective to harden error handling and avoid silent fallbacks, this unexpected state should be surfaced via tracing::warn! to aid in troubleshooting, consistent with the repository's practice for handling persistence and task failures.

                match sess.threads.get_mut(&thread_id) {
                    Some(thread) => {
                        match thread.last_turn_mut() {
                            Some(turn) => {
                                if is_tool_error {
                                    turn.record_tool_error_for(
                                        &pending.tool_call_id,
                                        result_content.clone(),
                                    );
                                } else {
                                    turn.record_tool_result_for(
                                        &pending.tool_call_id,
                                        serde_json::json!(result_content),
                                    );
                                }
                            }
                            None => {
                                tracing::warn!(
                                    %thread_id,
                                    "Thread has no turns; tool result could not be recorded"
                                );
                            }
                        }
                    }
                    None => {
                        tracing::warn!(
                            %thread_id,
                            "Thread disappeared before tool result could be recorded"
                        );
                    }
                }
References
  1. Errors and unexpected states in persistence or task execution should be surfaced via tracing::warn! to ensure reliability and visibility.

Comment on lines +1623 to 1642
match sess.threads.get_mut(&thread_id) {
Some(thread) => {
if let Some(turn) = thread.last_turn_mut() {
if is_deferred_error {
turn.record_tool_error_for(&tc.id, deferred_content.clone());
} else {
turn.record_tool_result_for(
&tc.id,
serde_json::json!(deferred_content),
);
}
}
}
None => {
tracing::debug!(
%thread_id,
tool = %tc.name,
"Thread disappeared before deferred tool result could be recorded"
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the tool result recording path above, the case where thread.last_turn_mut() is None is currently handled silently. Adding a warning log for this scenario ensures consistency in error visibility across the codebase and avoids silent failures during tool result recording.

                    match sess.threads.get_mut(&thread_id) {
                        Some(thread) => {
                            match thread.last_turn_mut() {
                                Some(turn) => {
                                    if is_deferred_error {
                                        turn.record_tool_error_for(&tc.id, deferred_content.clone());
                                    } else {
                                        turn.record_tool_result_for(
                                            &tc.id,
                                            serde_json::json!(deferred_content),
                                        );
                                    }
                                }
                                None => {
                                    tracing::warn!(
                                        %thread_id,
                                        tool = %tc.name,
                                        "Thread has no turns; deferred tool result could not be recorded"
                                    );
                                }
                            }
                        }
                        None => {
                            tracing::warn!(
                                %thread_id,
                                tool = %tc.name,
                                "Thread disappeared before deferred tool result could be recorded"
                            );
                        }
                    }
References
  1. Surface unexpected failures or state inconsistencies via tracing::warn! to ensure reliability and visibility, as seen in rules for audit persistence and task joining.

@github-actions github-actions bot added the scope: dependencies Dependency updates label Apr 12, 2026
Consolidates two security fixes for the approval processing flow in
thread_ops.rs:

**TOCTOU race (#1486):** Hold session lock for the entire take-verify
sequence in process_approval() so pending approval cannot be lost if a
concurrent operation modifies the thread between take and restore.
Previously, the lock was dropped after take_pending_approval() and
re-acquired for request_id verification, creating a window where the
approval could be permanently lost.

**Silent error fallback (#1487):** Replace 10 silent `if let Some(thread)`
patterns with explicit `match` arms. Critical paths (state transitions,
deferred approval setup) return errors when threads disappear. Non-critical
paths (tool result recording, auth mode, rejection) log debug messages but
continue.

Regression tests:
- test_approval_request_id_mismatch_restores_pending
- test_approval_on_missing_thread_should_error

Supersedes #1591 (branch had no merge base with current staging).
Closes #1486, Closes #1487

https://claude.ai/code/session_01X86EZxqXEFiU9VetyhPKjM
None => {
tracing::debug!(
%thread_id,
"Thread disappeared before rejection could be recorded"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — Rejection path returns success when thread is gone

When the thread disappears before rejection can be recorded, the function logs a debug message but still sends StatusUpdate::Status("Rejected") and returns SubmissionResult::response(rejection). The caller and UI see a successful rejection for a thread that no longer exists.

More critically, the pending approval was already take()-n at line 1181 and consumed. If the thread disappears between take and the rejection path, the approval is permanently lost with no error surfaced.

Suggested fix: The None arm for the rejection branch should return an error (like the approval-path critical sites at lines 1267-1274 and 1691-1697), e.g., SubmissionResult::error("Thread disappeared during rejection").

None => {
tracing::debug!(
%thread_id,
"Thread disappeared before tool result could be recorded"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Tool result recording silently continues when thread gone

When the thread disappears before the tool result can be recorded, the code logs debug and continues. But the tool has already been executed (line 1307-1308). The context_messages still get the tool result appended (line 1389-1393), so the agentic loop may resume with context referencing a tool call whose result was never persisted in the thread.

This is partially mitigated by the error at line 1769-1770 but could cause a mismatch between LLM context and persisted state. A comment explaining why proceeding is safe would clarify intent.

None => {
tracing::debug!(
%thread_id,
"Thread disappeared before auth intercept could be applied"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Orphaned SSE emission for dead thread

When the thread disappears, the None arm logs debug but execution continues past the block to emit_auth_required_status(). This sends an AuthRequired SSE event to the client for a thread that no longer exists. The client will show an auth prompt that can never be completed. The caller (process_approval) subsequently returns SubmissionResult::auth_pending() for a dead thread.

Suggested fix: Return early from handle_auth_intercept when the thread is gone, or suppress the SSE emission.

None => {
tracing::debug!(
%thread_id,
"Thread disappeared before auth mode could be re-entered"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Orphaned SSE emission in process_auth_token

Two code paths in process_auth_token (the Ok(result) and Err(e) arms) re-enter auth mode. When the thread is gone, the None arm logs debug but execution continues to emit_auth_required_status(). This sends auth-required SSE events for a nonexistent thread.

Suggested fix: Add early return or skip the emit_auth_required_status call when the thread was not found.

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paranoid Architect Review — REQUEST CHANGES

1 High, 4 Medium findings.

The core TOCTOU fix (holding the lock for take-verify) is correct and the regression test validates it well. However, the silent-error-fallback fix is inconsistently applied:

  • High: The rejection path returns success to the caller when the thread is gone, permanently losing the approval with no error.
  • Medium: Two handle_auth_intercept / process_auth_token paths emit SSE events for dead threads — clients get stuck in auth prompts that can never resolve.
  • Medium (Finding 5 — not in diff, line ~374): One if let Some(thread) pattern at line ~374 in process_user_input (the Processing/queue path) was not converted, contrary to the PR's stated scope of "Replace 10 silent if let Some(thread) patterns." If the thread disappears between the state snapshot (line 344) and the mutable re-lock (line 373), the code silently falls through to the Idle | Completed match arm, which could incorrectly start a new turn on a removed thread. Suggested fix: Convert this to a match with explicit None handling, consistent with the other 9 sites.

The rejection path (finding #1) is the ship-blocker — it should return an error, not success, matching the approval path's behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: medium Business logic, config, or moderate-risk modules scope: agent Agent core (agent loop, router, scheduler) scope: dependencies Dependency updates size: L 200-499 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[HIGH] Incomplete fallback logic for non-existent approval threads [CRITICAL] TOCTOU race condition in approval thread resolution

3 participants