fix(security): harden approval thread safety (TOCTOU + error handling)#2366
fix(security): harden approval thread safety (TOCTOU + error handling)#2366
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses TOCTOU races in the process_approval function by ensuring that pending approvals are taken and verified under a single lock. It also improves error handling across several thread operations by explicitly checking for missing threads and providing appropriate logging or error returns. Regression tests were added to verify the fix for the race condition and the handling of missing threads. Feedback suggests enhancing visibility into unexpected states by adding warning logs when a thread has no turns during tool result recording.
| match sess.threads.get_mut(&thread_id) { | ||
| Some(thread) => { | ||
| if let Some(turn) = thread.last_turn_mut() { | ||
| if is_tool_error { | ||
| turn.record_tool_error_for( | ||
| &pending.tool_call_id, | ||
| result_content.clone(), | ||
| ); | ||
| } else { | ||
| turn.record_tool_result_for( | ||
| &pending.tool_call_id, | ||
| serde_json::json!(result_content), | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| None => { | ||
| tracing::debug!( | ||
| %thread_id, | ||
| "Thread disappeared before tool result could be recorded" | ||
| ); | ||
| } |
There was a problem hiding this comment.
While the change correctly handles the case where the thread is missing, it still silently ignores the case where thread.last_turn_mut() returns None. Given the objective to harden error handling and avoid silent fallbacks, this unexpected state should be surfaced via tracing::warn! to aid in troubleshooting, consistent with the repository's practice for handling persistence and task failures.
match sess.threads.get_mut(&thread_id) {
Some(thread) => {
match thread.last_turn_mut() {
Some(turn) => {
if is_tool_error {
turn.record_tool_error_for(
&pending.tool_call_id,
result_content.clone(),
);
} else {
turn.record_tool_result_for(
&pending.tool_call_id,
serde_json::json!(result_content),
);
}
}
None => {
tracing::warn!(
%thread_id,
"Thread has no turns; tool result could not be recorded"
);
}
}
}
None => {
tracing::warn!(
%thread_id,
"Thread disappeared before tool result could be recorded"
);
}
}References
- Errors and unexpected states in persistence or task execution should be surfaced via tracing::warn! to ensure reliability and visibility.
| match sess.threads.get_mut(&thread_id) { | ||
| Some(thread) => { | ||
| if let Some(turn) = thread.last_turn_mut() { | ||
| if is_deferred_error { | ||
| turn.record_tool_error_for(&tc.id, deferred_content.clone()); | ||
| } else { | ||
| turn.record_tool_result_for( | ||
| &tc.id, | ||
| serde_json::json!(deferred_content), | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| None => { | ||
| tracing::debug!( | ||
| %thread_id, | ||
| tool = %tc.name, | ||
| "Thread disappeared before deferred tool result could be recorded" | ||
| ); | ||
| } |
There was a problem hiding this comment.
Similar to the tool result recording path above, the case where thread.last_turn_mut() is None is currently handled silently. Adding a warning log for this scenario ensures consistency in error visibility across the codebase and avoids silent failures during tool result recording.
match sess.threads.get_mut(&thread_id) {
Some(thread) => {
match thread.last_turn_mut() {
Some(turn) => {
if is_deferred_error {
turn.record_tool_error_for(&tc.id, deferred_content.clone());
} else {
turn.record_tool_result_for(
&tc.id,
serde_json::json!(deferred_content),
);
}
}
None => {
tracing::warn!(
%thread_id,
tool = %tc.name,
"Thread has no turns; deferred tool result could not be recorded"
);
}
}
}
None => {
tracing::warn!(
%thread_id,
tool = %tc.name,
"Thread disappeared before deferred tool result could be recorded"
);
}
}References
- Surface unexpected failures or state inconsistencies via tracing::warn! to ensure reliability and visibility, as seen in rules for audit persistence and task joining.
Consolidates two security fixes for the approval processing flow in thread_ops.rs: **TOCTOU race (#1486):** Hold session lock for the entire take-verify sequence in process_approval() so pending approval cannot be lost if a concurrent operation modifies the thread between take and restore. Previously, the lock was dropped after take_pending_approval() and re-acquired for request_id verification, creating a window where the approval could be permanently lost. **Silent error fallback (#1487):** Replace 10 silent `if let Some(thread)` patterns with explicit `match` arms. Critical paths (state transitions, deferred approval setup) return errors when threads disappear. Non-critical paths (tool result recording, auth mode, rejection) log debug messages but continue. Regression tests: - test_approval_request_id_mismatch_restores_pending - test_approval_on_missing_thread_should_error Supersedes #1591 (branch had no merge base with current staging). Closes #1486, Closes #1487 https://claude.ai/code/session_01X86EZxqXEFiU9VetyhPKjM
dd9528b to
1e93133
Compare
| None => { | ||
| tracing::debug!( | ||
| %thread_id, | ||
| "Thread disappeared before rejection could be recorded" |
There was a problem hiding this comment.
High — Rejection path returns success when thread is gone
When the thread disappears before rejection can be recorded, the function logs a debug message but still sends StatusUpdate::Status("Rejected") and returns SubmissionResult::response(rejection). The caller and UI see a successful rejection for a thread that no longer exists.
More critically, the pending approval was already take()-n at line 1181 and consumed. If the thread disappears between take and the rejection path, the approval is permanently lost with no error surfaced.
Suggested fix: The None arm for the rejection branch should return an error (like the approval-path critical sites at lines 1267-1274 and 1691-1697), e.g., SubmissionResult::error("Thread disappeared during rejection").
| None => { | ||
| tracing::debug!( | ||
| %thread_id, | ||
| "Thread disappeared before tool result could be recorded" |
There was a problem hiding this comment.
Medium — Tool result recording silently continues when thread gone
When the thread disappears before the tool result can be recorded, the code logs debug and continues. But the tool has already been executed (line 1307-1308). The context_messages still get the tool result appended (line 1389-1393), so the agentic loop may resume with context referencing a tool call whose result was never persisted in the thread.
This is partially mitigated by the error at line 1769-1770 but could cause a mismatch between LLM context and persisted state. A comment explaining why proceeding is safe would clarify intent.
| None => { | ||
| tracing::debug!( | ||
| %thread_id, | ||
| "Thread disappeared before auth intercept could be applied" |
There was a problem hiding this comment.
Medium — Orphaned SSE emission for dead thread
When the thread disappears, the None arm logs debug but execution continues past the block to emit_auth_required_status(). This sends an AuthRequired SSE event to the client for a thread that no longer exists. The client will show an auth prompt that can never be completed. The caller (process_approval) subsequently returns SubmissionResult::auth_pending() for a dead thread.
Suggested fix: Return early from handle_auth_intercept when the thread is gone, or suppress the SSE emission.
| None => { | ||
| tracing::debug!( | ||
| %thread_id, | ||
| "Thread disappeared before auth mode could be re-entered" |
There was a problem hiding this comment.
Medium — Orphaned SSE emission in process_auth_token
Two code paths in process_auth_token (the Ok(result) and Err(e) arms) re-enter auth mode. When the thread is gone, the None arm logs debug but execution continues to emit_auth_required_status(). This sends auth-required SSE events for a nonexistent thread.
Suggested fix: Add early return or skip the emit_auth_required_status call when the thread was not found.
serrrfirat
left a comment
There was a problem hiding this comment.
Paranoid Architect Review — REQUEST CHANGES
1 High, 4 Medium findings.
The core TOCTOU fix (holding the lock for take-verify) is correct and the regression test validates it well. However, the silent-error-fallback fix is inconsistently applied:
- High: The rejection path returns success to the caller when the thread is gone, permanently losing the approval with no error.
- Medium: Two
handle_auth_intercept/process_auth_tokenpaths emit SSE events for dead threads — clients get stuck in auth prompts that can never resolve. - Medium (Finding 5 — not in diff, line ~374): One
if let Some(thread)pattern at line ~374 inprocess_user_input(the Processing/queue path) was not converted, contrary to the PR's stated scope of "Replace 10 silentif let Some(thread)patterns." If the thread disappears between the state snapshot (line 344) and the mutable re-lock (line 373), the code silently falls through to theIdle | Completedmatch arm, which could incorrectly start a new turn on a removed thread. Suggested fix: Convert this to amatchwith explicitNonehandling, consistent with the other 9 sites.
The rejection path (finding #1) is the ship-blocker — it should return an error, not success, matching the approval path's behavior.
Summary
Consolidates two security fixes for the approval processing flow in
thread_ops.rs:TOCTOU race ([CRITICAL] TOCTOU race condition in approval thread resolution #1486): Hold session lock for the entire take-verify sequence in
process_approval()so pending approval cannot be lost if a concurrent operation modifies the thread between take and restore. Previously, the lock was dropped aftertake_pending_approval()and re-acquired forrequest_idverification, creating a window where the approval could be permanently lost.Silent error fallback ([HIGH] Incomplete fallback logic for non-existent approval threads #1487): Replace 10 silent
if let Some(thread)patterns with explicitmatcharms. Critical paths (state transitions, deferred approval setup) return errors when threads disappear. Non-critical paths (tool result recording, auth mode, rejection) log debug messages but continue.Files changed
src/agent/thread_ops.rs— both fixes + 2 regression testsTest plan
test_approval_request_id_mismatch_restores_pendingtest_approval_on_missing_thread_should_errorcargo checkcleanSupersedes #1591 (branch had no merge base with current staging after repo restructuring).
Closes #1486, Closes #1487
https://claude.ai/code/session_01X86EZxqXEFiU9VetyhPKjM