fix(guardrails): scan Anthropic tool_result content blocks#29594
fix(guardrails): scan Anthropic tool_result content blocks#29594Anai-Guo wants to merge 3 commits into
Conversation
The Generic Guardrail API's _extract_input_text_and_images only read the "text" key from list content blocks, so Anthropic tool_result blocks (whose text lives under "content") were silently skipped. Tool outputs such as file reads and API responses bypassed all guardrail scanning, a PII/secret-leak gap in agentic workflows. Also handles the list form of tool_result content (blocks of type text).
Greptile SummaryThis PR fixes a gap where Anthropic
Confidence Score: 3/5The scanning fix is a genuine improvement, but a write-back mismatch means redacting guardrails will not actually update Extraction of litellm/llms/anthropic/chat/guardrail_translation/handler.py — both the new extraction logic and the existing
|
| Filename | Overview |
|---|---|
| litellm/llms/anthropic/chat/guardrail_translation/handler.py | Adds tool_result content extraction to guardrail scanning; extraction logic is correct, but the write-back path in _apply_guardrail_responses_to_input writes to ["text"] instead of ["content"], so redacting guardrails would silently fail to update the actual tool_result payload. |
Reviews (1): Last reviewed commit: "fix(guardrails): scan Anthropic tool_res..." | Re-trigger Greptile
| if content_item.get("type") == "tool_result": | ||
| tool_result_content = content_item.get("content") | ||
| if isinstance(tool_result_content, str): | ||
| texts_to_check.append(tool_result_content) | ||
| task_mappings.append((msg_idx, int(content_idx))) | ||
| elif isinstance(tool_result_content, list): | ||
| for block in tool_result_content: | ||
| if isinstance(block, dict): | ||
| block_text = block.get("text") | ||
| if block_text is not None: | ||
| texts_to_check.append(block_text) | ||
| task_mappings.append( | ||
| (msg_idx, int(content_idx)) | ||
| ) |
There was a problem hiding this comment.
Write-back mismatch for
tool_result blocks
The extraction correctly reads from content_item["content"], but the write-back in _apply_guardrail_responses_to_input (lines 307-311) always writes to content_item["text"]. For tool_result blocks, task_mappings stores (msg_idx, content_idx) pointing at the outer tool_result dict — so the write-back executes tool_result_block["text"] = guardrail_response, silently adding a spurious key while leaving tool_result_block["content"] (the actual sensitive text) untouched. Any guardrail that modifies or redacts content (rather than just raising an exception) will appear to succeed but will actually leave the original tool_result content in the payload unchanged.
| if content_item.get("type") == "tool_result": | ||
| tool_result_content = content_item.get("content") | ||
| if isinstance(tool_result_content, str): | ||
| texts_to_check.append(tool_result_content) | ||
| task_mappings.append((msg_idx, int(content_idx))) | ||
| elif isinstance(tool_result_content, list): | ||
| for block in tool_result_content: | ||
| if isinstance(block, dict): | ||
| block_text = block.get("text") | ||
| if block_text is not None: | ||
| texts_to_check.append(block_text) | ||
| task_mappings.append( | ||
| (msg_idx, int(content_idx)) | ||
| ) |
There was a problem hiding this comment.
No tests provided for the bug fix
The PR description's test plan checkboxes are all unchecked ([ ]), and there are no new test cases in tests/test_litellm/llms/anthropic/chat/guardrail_translation/test_anthropic_guardrail_handler.py covering tool_result scanning. Without automated tests for the string-content case, the list-of-blocks case, and a regression check on plain text/image blocks, a future refactor of this extraction path has no safety net.
Rule Used: What: Ensure that any PR claiming to fix an issue ... (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
PR overviewThis pull request updates Anthropic chat guardrail translation so There is one open issue remaining: the configured option to skip tool messages is not yet applied to Anthropic Open issues (1)
Fixed/addressed: 1 · PR risk: 4/10 |
|
Addressed the write-back gap flagged in review: sanitized The task mapping is now a
The output-path mapping is untouched. Net change is +31/-12 in 🤖 Generated with Claude Code |
| # outputs (file reads, API responses) bypass guardrail scanning. | ||
| # The block index is tracked so sanitized text is written back to | ||
| # the correct location (see _apply_guardrail_responses_to_input). | ||
| if content_item.get("type") == "tool_result": |
There was a problem hiding this comment.
Medium: Tool-message skip is not applied to Anthropic tool results
skip_tool_message_in_guardrail only returns early for messages whose role is tool, but Anthropic tool outputs are tool_result blocks inside a user message. With this extraction path, a caller that configured tool messages to be skipped can still have tool_result["content"] sent to apply_guardrail; skip these blocks before extracting their text when skip_tool_message is true.
Summary
Fixes #29593.
The Generic Guardrail API's
_extract_input_text_and_images()(inlitellm/llms/anthropic/chat/guardrail_translation/handler.py) only reads the"text"key when iterating list content blocks. Anthropictool_resultblocks carry their text under"content"(not"text"), so tool outputs were silently skipped and never added totexts_to_check.This means content returned by tools — file reads, API responses, database rows, etc. — bypassed all guardrails built on the Generic Guardrail API. In agentic workflows (Cursor, coding agents, MCP file reads) this is a PII/secret-leak gap: a tool result containing an API key or PII would never be scanned.
Change
In the list-content branch, after the existing
textextraction, also handletool_resultblocks:contentas a string → appended directly.contentas a list of blocks → eachtextblock's text is appended.Both forms are part of the Anthropic tool_result spec. Images and other block types are untouched. 18 lines added, no existing behavior changed.
Test plan
tool_result(stringcontent) now scans the tool output.tool_resultwhosecontentis a list of text blocks scans each block.🤖 Generated with Claude Code