Skip to content

fix(guardrails): scan Anthropic tool_result content blocks#29594

Open
Anai-Guo wants to merge 3 commits into
BerriAI:mainfrom
Anai-Guo:fix/guardrail-tool-result-redaction
Open

fix(guardrails): scan Anthropic tool_result content blocks#29594
Anai-Guo wants to merge 3 commits into
BerriAI:mainfrom
Anai-Guo:fix/guardrail-tool-result-redaction

Conversation

@Anai-Guo
Copy link
Copy Markdown
Contributor

@Anai-Guo Anai-Guo commented Jun 3, 2026

Summary

Fixes #29593.

The Generic Guardrail API's _extract_input_text_and_images() (in litellm/llms/anthropic/chat/guardrail_translation/handler.py) only reads the "text" key when iterating list content blocks. Anthropic tool_result blocks carry their text under "content" (not "text"), so tool outputs were silently skipped and never added to texts_to_check.

This means content returned by tools — file reads, API responses, database rows, etc. — bypassed all guardrails built on the Generic Guardrail API. In agentic workflows (Cursor, coding agents, MCP file reads) this is a PII/secret-leak gap: a tool result containing an API key or PII would never be scanned.

Change

In the list-content branch, after the existing text extraction, also handle tool_result blocks:

  • content as a string → appended directly.
  • content as a list of blocks → each text block's text is appended.

Both forms are part of the Anthropic tool_result spec. Images and other block types are untouched. 18 lines added, no existing behavior changed.

Test plan

  • Guardrail with a tool_result (string content) now scans the tool output.
  • Guardrail with a tool_result whose content is a list of text blocks scans each block.
  • Plain text / image blocks behave exactly as before (no regression).

🤖 Generated with Claude Code

The Generic Guardrail API's _extract_input_text_and_images only read the
"text" key from list content blocks, so Anthropic tool_result blocks
(whose text lives under "content") were silently skipped. Tool outputs
such as file reads and API responses bypassed all guardrail scanning,
a PII/secret-leak gap in agentic workflows. Also handles the list form
of tool_result content (blocks of type text).
@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Jun 3, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing Anai-Guo:fix/guardrail-tool-result-redaction (2a4847e) with main (5be0797)

Open in CodSpeed

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR fixes a gap where Anthropic tool_result content blocks were silently skipped by the Generic Guardrail API's _extract_input_text_and_images(), allowing tool outputs (file reads, API responses, etc.) to bypass all guardrail scanning.

  • Adds tool_result handling in the list-content branch: both string content and list-of-text-blocks content are now extracted and appended to texts_to_check.
  • The extraction logic is correct, but the write-back in _apply_guardrail_responses_to_input always writes guardrailed text to block[\"text\"] — for tool_result blocks the actual field is block[\"content\"], so any guardrail that modifies or redacts content (rather than just blocking) will silently leave the original sensitive data in the message payload unchanged.
  • No automated tests were added for the new code path, leaving the string-content case, the list-of-blocks case, and the regression guard for plain text/images without coverage.

Confidence Score: 3/5

The scanning fix is a genuine improvement, but a write-back mismatch means redacting guardrails will not actually update tool_result content — a silent correctness failure on the modified code path.

Extraction of tool_result text for scanning is correct and closes a real gap. However, _apply_guardrail_responses_to_input unconditionally writes the guardrailed text to ["text"] on the content-list item, while tool_result blocks store their text under ["content"]. For any guardrail that rewrites content (PII redaction, etc.) the fix leaves the original sensitive text intact and adds a spurious key — a present, observable data-integrity defect on the newly added code path. No tests were added to catch this or future regressions.

litellm/llms/anthropic/chat/guardrail_translation/handler.py — both the new extraction logic and the existing _apply_guardrail_responses_to_input write-back need attention.

Important Files Changed

Filename Overview
litellm/llms/anthropic/chat/guardrail_translation/handler.py Adds tool_result content extraction to guardrail scanning; extraction logic is correct, but the write-back path in _apply_guardrail_responses_to_input writes to ["text"] instead of ["content"], so redacting guardrails would silently fail to update the actual tool_result payload.

Reviews (1): Last reviewed commit: "fix(guardrails): scan Anthropic tool_res..." | Re-trigger Greptile

Comment on lines +243 to +256
if content_item.get("type") == "tool_result":
tool_result_content = content_item.get("content")
if isinstance(tool_result_content, str):
texts_to_check.append(tool_result_content)
task_mappings.append((msg_idx, int(content_idx)))
elif isinstance(tool_result_content, list):
for block in tool_result_content:
if isinstance(block, dict):
block_text = block.get("text")
if block_text is not None:
texts_to_check.append(block_text)
task_mappings.append(
(msg_idx, int(content_idx))
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Write-back mismatch for tool_result blocks

The extraction correctly reads from content_item["content"], but the write-back in _apply_guardrail_responses_to_input (lines 307-311) always writes to content_item["text"]. For tool_result blocks, task_mappings stores (msg_idx, content_idx) pointing at the outer tool_result dict — so the write-back executes tool_result_block["text"] = guardrail_response, silently adding a spurious key while leaving tool_result_block["content"] (the actual sensitive text) untouched. Any guardrail that modifies or redacts content (rather than just raising an exception) will appear to succeed but will actually leave the original tool_result content in the payload unchanged.

Comment on lines +243 to +256
if content_item.get("type") == "tool_result":
tool_result_content = content_item.get("content")
if isinstance(tool_result_content, str):
texts_to_check.append(tool_result_content)
task_mappings.append((msg_idx, int(content_idx)))
elif isinstance(tool_result_content, list):
for block in tool_result_content:
if isinstance(block, dict):
block_text = block.get("text")
if block_text is not None:
texts_to_check.append(block_text)
task_mappings.append(
(msg_idx, int(content_idx))
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No tests provided for the bug fix

The PR description's test plan checkboxes are all unchecked ([ ]), and there are no new test cases in tests/test_litellm/llms/anthropic/chat/guardrail_translation/test_anthropic_guardrail_handler.py covering tool_result scanning. Without automated tests for the string-content case, the list-of-blocks case, and a regression check on plain text/image blocks, a future refactor of this extraction path has no safety net.

Rule Used: What: Ensure that any PR claiming to fix an issue ... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 12.50000% with 21 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ms/anthropic/chat/guardrail_translation/handler.py 12.50% 21 Missing ⚠️

📢 Thoughts on this report? Let us know!

Comment thread litellm/llms/anthropic/chat/guardrail_translation/handler.py Outdated
@veria-ai
Copy link
Copy Markdown
Contributor

veria-ai Bot commented Jun 3, 2026

PR overview

This pull request updates Anthropic chat guardrail translation so tool_result content blocks are included in the guardrail scanning path. The change is focused on handling Anthropic’s tool output representation inside user messages.

There is one open issue remaining: the configured option to skip tool messages is not yet applied to Anthropic tool_result blocks, so tool output may still be sent through guardrail processing when the caller intended to exclude it. This mainly affects deployments that rely on that skip setting, and one prior issue has already been addressed.

Open issues (1)

Fixed/addressed: 1 · PR risk: 4/10

@Anai-Guo
Copy link
Copy Markdown
Contributor Author

Anai-Guo commented Jun 3, 2026

Addressed the write-back gap flagged in review: sanitized tool_result text was previously written back to content_item["text"], but Anthropic tool_result text lives under content (string or list of blocks). That meant redactions in sanitize-mode were silently dropped for tool outputs.

The task mapping is now a (msg_idx, content_idx, block_idx) tuple so _apply_guardrail_responses_to_input can route each sanitized segment to the correct location:

  • tool_result with string content → written back to content
  • tool_result with list-of-blocks content → written back to content[block_idx]["text"]
  • regular text blocks → unchanged (["text"])

The output-path mapping is untouched. Net change is +31/-12 in handler.py.

🤖 Generated with Claude Code

# outputs (file reads, API responses) bypass guardrail scanning.
# The block index is tracked so sanitized text is written back to
# the correct location (see _apply_guardrail_responses_to_input).
if content_item.get("type") == "tool_result":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Tool-message skip is not applied to Anthropic tool results

skip_tool_message_in_guardrail only returns early for messages whose role is tool, but Anthropic tool outputs are tool_result blocks inside a user message. With this extraction path, a caller that configured tool messages to be skipped can still have tool_result["content"] sent to apply_guardrail; skip these blocks before extracting their text when skip_tool_message is true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant