Commit d91edc9
fix: auto-redact tool_calls[*].function.arguments (#588)
* fix: auto-redact tool_calls[*].function.arguments
The un-redact path walked message.content / reasoning_content /
reasoning but not tool_calls[*].function.arguments, so when a model
emitted a tool call whose JSON arguments echoed the user's (now-
redacted) PII, the placeholder leaked to the client unchanged.
Caught while testing an agentic round-trip on staging:
POST /v1/chat/completions x-auto-redact: on
messages=[{user, "Send a thank-you to alice.chen@gmail.com ..."}]
tools=[{send_email(to, subject, body)}]
Provider (claude-sonnet-4-6) correctly received <email1><email2> in
its prompt, but the resulting tool_calls[0].function.arguments came
back as {"to":"<email1>",...} instead of {"to":"alice.chen@gmail.com",...}.
Fix:
- Non-streaming (unredact_chat_response_in_place): walk
message.tool_calls[*].function.arguments and apply map.unredact.
- Streaming (unredact_chunk_in_place + StreamUnredactStates): add a
per-(choice_index, tool_call_index) sliding-tail state map so that
placeholders split across streamed arguments fragments are handled
the same way as placeholders split across content fragments.
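The sliding-tail idea in the streaming bullet can be sketched roughly as below. This is a minimal illustration, not the code in this commit: `SlidingTail`, its fields, and the "hold back an unterminated `<`" heuristic are assumptions made for the example, and it does raw substitution only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one per-stream un-redact state. It holds back
/// a trailing, possibly incomplete placeholder until the next fragment.
pub struct SlidingTail {
    map: HashMap<String, String>, // placeholder -> original value
    tail: String,                 // bytes withheld from the previous fragment
}

impl SlidingTail {
    pub fn new(map: HashMap<String, String>) -> Self {
        Self { map, tail: String::new() }
    }

    /// Feed one streamed fragment; returns the text that is safe to emit now.
    pub fn push(&mut self, fragment: &str) -> String {
        let mut buf = std::mem::take(&mut self.tail);
        buf.push_str(fragment);
        // Longest known placeholder bounds how much we are willing to hold.
        let max_len = self.map.keys().map(|k| k.len()).max().unwrap_or(0);
        if let Some(pos) = buf.rfind('<') {
            let held = &buf[pos..];
            // An unterminated '<' near the end may be a split placeholder.
            let may_be_placeholder = !held.contains('>') && held.len() < max_len;
            if may_be_placeholder {
                self.tail = buf.split_off(pos);
            }
        }
        self.replace_all(&buf)
    }

    /// End of stream: hand back whatever is still held, unchanged.
    pub fn flush(&mut self) -> String {
        std::mem::take(&mut self.tail)
    }

    fn replace_all(&self, text: &str) -> String {
        let mut out = text.to_string();
        for (placeholder, original) in &self.map {
            out = out.replace(placeholder.as_str(), original);
        }
        out
    }
}
```

With a map of `<email1>` to `alice.chen@gmail.com`, pushing `{"to":"<em` emits only `{"to":"`, and the following `ail1>"}` emits the restored address. One such state per (choice_index, tool_call_index) key keeps parallel tool calls from sharing a tail.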
New e2e test auto_redact_unredacts_tool_call_arguments configures
the MockProvider to emit a send_email tool call with <email1> in
its arguments and asserts the client receives the substituted
original.
Known limitations (not addressed here; separate follow-ups):
- privacy-filter sometimes splits one PII into multiple adjacent
spans (alice.chen@gmail.com -> <email1><email2>), and a model that
treats placeholders as separate values may issue multiple tool
calls (one per placeholder). This is a privacy-filter span-merging
concern, not an un-redact bug.
- End-of-stream flush for tool_call_arguments tails is not yet
emitted as a synthetic chunk (the existing flush covers content
fields). Most tool-call args flows complete cleanly without mid-
token truncation, so deferring to a follow-up.
* fix: JSON-escape replacements + flush tool_call_arguments tail
Addresses claude-review on #588:
1. JSON corruption (blocker):
tool_calls[*].function.arguments is a JSON-encoded string.
map.unredact() did raw substring replacement, so PII originals
containing `"`, `\`, control chars, or non-ASCII would break the
surrounding JSON.
- Add RedactionMap::unredact_json_string which JSON-escapes each
replacement via serde_json::to_string + strip outer quotes.
- Add StreamUnredact::new_for_json_string variant for the streaming
path; routes substitutions through unredact_json_string.
- Non-streaming unredact_chat_response_in_place now uses the json
variant for tool_calls[*].function.arguments.
- Streaming unredact_chunk_in_place creates per-(choice_idx, tc_idx)
state via new_for_json_string.
Tests:
- placeholders::unredact_json_string_{escapes_quotes,
escapes_backslash, escapes_newline, safe_for_simple_pii}
- stream_unredact::json_string_variant_{escapes_quote_in_replacement,
escapes_across_chunk_split, no_op_for_simple_pii}
2. End-of-stream flush gap (blocker):
build_flush_chunks drained content/reasoning fields but ignored
tool_call_arguments — partial placeholders held in the tail were
silently dropped (or could leak as literal `<email1>` if the tail
contained an incomplete-but-recognizable shape).
- Extend build_flush_chunks with a second pass that drains
states.tool_call_arguments and emits a synthetic SSE chunk per
(choice_idx, tc_idx) with the held bytes as a tool_calls delta.
- Stable ordering: sort by (choice_idx, tc_idx) for deterministic
output.
- New e2e auto_redact_unredacts_tool_call_arguments_streaming
covers the streamed-args reassembly path end-to-end.
3. Nit (collision safety):
Streaming chunk handler no longer falls back to tc.index = 0 for
indexless tool calls — uses enumerate position instead, so two
parallel indexless tool calls in one delta don't collide on (idx, 0).
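The JSON-escaping fix in item 1 can be illustrated with a std-only sketch. The commit itself uses serde_json::to_string and strips the outer quotes; the hand-rolled `json_escape_fragment` below is a swapped-in stand-in so the example needs no dependencies, and `unredact_in_json_args` is a hypothetical helper name, not the `RedactionMap` method.

```rust
/// Escape `s` so it can be spliced inside an existing JSON string literal
/// (no surrounding quotes are added). Stand-in for
/// `serde_json::to_string` followed by stripping the outer quotes.
pub fn json_escape_fragment(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical helper: substitute one placeholder inside a JSON-encoded
/// arguments string, escaping the original value first.
pub fn unredact_in_json_args(args: &str, placeholder: &str, original: &str) -> String {
    args.replace(placeholder, &json_escape_fragment(original))
}
```

A raw `replace` with an original such as `Al "Ace" \ B` would close the JSON string early; escaping first keeps the arguments parseable.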
Test counts:
- services unit: 43 pass (was 38; +5 JSON-escape tests)
- e2e auto_redact: 10 pass (was 9; +1 streaming-args test)
- clippy -D warnings: clean
- cargo fmt --check: clean
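Items 2 and 3 above both hinge on the (choice_idx, tc_idx) key. A rough sketch of the key derivation and the sorted flush, assuming a simplified `ToolCallDelta` shape and hypothetical helper names (`state_key`, `hold_tails`, `drain_sorted`) that do not appear in the commit:

```rust
use std::collections::HashMap;

/// Hypothetical minimal shape of one streamed tool-call delta.
pub struct ToolCallDelta {
    pub index: Option<usize>,
    pub arguments: String,
}

/// Key for the per-tool-call state. An indexless tool call falls back to its
/// position in the delta rather than to 0, so siblings never collide.
pub fn state_key(choice_idx: usize, pos: usize, tc: &ToolCallDelta) -> (usize, usize) {
    (choice_idx, tc.index.unwrap_or(pos))
}

/// Accumulate held bytes per key (here the whole fragment, for brevity).
pub fn hold_tails(
    tails: &mut HashMap<(usize, usize), String>,
    choice_idx: usize,
    tool_calls: &[ToolCallDelta],
) {
    for (pos, tc) in tool_calls.iter().enumerate() {
        tails
            .entry(state_key(choice_idx, pos, tc))
            .or_default()
            .push_str(&tc.arguments);
    }
}

/// Drain every non-empty tail in (choice_idx, tc_idx) order so the synthetic
/// flush chunks come out deterministically despite HashMap iteration order.
pub fn drain_sorted(
    tails: &mut HashMap<(usize, usize), String>,
) -> Vec<((usize, usize), String)> {
    let mut out: Vec<_> = tails.drain().filter(|(_, held)| !held.is_empty()).collect();
    out.sort_by_key(|(key, _)| *key);
    out
}
```

Each drained entry would then be wrapped in one synthetic SSE chunk carrying the held bytes as a tool_calls delta.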
* fix: close 3 remaining auto-redact leak paths + add timeout
Surfaced by an independent subagent code review of #588. Each is a
distinct privacy regression vs. the design goal of "provider never
sees raw PII".
1. **Input tool_calls.arguments now redacted.**
In an agent loop the user resubmits the assistant's prior tool_call
as part of conversation history. `collect_text_fragments` only
walked `message.content`, so the JSON arguments string (which often
echoes the original PII verbatim) was forwarded to the provider raw
on every follow-up turn.
- New `TextRef::ToolCallArg { msg_idx, tc_idx }` variant in
`apply.rs`; `collect_text_fragments` + `write_back` extended to
cover it. The arguments string is redacted as opaque text — our
placeholders are pure ASCII (`<emailN>`) so they stay valid JSON.
- Unit tests: collect_walks_assistant_tool_call_arguments,
write_back_updates_tool_call_arguments.
- E2E test: auto_redact_redacts_input_tool_call_arguments — sends a
4-turn history with a tool_call carrying bob@example.com and
asserts the provider sees <email1>, not the raw email.
2. **Response message.refusal now un-redacted.**
Safety-tuned models may quote our placeholders back ("I can't email
<email1> per policy"). Without un-redacting `choice.message.refusal`,
the placeholder leaked to the client.
- Walk the field in `unredact_chat_response_in_place`.
- E2E test: auto_redact_unredacts_refusal_field.
- `ChatDelta` has no `refusal` field, so streaming has no
corresponding gap.
3. **Privacy-filter error logs no longer leak upstream response body.**
`PrivacyClassifyError::HttpError` carries the verbatim response.text()
from the privacy-filter. A misbehaving filter that echoes its input
in an error response would have routed customer PII straight to
application logs via `tracing::warn!(error = %e, …)`.
- Added `Self::privacy_classify_error_category(&e)` returning a
bounded `&'static str` (`unauthorized`, `rate_limited`,
`unavailable`, `server_error`, `client_error`, `http_other`,
`request_failed`).
- The "all providers failed" final error now hand-formats the
status code only, bypassing `sanitize_error_message` which would
have re-introduced the body via `Display`.
- Demoted "Privacy classify completed successfully" from info to
debug — high-volume info log on every redacted request.
4. **15-second wall-clock timeout on the redact step.**
Detector retries cascade through `pool.privacy_classify` with each
attempt bounded by `completion_timeout()` (default 600s) × multiple
providers. Without an outer bound, a hung detector could hold the
user's request hostage for tens of minutes before the 503 fires.
Auto-redact is in the critical request path; we cap it tightly.
- New `REDACT_TIMEOUT = Duration::from_secs(15)` constant.
- `redact_messages` now wraps the inner work in `tokio::time::timeout`
and maps `Elapsed` to `DetectorUnavailable`.
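For item 1 above (input tool_calls.arguments), the collect/write-back pairing can be sketched as follows. Only the `TextRef::ToolCallArg { msg_idx, tc_idx }` variant and the two function names come from this commit; the `Content` variant, the `Message` and `ToolCall` shapes, and the signatures are assumptions for illustration.

```rust
/// Address of one redactable text fragment inside the request.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TextRef {
    Content { msg_idx: usize }, // assumed variant for message.content
    ToolCallArg { msg_idx: usize, tc_idx: usize },
}

/// Hypothetical, heavily simplified request types.
pub struct ToolCall {
    pub arguments: String, // JSON-encoded string, treated as opaque text
}

pub struct Message {
    pub content: Option<String>,
    pub tool_calls: Vec<ToolCall>,
}

/// Walk every message and return each fragment with the ref needed to
/// write its redacted form back later.
pub fn collect_text_fragments(messages: &[Message]) -> Vec<(TextRef, String)> {
    let mut out = Vec::new();
    for (msg_idx, msg) in messages.iter().enumerate() {
        if let Some(text) = &msg.content {
            out.push((TextRef::Content { msg_idx }, text.clone()));
        }
        for (tc_idx, tc) in msg.tool_calls.iter().enumerate() {
            out.push((TextRef::ToolCallArg { msg_idx, tc_idx }, tc.arguments.clone()));
        }
    }
    out
}

/// Store a redacted fragment at the location its ref points to.
pub fn write_back(messages: &mut [Message], target: TextRef, redacted: String) {
    match target {
        TextRef::Content { msg_idx } => messages[msg_idx].content = Some(redacted),
        TextRef::ToolCallArg { msg_idx, tc_idx } => {
            messages[msg_idx].tool_calls[tc_idx].arguments = redacted
        }
    }
}
```

Because the placeholders are plain ASCII without quotes or backslashes, substituting them into the arguments string leaves it valid JSON, which is why no escaping step is needed on the input side.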
Also includes 8 new adversarial e2e tests in
`auto_redact_adversarial.rs` (from a separate subagent's testing
pass): empty messages, empty content, PII in system messages, PII in
multimodal content-parts arrays, repeated-PII dedup, user-supplied
placeholder collision, 512 KB body, 26 MB body rejection.
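The bounded error category from item 3 above might look roughly like this. The category strings are the ones listed in this commit; the `ClassifyError` shape and the status-code-to-category mapping are guesses made for the example, not the actual `PrivacyClassifyError` definition.

```rust
/// Hypothetical, simplified error type. The real HttpError carries the
/// upstream response body verbatim, which is why it must not be logged.
pub enum ClassifyError {
    HttpError { status: u16, body: String },
    RequestFailed(String),
}

/// Map an error to a fixed label that is safe to log. The return type is
/// `&'static str`, so no upstream text can reach the log line.
pub fn error_category(e: &ClassifyError) -> &'static str {
    match e {
        // Assumed mapping; the commit does not spell out the status codes.
        ClassifyError::HttpError { status, .. } => match *status {
            401 | 403 => "unauthorized",
            429 => "rate_limited",
            502 | 503 | 504 => "unavailable",
            500..=599 => "server_error",
            400..=499 => "client_error",
            _ => "http_other",
        },
        ClassifyError::RequestFailed(_) => "request_failed",
    }
}
```

Logging `error_category(&e)` instead of `%e` means a filter that echoes its input in an error body can no longer put PII into application logs.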
Test counts:
- services unit: 47 pass (was 43; +4 collect/write/timeout tests)
- e2e auto_redact: 12 pass (was 10; +2 new tests)
- e2e auto_redact_adversarial: 8 pass (new file)
- clippy -D warnings: clean
- cargo fmt --check: clean
1 parent 972de6c commit d91edc9
9 files changed
Lines changed: 1162 additions & 10 deletions
File tree
- crates
- api
- src/routes
- tests/e2e_all
- services/src
- auto_redact
- inference_provider_pool