Skip to content

fix(hitl): unify when-to-gate (shadow) + supersede keying across chat/voice#838

Open
swaroopvarma1 wants to merge 1 commit into
releasefrom
fix/hitl-gate-policy-unify
Open

fix(hitl): unify when-to-gate (shadow) + supersede keying across chat/voice#838
swaroopvarma1 wants to merge 1 commit into
releasefrom
fix/hitl-gate-policy-unify

Conversation

@swaroopvarma1

Copy link
Copy Markdown
Collaborator

What

The two behavior-changing slices of the HITL "one policy, two transports" unification — both chat↔voice divergences resolved for correctness + consistency, with no new machinery. (The additive deny-on-terminal slice ships separately as the chat terminal-sweep PR.)

1. When-to-gate — shadow-gating

The approval config lives on a global function. In a node that defines a same-named per-node function, the LLM calls the per-node one (it shadows the global) — which the author never gated. Chat gated it anyway by flat name; voice didn't (its wrapper gates only globals). A documented divergence.

Fix: chat's partition is now node-aware (_partition_gated_calls) — a name that resolves to a per-node function in the current node stays ungated, matching voice. The gated global is still gated in every node that doesn't shadow it, so non-shadow nodes are byte-for-byte unchanged. build_approval_map's warning now describes the consistent behavior.

2. Supersede keying

Voice's ApprovalManager keyed supersede by function_name, so two distinct parallel calls to the same function (e.g. add_to_cart(A) + add_to_cart(B)) had the second silently drop the first.

Fix: key supersede by function_name + an args fingerprint. A true re-call (same fn, same args) still supersedes — so an async re-call can't double-execute a refund — but distinct parallel calls each keep their own card. (This is the safe middle between "keep function-name keying" = drops distinct calls, and "remove supersede" = allows double-execution.)

Why these choices

The HITL policy is already half-unified (gate_call + the status vocabulary are shared). For each forked decision I picked the option that's correct, fail-safe, and lowest-complexity:

  • Shadow: don't gate what the author didn't mark, and stay consistent across channels — without node-awareness in voice (out of scope) or un-gating a real global.
  • Supersede: the args fingerprint distinguishes a re-call from distinct parallel calls — the only signal that actually separates them — so neither failure mode (drop vs double-execute) occurs.

Verification

  • tests/test_chat_gate_partition.py — shadow / non-shadow / function_name alias / empty node / ungated.
  • tests/test_approval_manager.py — updated: identical re-call supersedes; distinct parallel args do NOT (the dropped-call fix).
  • pyrefly 0 errors; full suite 447 passed; field_reference coverage green.

…/voice

Two HITL policy decisions that diverged between chat and voice — resolved for
correctness + cross-channel consistency, no new machinery.

1. WHEN-TO-GATE (shadow-gating). A gated GLOBAL function whose name is
   shadowed by a same-named per-node function diverged: chat gated it by flat
   name (gating the per-node function the author never marked), voice did not
   (its wrapper gates only globals). Make chat's partition NODE-AWARE
   (_partition_gated_calls): a name that resolves to a per-node function in
   the current node stays UNGATED — the gated global is still gated in every
   node that doesn't shadow it. Chat now matches voice. The build_approval_map
   warning is updated to describe the new, consistent behavior. Non-shadow
   nodes are byte-for-byte unchanged.

2. SUPERSEDE keying. The voice ApprovalManager keyed supersede by
   function_name, so two DISTINCT parallel calls to the same function (e.g.
   two add_to_cart with different args) had the second silently drop the
   first. Key by function_name + an args fingerprint instead: a true re-call
   (same fn, same args) still supersedes — so an async re-call can't
   double-execute — but distinct parallel calls each keep their own card.

Both are part of the HITL "one policy, two transports" unification; the
deny-on-terminal slice (chat terminal sweep) ships in its own PR.

Tests: tests/test_chat_gate_partition.py (shadow / non-shadow / alias / empty
node); tests/test_approval_manager.py updated (identical re-call supersedes;
distinct parallel args do NOT). pyrefly 0 errors; full suite 447 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@swaroopvarma1, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 58 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d624b9d-866f-498b-b4e1-7ab91c2ec85a

📥 Commits

Reviewing files that changed from the base of the PR and between 58d3fbc and 7410624.

📒 Files selected for processing (5)
  • app/ai/voice/agents/breeze_buddy/agent/approval.py
  • app/ai/voice/agents/breeze_buddy/chat/agent.py
  • app/ai/voice/agents/breeze_buddy/template/approval.py
  • tests/test_approval_manager.py
  • tests/test_chat_gate_partition.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/hitl-gate-policy-unify

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Tara-ag Tara-ag left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Files reviewed: 5
New issues: 0

Changes Overview

This PR unifies two behavior-changing aspects of HITL "one policy, two transports":

  1. Shadow-gating fix (_partition_gated_calls): Chat now correctly handles per-node function shadowing of gated globals, matching voice behavior. A per-node function with the same name as a gated global is now correctly left ungated in nodes that define it.

  2. Supersede keying fix (_dedupe_key): Voice's ApprovalManager now keys supersede by function_name + args fingerprint instead of just function_name. This fixes the bug where two distinct parallel calls to the same function (e.g., add_to_cart(A) + add_to_cart(B)) would have the second silently drop the first.

Analysis

Security: ✅ No concerns

  • No SQL queries in changed code (in-memory dict operations only)
  • No new endpoints or auth changes
  • JSON serialization uses default=str fallback consistently with existing codebase patterns

Correctness: ✅ Well-implemented

  • _dedupe_key() uses stable JSON (sort_keys=True) with defensive fallback to function-name-only on serialization errors
  • _partition_gated_calls() correctly handles both name and function_name keys for per-node functions
  • Edge cases handled: empty/missing node functions, ungated calls, function name aliases

Testing: ✅ Comprehensive

  • test_chat_gate_partition.py: 5 test cases covering shadow/non-shadow/alias/empty/ungated scenarios
  • test_approval_manager.py: Updated existing test and added new test for distinct parallel args

Documentation: ✅ Clear

  • Docstrings explain the rationale for both changes
  • Warning message in build_approval_map updated to reflect unified behavior

Verification

  • pyrefly: 0 errors (per PR description)
  • Full suite: 447 passed (per PR description)
  • No migration files modified (compliant with append-only rule)

Decision: Approve - Clean, focused fix with good test coverage and clear documentation.

Non-shadow nodes are unaffected — a gated global is still gated. This
keeps chat consistent with voice, whose wrapper gates only globals.
"""
node_fn_names = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟥 [CRITICAL — functional] The node-aware shadow logic never fires in production — the chat half of this PR is a no-op

node_fn_names only collects functions where isinstance(f, dict). But at runtime the node's functions are FlowsFunctionSchema dataclass objects, not plain dicts — the same file confirms this: _dispatch_tool_call (lines 1137-1141) does isinstance(fn, FlowsFunctionSchema) + fn.name, and _tools_schema (lines 1197-1198) does fn.to_function_schema() if isinstance(fn, FlowsFunctionSchema) else fn. FlowsFunctionSchema has no .get(). So node_fn_names is always empty, c.function_name not in node_fn_names is always True, and partitioning collapses to the old function_name in approval_map behavior.

Net: the chat/voice divergence this PR sets out to fix still exists — a gated global that is shadowed by a same-named per-node function is STILL gated on chat while voice runs it ungated. (Reproduced: with a plain-dict node gated=[]; with a FlowsFunctionSchema node gated=['issue_refund'].)

Fix: match the idiom used by the sibling helpers:

node_fn_names = {fn.name for fn in (node.get("functions") or []) if isinstance(fn, FlowsFunctionSchema)}

(Drop the function_name alias fallback — the function_namename rename happens earlier in the builder, before the schema exists.)

def test_partition_skips_gated_name_shadowed_by_per_node_function():
calls = [_call("issue_refund")]
approval_map = {"issue_refund": object()}
node = {"functions": [{"name": "issue_refund"}]} # per-node shadows it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟧 [MAJOR — functional/test] These tests pass while the production code is broken (they mask the bug above)

Every case builds node from plain dicts ({"functions": [{"name": "issue_refund"}]}), which satisfies the isinstance(f, dict) filter — so the shadow-exclusion path does run here. In production, nodes hold FlowsFunctionSchema objects, where that filter yields an empty set and the exclusion silently never applies. The suite is green (441 passed) precisely because it never exercises the production shape, so nothing catches the CRITICAL no-op on _partition_gated_calls.

Fix: add a case that constructs the node with a real FlowsFunctionSchema(name=..., description=..., properties={}, required=[], handler=...) (or just [FlowsFunctionSchema(name="issue_refund", ...)]); with the current code it will fail, and after the fix in _partition_gated_calls it will pass.

@narsimhaReddyJuspay

Copy link
Copy Markdown
Contributor

PR #838fix(hitl): unify when-to-gate (shadow) + supersede keying across chat/voice

Verdict: request changes — 1 critical + 1 major. Gates green: black/isort/autoflake/pyrefly ✅, pytest 441 passed — note that the green suite actually confirms the masking problem below (the tests pass despite the production bug).

🟥 Critical / 🟧 Major (inline comments posted)

  1. CRITICAL — _partition_gated_calls node-aware shadow logic is a no-op in production. It filters isinstance(f, dict), but node functions are FlowsFunctionSchema objects (the rest of the same file handles them as such), so node_fn_names is always empty and partitioning falls back to the old behavior. The chat/voice divergence the PR intends to fix still exists. (chat/agent.py:95)
  2. MAJOR — the new tests build nodes from plain dicts, so they pass while the code is broken. Add a case with a real FlowsFunctionSchema node. (tests/test_chat_gate_partition.py)

✅ What's good (verified)

  • Voice supersede keying is correct. _dedupe_key + _by_request: distinct-args parallel calls each keep their own card; identical re-calls supersede; the finally cleanup preserves the newer owner's _by_request entry and only pops on matching approval_id — no leak, no cross-call collision (ApprovalManager is per-call). Tests (test_manager_supersede_on_identical_recall, test_manager_no_supersede_on_distinct_parallel_args) genuinely assert the new behavior.
  • Fallback to function_name-only keying on JSON-serialization failure is a safe degradation to the old coarser behavior.
  • "Shadow" here means a per-node function shadowing a gated global's name — there's no separate "evaluate-but-don't-enforce" flag, so no off-by-one enforcement risk.
  • Chat HITL is keyed by tool_call_id in DB rows and doesn't use ApprovalManager/dedupe, so the voice dedupe change can't affect chat's resolve/supersede.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants