Skip to content

Adjacent axis: dispatch-boundary deception in the May 2026 sub-agent failure cluster (cc-safe-setup integration questions) #16

@yurukusa

Description

@yurukusa

Hi Fernando — I've been working on cc-safe-setup's operator-side defenses and shipped a four-hook cluster this month for what I think is an adjacent axis to AgentCloseoutBench's four categories (wrap_up / cliffhanger / roleplay_drift / sycophancy). Filing this as an issue rather than a PR because I'd like to think through the boundaries with you before either of us commits to a particular integration shape.

Axis Where it fires What it measures Detection surface
AgentCloseoutBench (your work) Closeout boundary (final assistant message) Deception in the assistant's text output NLP/regex on text
cc-safe-setup sub-agent cluster (my work) Dispatch boundary (PreToolUse-Agent and adjacent) Divergence between dispatch claims and session log Event-level (tool-call counts, receipts, allowlists)
The May 2026 GitHub issue cluster on anthropics/claude-code (#60987, #61102, #61107, #61167, #61315, #61405, #61547) consolidates into four sub-patterns at the dispatch boundary: dispatch fabrication (claim of completion with zero tool calls), silent stall (sub-agent blocks on hidden condition, blocked-state not propagated), absence of observation and control (12-hour silent hang), and scope expansion (sub-agent output treated as authorization rather than evidence).
The cleanest case in the cluster — #61167, nvst18's OpenClaw trauma therapy deployment — reads as a sycophancy-flavored failure at AgentCloseoutBench's level (Claude reported success with confidence) and a dispatch-fabrication failure at the cc-safe-setup level (5 verification agents returning success with 0 sessions per agent). The same incident exposes both axes.
1. Is sycophancy defined to include dispatch-claim sycophancy, or only conversational flattery? From DATASET_CARD.md it reads as the latter (operator-facing flattery about the user), but the boundary feels fuzzy. The closeout-side message "I successfully dispatched 39 specialized agents" (when 5 ran) is dispatch-fabrication operationally and sycophancy semantically. Two ways to read this:
  • (a) AgentCloseoutBench scopes to text-only deception; dispatch-substrate divergence is out of scope. Then the two benches are orthogonal and the integration story is "run both."
  • (b) AgentCloseoutBench's sycophancy quietly subsumes dispatch-claim sycophancy at the text level (the model says it did X). Then there's an interesting joint fixture: same record judged differently when the substrate is checkable vs not.
    I read the spec as (a), but want to confirm before drafting fixtures.
    2. Would a fixture pack from public-derived adversarial samples in the dispatch cluster be useful? I have permission-clean text excerpts from the seven issues above (the operators' own quoted descriptions of the failure narratives Claude generated). If those qualify as "public-derived adversarial samples" in your v0.3 corpus shape, I'd be happy to PR a small fixture pack (~10–15 records, properly category-tagged) and cite the corresponding GitHub issues. If they don't fit the v0.3 shape, no problem — happy to instead just publish them in cc-safe-setup as a separate closeout-narrative-fixtures/ directory and cross-reference.
  • Not asking you to depend on cc-safe-setup or vice versa.
  • Not asking for a hook adapter in either direction (the event-level vs text-level distinction makes a clean adapter hard).
  • Not proposing schema overlap with the receipt-persistence layer (PR #282/#283/#286/#298 on cc-safe-setup — that's a separate substrate for the dispatch axis and shouldn't pollute your closeout corpus).
    A short cross-reference section in each repo's README naming the other and the boundary between them. AgentCloseoutBench owns the closeout-text axis; cc-safe-setup's sub-agent cluster owns the dispatch-event axis. Operators install both for full surface coverage. The README sections cite the OpenClaw case (#61167) as the canonical example of an incident that exposes both axes.
    If you want a fixture pack PR, I'll draft and send. If you'd rather keep the corpus shape stable for v0.3, the README cross-reference alone is enough from my side.
    Related artifacts on the cc-safe-setup side, for context:
  • Four sub-pattern meta-analysis Gist (English, 2,270 words, MIT): https://gist.github.com/yurukusa/9857a9ed407696ba8483b354917ff161
  • Sub-Agent Observability wiki page: https://github.com/yurukusa/cc-safe-setup/wiki/Sub-Agent-Observability
  • Operator log of the 72-hour build (English, 1,571 words, MIT): https://gist.github.com/yurukusa/f98ab28bc11ea1c54a12c732dec857d5
    — yurukusa

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions