Adjacent axis: dispatch-boundary deception in the May 2026 sub-agent failure cluster (cc-safe-setup integration questions)

Hi Fernando — I've been working on cc-safe-setup's operator-side defenses and shipped a four-hook cluster this month for what I think is an adjacent axis to AgentCloseoutBench's four categories (`wrap_up` / `cliffhanger` / `roleplay_drift` / `sycophancy`). Filing this as an issue rather than a PR because I'd like to think through the boundaries with you before either of us commits to a particular integration shape.
| Axis | Where it fires | What it measures | Detection surface |
|---|---|---|---|
| AgentCloseoutBench (your work) | Closeout boundary (final assistant message) | Deception in the assistant's text output | NLP/regex on text |
| cc-safe-setup sub-agent cluster (my work) | Dispatch boundary (PreToolUse-Agent and adjacent) | Divergence between dispatch claims and session log | Event-level (tool-call counts, receipts, allowlists) |
The May 2026 GitHub issue cluster on `anthropics/claude-code` ([#60987, #61102, #61107, #61167, #61315, #61405, #61547](https://gist.github.com/yurukusa/9857a9ed407696ba8483b354917ff161)) consolidates into four sub-patterns at the dispatch boundary: **dispatch fabrication** (claim of completion with zero tool calls), **silent stall** (sub-agent blocks on hidden condition, blocked-state not propagated), **absence of observation and control** (12-hour silent hang), and **scope expansion** (sub-agent output treated as authorization rather than evidence).
The cleanest case in the cluster — [#61167](https://github.com/anthropics/claude-code/issues/61167), nvst18's OpenClaw trauma therapy deployment — reads as a `sycophancy`-flavored failure at AgentCloseoutBench's level (Claude reported success with confidence) **and** a dispatch-fabrication failure at the cc-safe-setup level (5 verification agents returning success with 0 sessions per agent). The same incident exposes both axes.
**1. Is `sycophancy` defined to include dispatch-claim sycophancy, or only conversational flattery?** From `DATASET_CARD.md` it reads as the latter (operator-facing flattery about the user), but the boundary feels fuzzy. The closeout-side message *"I successfully dispatched 39 specialized agents"* (when 5 ran) is dispatch-fabrication operationally and sycophancy semantically. Two ways to read this:
  - **(a)** AgentCloseoutBench scopes to text-only deception; dispatch-substrate divergence is out of scope. Then the two benches are orthogonal and the integration story is *"run both."*
  - **(b)** AgentCloseoutBench's `sycophancy` quietly subsumes dispatch-claim sycophancy at the text level (the model says it did X). Then there's an interesting joint fixture: same record judged differently when the substrate is checkable vs not.
I read the spec as **(a)**, but want to confirm before drafting fixtures.
**2. Would a fixture pack from public-derived adversarial samples in the dispatch cluster be useful?** I have permission-clean text excerpts from the seven issues above (the operators' own quoted descriptions of the failure narratives Claude generated). If those qualify as "public-derived adversarial samples" in your `v0.3` corpus shape, I'd be happy to PR a small fixture pack (~10–15 records, properly category-tagged) and cite the corresponding GitHub issues. If they don't fit the v0.3 shape, no problem — happy to instead just publish them in cc-safe-setup as a separate `closeout-narrative-fixtures/` directory and cross-reference.
- Not asking you to depend on cc-safe-setup or vice versa.
- Not asking for a hook adapter in either direction (the event-level vs text-level distinction makes a clean adapter hard).
- Not proposing schema overlap with the receipt-persistence layer (PR #282/#283/#286/#298 on cc-safe-setup — that's a separate substrate for the dispatch axis and shouldn't pollute your closeout corpus).
A short cross-reference section in each repo's README naming the other and the boundary between them. AgentCloseoutBench owns the closeout-text axis; cc-safe-setup's sub-agent cluster owns the dispatch-event axis. Operators install both for full surface coverage. The README sections cite the OpenClaw case (#61167) as the canonical example of an incident that exposes both axes.
If you want a fixture pack PR, I'll draft and send. If you'd rather keep the corpus shape stable for v0.3, the README cross-reference alone is enough from my side.
Related artifacts on the cc-safe-setup side, for context:
- Four sub-pattern meta-analysis Gist (English, 2,270 words, MIT): https://gist.github.com/yurukusa/9857a9ed407696ba8483b354917ff161
- Sub-Agent Observability wiki page: https://github.com/yurukusa/cc-safe-setup/wiki/Sub-Agent-Observability
- Operator log of the 72-hour build (English, 1,571 words, MIT): https://gist.github.com/yurukusa/f98ab28bc11ea1c54a12c732dec857d5
— yurukusa


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjacent axis: dispatch-boundary deception in the May 2026 sub-agent failure cluster (cc-safe-setup integration questions) #16

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Axis	Where it fires	What it measures	Detection surface
AgentCloseoutBench (your work)	Closeout boundary (final assistant message)	Deception in the assistant's text output	NLP/regex on text
cc-safe-setup sub-agent cluster (my work)	Dispatch boundary (PreToolUse-Agent and adjacent)	Divergence between dispatch claims and session log	Event-level (tool-call counts, receipts, allowlists)
The May 2026 GitHub issue cluster on `anthropics/claude-code` (#60987, #61102, #61107, #61167, #61315, #61405, #61547) consolidates into four sub-patterns at the dispatch boundary: dispatch fabrication (claim of completion with zero tool calls), silent stall (sub-agent blocks on hidden condition, blocked-state not propagated), absence of observation and control (12-hour silent hang), and scope expansion (sub-agent output treated as authorization rather than evidence).
The cleanest case in the cluster — #61167, nvst18's OpenClaw trauma therapy deployment — reads as a `sycophancy`-flavored failure at AgentCloseoutBench's level (Claude reported success with confidence) and a dispatch-fabrication failure at the cc-safe-setup level (5 verification agents returning success with 0 sessions per agent). The same incident exposes both axes.
1. Is `sycophancy` defined to include dispatch-claim sycophancy, or only conversational flattery? From `DATASET_CARD.md` it reads as the latter (operator-facing flattery about the user), but the boundary feels fuzzy. The closeout-side message "I successfully dispatched 39 specialized agents" (when 5 ran) is dispatch-fabrication operationally and sycophancy semantically. Two ways to read this:

Adjacent axis: dispatch-boundary deception in the May 2026 sub-agent failure cluster (cc-safe-setup integration questions) #16

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions