
Commit 9ce9e10

garrytan and claude committed
test: spell out AskUserQuestion everywhere instead of AUQ
Per user feedback: don't shorten AskUserQuestion to AUQ — the abbreviation reads as cryptic. Apply across all the new code from this branch:

- Rename test/skill-e2e-auq-format-compliance.test.ts → test/skill-e2e-ask-user-question-format-compliance.test.ts
- Touchfile entry auq-format-pty → ask-user-question-format-pty (touchfiles.ts + matching assertion in touchfiles.test.ts)
- Function rename navigateToModeAuq → navigateToModeAskUserQuestion
- Variable auqVisible → askUserQuestionVisible
- Outcome literal 'real_auq' → 'real_question'
- All comments + JSDoc + CHANGELOG entry write AskUserQuestion in full
- "AUQs" plural → "AskUserQuestions"

No behavior change. 49/49 free tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 43de088 commit 9ce9e10

9 files changed

Lines changed: 53 additions & 53 deletions

CHANGELOG.md

Lines changed: 7 additions & 7 deletions
@@ -2,9 +2,9 @@

  ## [1.15.0.0] - 2026-04-26

- ## **Skill prompts get a 25% haircut. Plan-mode E2E coverage doubles, and AUQ rendering is now testable.**
+ ## **Skill prompts get a 25% haircut. Plan-mode E2E coverage doubles, and AskUserQuestion rendering is now testable.**

- Three pieces of work in one release. First, every preamble resolver got compressed: 18 resolvers (Voice, Writing Style, AskUserQuestion Format, Completeness Principle, Plan Mode Info, Brain Sync, Routing Injection, and 11 more) lost a third of their prose without losing a single semantic rule. The full corpus of generated `SKILL.md` files dropped from 3.08 MB to 2.30 MB across 47 outputs. Second, the 5 plan-mode E2E tests added in v1.11.1.0 and rewritten in v1.12.1.0 turned out to have never actually passed — the SDK harness they used couldn't observe Claude's plan-mode confirmation UI. This release ships a real-PTY harness that drives the actual `claude` binary, watches the rendered terminal, and gets all 5 to green. Third, on top of that harness, 6 new E2E tests cover behaviors no test could reach before: AUQ format compliance, plan-design UI-scope detection (positive path), tool-budget regression, /ship idempotency end-to-end, /plan-ceo answer-routing, and /autoplan phase ordering.
+ Three pieces of work in one release. First, every preamble resolver got compressed: 18 resolvers (Voice, Writing Style, AskUserQuestion Format, Completeness Principle, Plan Mode Info, Brain Sync, Routing Injection, and 11 more) lost a third of their prose without losing a single semantic rule. The full corpus of generated `SKILL.md` files dropped from 3.08 MB to 2.30 MB across 47 outputs. Second, the 5 plan-mode E2E tests added in v1.11.1.0 and rewritten in v1.12.1.0 turned out to have never actually passed — the SDK harness they used couldn't observe Claude's plan-mode confirmation UI. This release ships a real-PTY harness that drives the actual `claude` binary, watches the rendered terminal, and gets all 5 to green. Third, on top of that harness, 6 new E2E tests cover behaviors no test could reach before: AskUserQuestion format compliance, plan-design UI-scope detection (positive path), tool-budget regression, /ship idempotency end-to-end, /plan-ceo answer-routing, and /autoplan phase ordering.

  ### The numbers that matter

@@ -18,7 +18,7 @@ Token-level reduction comes from regenerating every `SKILL.md` against the slim
  | Plan-mode E2E tests passing | 0/5 | 5/5 | +5 |
  | Plan-mode E2E wall time | ∞ (never green) | 790 s (sequential) | proven |
  | Real-PTY E2E test count | 5 | 11 | +6 |
- | Gate-tier paid E2E added | 0 | 3 | auq-format, design-with-ui, budget-regression |
+ | Gate-tier paid E2E added | 0 | 3 | ask-user-question-format, design-with-ui, budget-regression |
  | Periodic-tier paid E2E added | 0 | 3 | mode-routing, ship-idempotency, autoplan-chain |
  | New helper unit tests | 0 | 23 | parser + budget regression coverage |

@@ -31,7 +31,7 @@ The biggest wins are the tier-≥3 plan reviews that load full preamble surface

  ### What this means for builders

- Faster every-skill startup, cheaper prompt-cache pricing on cold reads, more headroom inside the 200K context window for actual work. The plan-mode E2E tests now actually verify the skill doesn't silently write a plan file when `/plan-ceo-review` runs in plan mode. And the 3 new gate-tier tests catch a class of regression that was previously invisible: AUQ format drift (`Recommendation:` line missing), UI-scope misdetection (positive path), and tool-call budget bloat (a skill burning 3× the tools it used to). Run `bun run gen:skill-docs --host all` after pulling. The 11 plan-mode tests will run in CI on the next gate-tier eval pass.
+ Faster every-skill startup, cheaper prompt-cache pricing on cold reads, more headroom inside the 200K context window for actual work. The plan-mode E2E tests now actually verify the skill doesn't silently write a plan file when `/plan-ceo-review` runs in plan mode. And the 3 new gate-tier tests catch a class of regression that was previously invisible: AskUserQuestion format drift (`Recommendation:` line missing), UI-scope misdetection (positive path), and tool-call budget bloat (a skill burning 3× the tools it used to). Run `bun run gen:skill-docs --host all` after pulling. The 11 plan-mode tests will run in CI on the next gate-tier eval pass.

  ### Itemized changes

@@ -41,10 +41,10 @@ Faster every-skill startup, cheaper prompt-cache pricing on cold reads, more hea
  - `parseNumberedOptions(visible)` and `isPermissionDialogVisible(visible)` helpers in `claude-pty-runner.ts`. Tests can now look up an option index by its label without hard-coding positions, and auto-grant Claude Code's file-edit / workspace-trust / bash-permission dialogs that fire during preamble side-effects.
  - `findBudgetRegressions()` and `assertNoBudgetRegression()` in `test/helpers/eval-store.ts`. Pure functions returning tests that grew >2× in tools or turns vs the prior eval run, with floors at 5 prior tools / 3 prior turns to avoid noise. Env override `GSTACK_BUDGET_RATIO`.
  - 6 new real-PTY E2E tests on the harness:
-   - `skill-e2e-auq-format-compliance.test.ts` (gate, ~$0.50/run): asserts every gstack `AskUserQuestion` rendering contains the 7 mandated format elements (ELI10, Recommendation, Pros/Cons with ✅/❌, Net, `(recommended)` label).
+   - `skill-e2e-ask-user-question-format-compliance.test.ts` (gate, ~$0.50/run): asserts every gstack `AskUserQuestion` rendering contains the 7 mandated format elements (ELI10, Recommendation, Pros/Cons with ✅/❌, Net, `(recommended)` label).
    - `skill-e2e-plan-design-with-ui.test.ts` (gate, ~$0.80/run): positive coverage for `/plan-design-review` UI-scope detection. Counterpart to the existing no-UI early-exit test — without it, a regression that flips the detector to "early-exit always" would ship undetected.
    - `skill-budget-regression.test.ts` (gate, free): branch-scoped library-only assertion that no skill burns >2× tools or turns vs its prior recorded run.
-   - `skill-e2e-plan-ceo-mode-routing.test.ts` (periodic, ~$3/run): verifies AUQ answer routing — HOLD SCOPE picks routes to rigor language, SCOPE EXPANSION picks route to expansion language.
+   - `skill-e2e-plan-ceo-mode-routing.test.ts` (periodic, ~$3/run): verifies AskUserQuestion answer routing — HOLD SCOPE picks routes to rigor language, SCOPE EXPANSION picks route to expansion language.
    - `skill-e2e-ship-idempotency.test.ts` (periodic, ~$3/run): runs `/ship` end-to-end against a real git fixture with `STATE: ALREADY_BUMPED` baked in; asserts no double-bump, no double-commit, no fixture mutation.
    - `skill-e2e-autoplan-chain.test.ts` (periodic, ~$8/run): asserts `/autoplan` phase ordering by tee'ing timestamps as each `**Phase N complete.**` marker appears.
  - `test/helpers-unit.test.ts`: 23 unit tests covering `parseNumberedOptions` edge cases (empty, partial paint, >9 options, stale-vs-fresh anchoring) and `findBudgetRegressions` (noise floor, env override, missing tool data).
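
Reviewer note: the budget-regression rule described in the changelog entry above (growth of more than 2× in tools or turns, with noise floors at 5 prior tools and 3 prior turns) can be sketched roughly as follows. This is not the code in `test/helpers/eval-store.ts`, which this diff does not show. The `RunBudget` and `BudgetRegression` shapes, the function name, and the choice to treat the floors as inclusive are all assumptions made for illustration. The `GSTACK_BUDGET_RATIO` override is modelled as a plain `ratio` parameter rather than an environment lookup.

```typescript
// Hypothetical shapes: the real eval-store types are not part of this diff.
interface RunBudget {
  name: string;
  tools: number;
  turns: number;
}

interface BudgetRegression {
  name: string;
  metric: 'tools' | 'turns';
  prior: number;
  current: number;
}

// Noise floors from the changelog text; inclusive comparison is an assumption.
const TOOL_FLOOR = 5;
const TURN_FLOOR = 3;

// Returns one entry per (test, metric) pair that grew by more than `ratio`
// relative to the prior run. Tests without a prior run are skipped.
function findBudgetRegressionsSketch(
  prior: RunBudget[],
  current: RunBudget[],
  ratio: number = 2,
): BudgetRegression[] {
  const priorByName = new Map<string, RunBudget>();
  for (const p of prior) priorByName.set(p.name, p);

  const regressions: BudgetRegression[] = [];
  for (const cur of current) {
    const prev = priorByName.get(cur.name);
    if (!prev) continue;
    if (prev.tools >= TOOL_FLOOR && cur.tools > prev.tools * ratio) {
      regressions.push({ name: cur.name, metric: 'tools', prior: prev.tools, current: cur.tools });
    }
    if (prev.turns >= TURN_FLOOR && cur.turns > prev.turns * ratio) {
      regressions.push({ name: cur.name, metric: 'turns', prior: prev.turns, current: cur.turns });
    }
  }
  return regressions;
}
```

Keeping the function pure (inputs in, list out) is what lets the changelog describe the gate test as a free, library-only assertion: no model call is needed to evaluate it.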
@@ -72,7 +72,7 @@ Faster every-skill startup, cheaper prompt-cache pricing on cold reads, more hea

  #### For contributors

- - `test/helpers/touchfiles.ts`: 5 plan-mode test selections + e2e-harness-audit selection now point at `claude-pty-runner.ts` instead of the deleted helper. 6 new entries (`auq-format-pty`, `plan-ceo-mode-routing`, `plan-design-with-ui-scope`, `budget-regression-pty`, `ship-idempotency-pty`, `autoplan-chain-pty`) with tier classifications: 3 gate, 3 periodic.
+ - `test/helpers/touchfiles.ts`: 5 plan-mode test selections + e2e-harness-audit selection now point at `claude-pty-runner.ts` instead of the deleted helper. 6 new entries (`ask-user-question-format-pty`, `plan-ceo-mode-routing`, `plan-design-with-ui-scope`, `budget-regression-pty`, `ship-idempotency-pty`, `autoplan-chain-pty`) with tier classifications: 3 gate, 3 periodic.
  - `test/e2e-harness-audit.test.ts`: recognizes `runPlanSkillObservation` as a valid coverage path alongside the legacy `canUseTool` / `runPlanModeSkillTest` patterns.
  - New unit test: `test/gen-skill-docs.test.ts` asserts plan-review preambles stay under 33 KB and the slim Voice section preserves its load-bearing semantic contract (lead-with-the-point, name-the-file, user-outcome framing, no-corporate, no-AI-vocab, user-sovereignty).
  - `test/touchfiles.test.ts`: skill-specific change selection count updated 15 → 18 to match the 6 new touchfile entries that depend on `plan-ceo-review/**`.

test/helpers-unit.test.ts

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@
  *
  * - parseNumberedOptions(visible)
  * Parses `❯ 1.` / ` 2.` numbered-option lines out of TTY text.
- * Used by the AUQ format-compliance and mode-routing tests to look
+ * Used by the AskUserQuestion format-compliance and mode-routing tests to look
  * up an option index by its label without hard-coding positions.
  *
  * - findBudgetRegressions / assertNoBudgetRegression(comparison)
@@ -117,7 +117,7 @@ describe('parseNumberedOptions', () => {

  test('anchors on LAST cursor when both stale and fresh fit in the tail', () => {
    // Both lists fit in the same 4KB tail (small buffer). The granted
-   // permission dialog options come first, the real AUQ comes second.
+   // permission dialog options come first, the real AskUserQuestion comes second.
    // We must return the FRESH options, not the STALE ones.
    const visible = [
      '❯ 1. STALE_grant',

test/helpers/claude-pty-runner.ts

Lines changed: 3 additions & 3 deletions
@@ -143,7 +143,7 @@ export function isPlanReadyVisible(visible: string): boolean {
  * option list (so isNumberedOptionListVisible matches them) but they
  * are NOT a skill's AskUserQuestion — they're claude asking the user
  * whether to grant a tool/file permission. Tests that look for skill
- * AUQs must explicitly skip these.
+ * AskUserQuestions must explicitly skip these.
  *
  * Both English phrases below are stable across recent Claude Code
  * versions. The check is permissive on whitespace because TTY rendering
@@ -206,13 +206,13 @@ export function parseNumberedOptions(
  // visually reads "1. Option" can come through as "1.Option".
  const optionRe = /^[\s]*([1-9])\.\s*(\S.*?)\s*$/;
  // We anchor on the LATEST `❯ 1.` line in the buffer — the cursor marker
- // for the active AUQ. Older numbered lists (e.g., a granted permission
+ // for the active AskUserQuestion. Older numbered lists (e.g., a granted permission
  // dialog still in scrollback) sit above it and must be ignored. Without
  // this, parseNumberedOptions returns stale options after the dialog is
  // dismissed.
  const lines = tail.split('\n');
  // Anchor on the LAST `❯ 1.` line (cursor is on option 1 of the active
- // AskUserQuestion placeholder
+ // AskUserQuestion). Greedy character classes don't help here — we need a literal
  // `❯` after optional leading whitespace.
  let cursorLineIdx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
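
Reviewer note: the hunk above cuts off at the start of the backwards scan, so here is a minimal, self-contained sketch of the "anchor on the last `❯ 1.` line" idea the comments describe. It reuses the `optionRe` pattern shown in the diff, but everything else is an assumption: the function name, the `NumberedOption` shape, the cursor regex, and the decision to simply skip non-option lines after the anchor. The real `parseNumberedOptions` may differ on all of these.

```typescript
// Hypothetical result shape; the real return type is not visible in this hunk.
interface NumberedOption {
  index: number;
  label: string;
}

function parseOptionsAfterLastCursor(visible: string): NumberedOption[] {
  const lines = visible.split('\n');

  // Scan backwards for the most recent line where the cursor sits on option 1.
  // Anything above it (for example a dismissed permission dialog) is stale.
  const cursorRe = /^\s*❯\s*1\./; // assumed cursor pattern
  let cursorLineIdx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (cursorRe.test(lines[i] ?? '')) {
      cursorLineIdx = i;
      break;
    }
  }
  if (cursorLineIdx < 0) return [];

  // Same option pattern as in the diff: tolerates "1.Option" without a space.
  const optionRe = /^[\s]*([1-9])\.\s*(\S.*?)\s*$/;
  const options: NumberedOption[] = [];
  for (const raw of lines.slice(cursorLineIdx)) {
    // Drop the cursor glyph so the anchored line parses like any other option.
    const line = raw.replace(/^\s*❯\s*/, '');
    const m = optionRe.exec(line);
    if (m) {
      options.push({ index: Number(m[1] ?? ''), label: m[2] ?? '' });
    }
  }
  return options;
}
```

With input shaped like the unit test in `test/helpers-unit.test.ts` (a stale list followed by a fresh one), only the fresh options come back, because the scan never looks above the last cursor line.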

test/helpers/touchfiles.ts

Lines changed: 3 additions & 3 deletions
@@ -96,7 +96,7 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  // Real-PTY E2E batch (#6 new tests on the harness).
  // Each one tests behavior the SDK harness can't observe (rendered TTY,
  // numbered-option lists, multi-phase ordering, idempotency state echo).
- 'auq-format-pty': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
+ 'ask-user-question-format-pty': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble/generate-completeness-section.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
  'plan-ceo-mode-routing': ['plan-ceo-review/**', 'scripts/resolvers/preamble/generate-ask-user-format.ts', 'scripts/resolvers/preamble.ts', 'test/helpers/claude-pty-runner.ts'],
  'plan-design-with-ui-scope': ['plan-design-review/**', 'test/fixtures/plans/ui-heavy-feature.md', 'test/helpers/claude-pty-runner.ts'],
  'budget-regression-pty': ['test/helpers/eval-store.ts', 'test/skill-budget-regression.test.ts'],
@@ -351,8 +351,8 @@ export const E2E_TIERS: Record<string, 'gate' | 'periodic'> = {
  // Real-PTY E2E batch — tier classification:
  // gate: cheap, deterministic, run on every PR
  // periodic: long-running or expensive (>$3/run), run weekly
- 'auq-format-pty': 'gate', // ~$0.50/run, single skill probe
- 'plan-ceo-mode-routing': 'periodic', // ~$3/run, deep navigation through 8-12 prior AUQs
+ 'ask-user-question-format-pty': 'gate', // ~$0.50/run, single skill probe
+ 'plan-ceo-mode-routing': 'periodic', // ~$3/run, deep navigation through 8-12 prior AskUserQuestions
  'plan-design-with-ui-scope': 'gate', // ~$0.80/run
  'budget-regression-pty': 'gate', // free, library-only assertion
  'ship-idempotency-pty': 'periodic', // ~$3/run, real /ship in plan mode
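
Reviewer note: for readers unfamiliar with the touchfile mechanism, the two maps in this file pair each test key with the paths that should trigger it and with a tier. The sketch below shows one plausible way such maps could drive test selection. The selection function, the matcher, and the shortened pattern lists are all hypothetical. The repo's actual selection logic and glob semantics are not shown in this diff, and the matcher here only understands a trailing `/**` and exact paths.

```typescript
type Tier = 'gate' | 'periodic';

// Three keys taken from the diff, with pattern lists shortened for illustration.
const TOUCHFILES_SKETCH: Record<string, string[]> = {
  'ask-user-question-format-pty': ['plan-ceo-review/**', 'test/helpers/claude-pty-runner.ts'],
  'plan-ceo-mode-routing': ['plan-ceo-review/**', 'test/helpers/claude-pty-runner.ts'],
  'budget-regression-pty': ['test/helpers/eval-store.ts', 'test/skill-budget-regression.test.ts'],
};

const TIERS_SKETCH: Record<string, Tier> = {
  'ask-user-question-format-pty': 'gate',
  'plan-ceo-mode-routing': 'periodic',
  'budget-regression-pty': 'gate',
};

// Assumed matcher: "dir/**" matches anything under dir/, otherwise exact match.
function matchesPattern(pattern: string, file: string): boolean {
  if (pattern.endsWith('/**')) {
    return file.startsWith(pattern.slice(0, -2));
  }
  return pattern === file;
}

// Returns the test keys in the given tier whose touchfiles include a changed file.
function selectTests(changedFiles: string[], tier: Tier): string[] {
  const selected: string[] = [];
  for (const name of Object.keys(TOUCHFILES_SKETCH)) {
    if (TIERS_SKETCH[name] !== tier) continue;
    const patterns = TOUCHFILES_SKETCH[name] ?? [];
    const touched = patterns.some((p) => changedFiles.some((f) => matchesPattern(p, f)));
    if (touched) selected.push(name);
  }
  return selected;
}
```

Under this reading, renaming a key (as this commit does) has to happen in both maps at once, which is why the diff touches the entry and its tier line together.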

test/skill-e2e-auq-format-compliance.test.ts renamed to test/skill-e2e-ask-user-question-format-compliance.test.ts

Lines changed: 15 additions & 15 deletions
@@ -16,12 +16,12 @@
  * Why real-PTY: the existing skill-e2e-plan-format tests cover what the
  * AGENT writes via the SDK (capture-to-file harness). This test covers
  * what the USER actually sees in the terminal — different bug class
- * (e.g., AUQ tool truncates long prose, conductor renderer mangles
+ * (e.g., AskUserQuestion tool truncates long prose, conductor renderer mangles
  * bullets, model collapses sections under token pressure). Two layers
  * of defense for a format-discipline regression that previously ate ~6
  * weeks of compliance drift before it was noticed.
  *
- * Trigger choice: /plan-ceo-review fires its mode-selection AUQ
+ * Trigger choice: /plan-ceo-review fires its mode-selection AskUserQuestion
  * deterministically and early (Step 0F), so we don't need to drive
  * through any prior questions to reach a format check.
  *
@@ -69,7 +69,7 @@ function findFormatGaps(visible: string): FormatGap[] {

  describeE2E('AskUserQuestion format compliance (gate)', () => {
    test(
-     'first AUQ from /plan-ceo-review contains all 7 mandated format elements',
+     'first AskUserQuestion from /plan-ceo-review contains all 7 mandated format elements',
      async () => {
        const session = await launchClaudePty({
          permissionMode: 'plan',
@@ -82,10 +82,10 @@ describeE2E('AskUserQuestion format compliance (gate)', () => {
  const since = session.mark();
  session.send('/plan-ceo-review\r');

- // Wait for a SKILL AUQ. Strategy: poll the visible buffer until it
+ // Wait for a SKILL AskUserQuestion. Strategy: poll the visible buffer until it
  // contains both a numbered-option list AND the format markers we
  // expect (ELI10 + Recommendation). When both are present, it IS a
- // real format-compliant AUQ — not a permission dialog or trust
+ // real format-compliant AskUserQuestion — not a permission dialog or trust
  // prompt.
  //
  // While polling, auto-grant any permission dialogs we see in the
@@ -94,7 +94,7 @@ describeE2E('AskUserQuestion format compliance (gate)', () => {
  const budgetMs = 300_000;
  const start = Date.now();
  let captured = '';
- let auqVisible = false;
+ let askUserQuestionVisible = false;
  let lastPermSig = '';
  // Snapshot debug counters every poll so the timeout error shows
  // WHY we never matched (cursor-found vs markers-found discrepancy).
@@ -106,20 +106,20 @@ describeE2E('AskUserQuestion format compliance (gate)', () => {
  await Bun.sleep(2000);
  if (session.exited()) {
    throw new Error(
-     `claude exited (code=${session.exitCode()}) before AUQ rendered.\n` +
+     `claude exited (code=${session.exitCode()}) before AskUserQuestion rendered.\n` +
      `Last visible:\n${session.visibleSince(since).slice(-2000)}`,
    );
  }
  const visible = session.visibleSince(since);
  // Marker check: anywhere in the post-slash region. Since `since`
  // is set right after sending /plan-ceo-review, there's no stale
- // AUQ above this line — the only AUQ that can produce these
+ // AskUserQuestion above this line — the only AskUserQuestion that can produce these
  // markers is the current one.
  const hasEli10 = /ELI10\s*:/i.test(visible);
  const hasRecommend = /Recommendation\s*:/i.test(visible);

  // Cursor check: a numbered option list near the bottom of the
- // buffer means the AUQ is currently rendered (not scrolled away).
+ // buffer means the AskUserQuestion is currently rendered (not scrolled away).
  const cursorTail = visible.slice(-4000);
  const hasCursor = isNumberedOptionListVisible(cursorTail) &&
    parseNumberedOptions(cursorTail).length >= 2;
@@ -129,7 +129,7 @@ describeE2E('AskUserQuestion format compliance (gate)', () => {

  // Permission dialog branch: grant once per unique rendering, but
  // only when we don't already have format markers visible (so we
- // don't accidentally grant a permission inside a real AUQ).
+ // don't accidentally grant a permission inside a real AskUserQuestion).
  if (
    hasCursor &&
    !(hasEli10 && hasRecommend) &&
@@ -144,18 +144,18 @@ describeE2E('AskUserQuestion format compliance (gate)', () => {
      }
    }

-   // Real AUQ check: cursor visible AND markers present anywhere in
+   // Real AskUserQuestion check: cursor visible AND markers present anywhere in
    // the post-slash region.
    if (hasCursor && hasEli10 && hasRecommend) {
      debugBothSeen++;
      captured = visible;
-     auqVisible = true;
+     askUserQuestionVisible = true;
      break;
    }
  }
- if (!auqVisible) {
+ if (!askUserQuestionVisible) {
    throw new Error(
-     `AUQ not rendered within ${budgetMs}ms.\n` +
+     `AskUserQuestion not rendered within ${budgetMs}ms.\n` +
      `Debug counts: cursorSeen=${debugCursorSeen} markersSeen=${debugMarkersSeen} bothSeen=${debugBothSeen}\n` +
      `Last visible (4KB):\n${session.visibleSince(since).slice(-4000)}`,
    );
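
Reviewer note: the polling loop in the hunks above makes a three-way decision on each tick. Pulling that decision into a pure function makes the precedence easier to see and to unit-test without a PTY. The sketch below is an illustration only. `classifyPoll`, `PollSignals`, and the `'permission_dialog'` and `'waiting'` literals are hypothetical names. `'real_question'` is borrowed from the commit message, which does not say where that literal is used. The two marker regexes are copied from the diff.

```typescript
type PollOutcome = 'real_question' | 'permission_dialog' | 'waiting';

interface PollSignals {
  hasCursor: boolean;         // a numbered option list is rendered near the tail
  hasEli10: boolean;          // /ELI10\s*:/i matched in the post-slash region
  hasRecommend: boolean;      // /Recommendation\s*:/i matched in the post-slash region
  permissionVisible: boolean; // a permission dialog phrase is on screen (assumed input)
}

// Format markers win over the permission branch, mirroring the diff's
// `!(hasEli10 && hasRecommend)` guard on the auto-grant path.
function classifyPoll(s: PollSignals): PollOutcome {
  if (s.hasCursor && s.hasEli10 && s.hasRecommend) return 'real_question';
  if (s.hasCursor && s.permissionVisible) return 'permission_dialog';
  return 'waiting';
}

// Convenience wrapper: derives the marker flags from raw buffer text.
function classifyVisible(
  visible: string,
  hasCursor: boolean,
  permissionVisible: boolean,
): PollOutcome {
  return classifyPoll({
    hasCursor,
    hasEli10: /ELI10\s*:/i.test(visible),
    hasRecommend: /Recommendation\s*:/i.test(visible),
    permissionVisible,
  });
}
```

The ordering matters: checking the markers first means a question that happens to render alongside permission-like wording is never auto-granted.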
@@ -165,7 +165,7 @@ describeE2E('AskUserQuestion format compliance (gate)', () => {
  // Surface the captured text last 3KB on failure for debugging.
  const tail = captured.slice(-3000);
  throw new Error(
-   `AUQ format compliance FAILED — missing ${gaps.length} mandated field(s):\n` +
+   `AskUserQuestion format compliance FAILED — missing ${gaps.length} mandated field(s):\n` +
    gaps.map(g => ` - ${g.field} (regex: ${g.re.source})`).join('\n') +
    `\n--- captured (last 3KB) ---\n${tail}`,
  );
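
Reviewer note: `findFormatGaps` itself is outside the changed hunks, but the error message above shows that each gap carries a `field` and a `re`. A table-driven check along those lines might look like the sketch below. Only the `ELI10` and `Recommendation` patterns come from this diff. The `Net` and `(recommended)` patterns, the function names, and the list being limited to four entries are assumptions, and the real test checks 7 elements whose exact regexes are not shown here.

```typescript
interface FormatGap {
  field: string;
  re: RegExp;
}

// Required markers. None of these regexes use the `g` flag, so `test()` is stateless.
const REQUIRED_FIELDS_SKETCH: FormatGap[] = [
  { field: 'ELI10', re: /ELI10\s*:/i },                     // from the diff
  { field: 'Recommendation', re: /Recommendation\s*:/i },   // from the diff
  { field: 'Net', re: /Net\s*:/i },                         // assumed pattern
  { field: '(recommended) label', re: /\(recommended\)/i }, // assumed pattern
];

// Returns the required fields whose pattern does not appear in the captured text.
function findFormatGapsSketch(visible: string): FormatGap[] {
  return REQUIRED_FIELDS_SKETCH.filter((g) => !g.re.test(visible));
}

// Formats gaps in the same style as the failure message in the diff.
function describeGaps(gaps: FormatGap[]): string {
  return gaps.map((g) => ` - ${g.field} (regex: ${g.re.source})`).join('\n');
}
```

Reporting the regex source next to the field name, as the real error message does, tells whoever reads the failure exactly which pattern the rendered terminal text failed to satisfy.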

test/skill-e2e-autoplan-chain.test.ts

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ describeE2E('/autoplan chain ordering (periodic)', () => {
  const visible = session.visibleSince(since);

  // Auto-grant any permission dialog so autoplan can keep moving
- // through its phases. The autoplan template auto-decides AUQs
+ // through its phases. The autoplan template auto-decides AskUserQuestions
  // it owns; only permission prompts (file/tool grants) need our
  // hand-pressing. Classify on tail to avoid stale matches.
  const recentTail = visible.slice(-1500);
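
Reviewer note: the changelog says this test asserts `/autoplan` phase ordering by recording timestamps as each `**Phase N complete.**` marker appears. The sketch below swaps in a simpler technique for illustration: it reads the phase numbers in order of first appearance in the captured buffer and checks that they ascend. It is not the timestamp-based approach the test uses. The function names are hypothetical, and making the surrounding asterisks optional is an assumption, since this diff does not show whether the terminal renders them.

```typescript
// Collects phase numbers in order of first appearance in the captured text.
function phaseCompletionOrder(visible: string): number[] {
  // Asterisks are optional in case the renderer strips markdown bold (assumption).
  const re = /(?:\*\*)?Phase (\d+) complete\./g;
  const seen: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(visible)) !== null) {
    const n = Number(m[1] ?? '');
    // Repaints can print the same marker twice; keep the first occurrence only.
    if (seen.indexOf(n) === -1) seen.push(n);
  }
  return seen;
}

// True when every element is larger than the one before it.
function isStrictlyAscending(xs: number[]): boolean {
  for (let i = 1; i < xs.length; i++) {
    const prev = xs[i - 1];
    const cur = xs[i];
    if (prev === undefined || cur === undefined || cur <= prev) return false;
  }
  return true;
}
```

A check like `isStrictlyAscending(phaseCompletionOrder(visible))` would flag a run where a later phase reports completion before an earlier one, although unlike timestamps it cannot say how long each phase took.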
