Commit 2ca8539

AlexanderMakarov and Aleksandr Makarov authored
test: three-layer static safety net (lint + installer + fixtures) (#117)
* test: add static prompt linter (Layer 1)

First of three test layers from the prompt-update safety-net plan. The linter exercises wrapper/root command symmetry, dimension DAG validity, frontmatter schema, slash-command cross-references, agent-marker presence, and setup-config-to-source-tree consistency.

Pure Node built-ins (node:test, node:fs, node:path) plus a hand-rolled YAML frontmatter parser — no npm dependencies, runs under both node --test and bun test.

Two tests use a portable pending() helper that stays green until their gated contracts land:

- F18 (wrapper description must match root) — hire.md has known drift today; the lint flips to a hard failure once F18 lands.
- F5 (implement.md XML verification snippets) — F5 hasn't shipped yet; the lint flips on automatically when implement.md grows <verification_commands>, <scope_discipline>, <investigate_before_answering>.

* test: add installer unit tests (Layer 2)

Exercises the installer services against real fs.mkdtemp() directories, no mocks. Covers:

- file-copier: fresh install, new-file auto-discovery, dry-run honesty, and a pinned assertion of the current wrapper-overwrite behaviour. The wrapper-overwrite test carries a comment referencing plan §11's open question (docs call wrappers user-editable, code overwrites them) so a future PR which intentionally changes the policy fails loudly and forces a docs/code reconciliation.
- migration-runner: end-to-end migration chain, skip_if_any semantics, dry-run safety, and a meta-check that version numbers are sequential. Includes a non-blocking F12 placeholder for the wrapper-rewrite migration's tests.
- setup-orchestrator: six-step pipeline against a temp dir, idempotency on re-run, and dry-run produces zero on-disk files.

A silenced() helper muzzles the installer's progress logging during tests. All built-ins; no npm dependencies; runs under both node --test and bun test.

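The Layer 2 description above (real temp directories, dry-run honesty, muted progress logging) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `makeTempProject`, this `silenced` implementation, and the stand-in `copyFile` step are all hypothetical, and the sketch assumes the step under test is synchronous.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: create a throwaway project directory under the OS temp dir.
function makeTempProject(prefix = 'awos-test-') {
  return fs.mkdtempSync(path.join(os.tmpdir(), prefix));
}

// Hypothetical helper: run fn with console.log muted, restoring it afterwards
// even if fn throws. Assumes fn is synchronous.
function silenced(fn) {
  const original = console.log;
  console.log = () => {};
  try {
    return fn();
  } finally {
    console.log = original;
  }
}

// Stand-in for an installer step (NOT the real file-copier): logs progress and
// copies one file unless dryRun is set.
function copyFile(srcFile, destDir, { dryRun = false } = {}) {
  const dest = path.join(destDir, path.basename(srcFile));
  console.log(`${dryRun ? '[dry-run] ' : ''}copy ${srcFile} -> ${dest}`);
  if (!dryRun) fs.copyFileSync(srcFile, dest);
  return dest;
}
```

A dry-run test would then call `silenced(() => copyFile(file, dest, { dryRun: true }))` and assert that `fs.readdirSync(dest)` is still empty.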
* test: add fixture-driven end-to-end tests (Layer 3)

Each fixture under tests/fixtures/<name>/ is a small directory tree representing a real-world install scenario. The harness copies the before/ subtree into a fresh fs.mkdtemp() dir, runs the full installer pipeline, then asserts the result against expected-after.json — a manifest of { exists, contains, notContains, unchanged, sha256 } per relative path. Files not listed in the manifest are not asserted, keeping fixtures selective and short.

v1 fixtures shipped:

- fresh-project: empty before/, asserts the canonical post-install layout including .awos/.../*, .claude/commands/awos/*, .mcp.json, and .claude/settings.json.
- existing-awos-v0: pre-existing stale .awos/commands/architecture.md, asserts the framework overwrites it with current content.
- customized-wrapper: pre-existing wrapper with custom user text under .claude/commands/awos/, asserts the CURRENT overwrite behaviour. The expected-after.json carries a comment referencing plan §11's open question so a future PR which intentionally preserves wrappers fails loudly and is forced to reconcile docs vs code together.
- mid-workflow: pre-populated context/spec/001-test-feature/* — asserts the installer touches none of them (unchanged: true on each file).
- pre-migration-v1: agent at the v0 path; after migrations 001 then 002 run, asserts the final tree (.awos/.migration-version == 2 and the agent directory has been cleaned up).

* chore: wire test scripts and non-blocking CI job

- package.json: replace the placeholder test script with node --test variants for each layer (no npm deps; Node 22+ has node:test built-in; Bun runs the same tests via bun test for local cross-runtime checks).
- .github/workflows/quality-check.yml: add a 'test' job alongside prettier on Node 22, marked continue-on-error: true so it surfaces failures without blocking PRs. Flip to required after two consecutive green runs (plan PR 4).
- CLAUDE.md: add a Testing section documenting the three layers and the rule that audit-driven follow-ups must ship their lint rule or fixture in the same PR as their structural change.
- CONTRIBUTING.md: short pointer at npm test (primary) and bun test (local cross-runtime sanity check), plus how to add a fixture.

* test: harden F18/F5 checks now that audit contracts landed; drop F12 placeholder

After rebasing onto feat/anthropic-best-practices-alignment, the three contracts these placeholders were waiting on are decided:

- F18 (wrapper description matches root): wrappers now mirror their root command's description. Removes the soft pending() fallback so any future drift hard-fails the suite.
- F5 (XML verification snippets in implement.md): the three required tags (<verification_commands>, <scope_discipline>, <investigate_before_answering>) are present. Same: hard assertion.
- F12 (wrapper-rewrite migration): F12 shipped as a direct source rewrite, not a migration. Standard always-overwrite copy semantics deliver the new @-import wrappers; no migration ever coming. Removes the obsolete placeholder.

Drops the now-unused pending() helper.

* docs(tests): add tests/README.md explaining strategy and commands

Documents the three-layer safety net (lint / installer / fixtures), the Run commands (npm test primary, bun test for cross-runtime sanity), the rule for adding tests when introducing new structural contracts, and the hard constraints (no npm deps, cross-runtime compatible).

* fix(test): use glob pattern for node --test on Node 22

`node --test tests/` resolves `tests` as a module entry point, not a directory to scan recursively, which fails with MODULE_NOT_FOUND on CI. Node 22's --test supports glob patterns natively when the pattern is single-quoted so the shell doesn't pre-expand it.

* test(lint): assert subagent-enumerating commands point at .claude/agents/

Pins the F3 follow-up fix. architecture.md, tasks.md, tech.md, and hire.md all need a definitive list of registered specialist subagents (for coverage tables and **[Agent: name]** task assignments) — auto-dispatch metadata is not a substitute. The documented enumeration path per Anthropic's sub-agents docs is the filesystem at .claude/agents/*.md, parsed for YAML frontmatter.

This test fails fast if a future edit reverts to "just trust auto-dispatch" phrasing without naming the discovery source.

* test(lint): drop F5 verification-snippet check, keep F8/F9 checks

Follows F5 removal on feat/anthropic-best-practices-alignment. implement.md no longer ships a <verification_commands> block by default, so this lint can't assert it. F8 <scope_discipline> and F9 <investigate_before_answering> remain — they're separate audit findings and still part of the formulated subagent prompt.

The renamed test "implement.md uses XML scope-and-investigate snippets" makes the narrowed scope explicit.

* test(e2e): add session-log parser and assertion DSL (Layer 4 foundation)

Adds tests/e2e/session-reader.js to parse Claude Code JSONL session logs from ~/.claude/projects/<encoded-cwd>/, and tests/e2e/expect.js with a small assertion DSL (expectToolCall / expectNoToolCall / expectFileExists) for scenario authors.

A hand-crafted fixture under tests/e2e/fixtures/ exercises every event-type shape the parser cares about, and 6 node:test cases pin the contract — making the parser testable without needing a real Claude session.

No npm deps; cross-runtime (passes under both `node --test` and `bun test`). This is the foundation for the human-triggered E2E scenarios that follow.

* test(e2e): add prepare/verify CLIs and tasks-enumerates-agents scenario

Adds bin/awos-e2e-prepare.js (spins a temp project, runs the real installer, overlays scenario fixture, stamps a prepare-time so verify can filter session logs) and bin/awos-e2e-verify.js (locates the right JSONL under ~/.claude/projects/, parses it, runs the scenario's assert.js, reports pass/fail with the recent tool-call trace).

Ships the first scenario — tasks-enumerates-agents — which validates the F3 follow-up: /awos:tasks must scan .claude/agents/ rather than guess specialists. The assertion is a tolerant union (Glob/Read/LS/Grep against the agents path, or an Agent/Explore delegation mentioning it counts) so the contract holds regardless of which discovery shape Claude picks. It also reconciles every **[Agent: name]** marker in tasks.md against an actual agent file in the fixture, catching hallucinated specialists.

Wires e2e:prepare, e2e:verify, and test:e2e npm scripts. The scenario runs are human-triggered and stay out of CI; the parser unit test already runs through `npm test` via the existing glob.

* docs(tests): document Layer 4 session-log E2E in tests/README.md

Adds a new "Layer 4 — Session-log E2E (human-triggered)" section after Layer 3 explaining what the harness is, why it complements static lint (catches "Claude doesn't follow the prompt" — the gap the source-only lint cannot close), how to run prepare/verify, and how to add a new scenario. Notes explicitly that scenario runs are NOT a CI gate; only the parser unit test runs through `npm test`.

* fix(e2e): resolve symlinks and encode underscores in cwd→project mapping

Two bugs in encodeCwd broke verify on macOS:

1. macOS /var/folders/... is a symlink to /private/var/folders/... Claude Code records the canonical (realpath) form for the session's project directory, so the encoder must too. Added fs.realpathSync with a fallback for non-existent paths (synthetic test inputs).
2. Claude Code converts BOTH `/` and `_` to `-` when forming the project directory name. The old encoder only handled `/`, so `_x/` in the temp path resolved to the wrong session directory. `/private/var/folders/_x/.../foo` now correctly encodes to `-private-var-folders--x-...-foo` (note the `--x-` double dash).

Verified end-to-end against a real Claude Code session: the verify command now finds the session log and reports [pass].

* refactor(e2e): narrate each check in verify output

The verify harness was reporting "[pass] X events, Y tool calls" — which says the suite ran, not what was actually validated. Recasts the scenario as a sequence of named `check(description, fn)` calls and streams a pass/fail line per check.

New API: expect.makeChecker(report) returns a check() helper. The verify CLI builds a report with pass/fail callbacks that write `✓` and `✗` lines, then passes `check` into the scenario's run({...}). Scenarios call `await check('what was verified', () => { ... })`.

The tasks-enumerates-agents scenario now narrates 7 distinct checks: discovery, per-agent read evidence, output file written, marker presence for each seeded agent, and no hallucinated agent names. Output now reads like a verification log instead of an inventory.

* docs(claude.md): document four-layer test model and narration rules

Adds Layer 4 (session-log E2E) to the testing section and codifies two rules that future PRs (human or agent-authored) need to inherit:

1. Static vs. behavioral coverage. Layers 1–3 verify source-file wiring; only Layer 4 verifies that Claude actually follows the wiring at runtime. Layer 1 can assert that commands/tasks.md mentions .claude/agents/; only Layer 4 proves Claude opened it.
2. Tests must narrate what they checked. E2E scenarios use `await check('what was verified', () => { ... })` so each assertion becomes a streamed ✓/✗ line. Lint failures must name the contract, not just dump a diff. "N events found" is a summary, not a verification log.

Common Commands section grows entries for `npm run test:e2e`, `e2e:prepare`, and `e2e:verify`. Extends the audit-followup rule to cover behavioral contracts: "Claude must call X" goes to a Layer-4 scenario.

* feat(e2e): friendlier UX + de-historicize testing rules in CLAUDE.md

Ergonomics:

- `bun run e2e` / `e2e:list` prints all scenarios with descriptions parsed from each INSTRUCTIONS.md's first non-heading line.
- `bun run e2e:prepare <scenario>` now records the workdir + scenario at tests/e2e/.last-run.json (gitignored).
- `bun run e2e:verify` with no args resumes the last prepare; with a scenario arg, infers the matching workdir from state. Three positional forms supported: (), (scenario), (scenario, workdir).
- prepare's output now spells out the cd → claude → verify steps so the next move is copy-paste obvious.

CLAUDE.md:

- Replace the layer-by-layer audit-followup pep talk with a "lowest layer that expresses the contract" table.
- Drop references to specific scenarios and example assertion strings; the rule is the rule, not a snapshot of today's tests.
- Inventory of current Layer-4 scenarios lives in tests/README.md, not in always-loaded project memory.

* test(e2e): add implement-orchestrator-only scenario

Asserts /awos:implement behaves as a dispatcher: it delegates the actual coding via the Agent tool and does not call Edit/Write/MultiEdit on source files itself.

The contract covered (commands/implement.md + CLAUDE.md):

- at least one Agent/Task call carries a subagent_type
- no Edit/Write/MultiEdit on anything other than tasks.md (checkbox flips are bookkeeping and allowed)
- the delegation prompt carries <verification_commands> or the concrete pytest command from tasks.md
- the prompt also carries <scope_discipline> and <investigate_before_answering> (F5 guards)
- the **[Agent: python-expert]** marker survives in tasks.md

* test(e2e): add architecture-builds-coverage-table scenario

Asserts /awos:architecture scans .claude/agents/, reads its product-definition + roadmap prerequisites, and writes a coverage table that reflects reality. The fixture seeds only python-expert (no react-expert) over a product that needs both halves, so a correctly-following Claude must emit a table with at least one ✅ Exists row (Python) and at least one ⚠️ Missing row (React/frontend).

Same tolerant discovery union as tasks-enumerates-agents — Glob/Read/LS/Grep or Agent/Explore delegation all count as proof.

* test(e2e): add tech-uses-parallel-reads-and-explore scenario

Asserts /awos:tech follows the parallel-reads + Explore-delegation contract from commands/tech.md Step 2:

- Reads functional-spec.md and architecture.md in parallel (proven by shared assistantUuid — calls emitted in one assistant turn)
- Scans .claude/agents/ for specialists
- Delegates codebase analysis to a subagent whose subagent_type matches /Explore/i
- Writes technical-considerations.md at the spec path

The parallel-tool-call check leans on the session-reader's existing assistantUuid field — no parser changes needed.

* test(e2e): add verify-runs-real-verification scenario

Asserts /awos:verify honours the F5 contract — it must actually run a verification check before flipping Status to Completed, not just reason textually about the spec. The fixture pre-installs src/health.py and tests/test_health.py so the criteria are true.

The assertion uses a tolerant union over the mechanisms commands/verify.md Step 3 lists:

- a real Bash run of pytest/python/node/curl/etc.
- a Read on the implementation artifact (src/health.py)
- any Playwright MCP call

That mirrors the prompt's "pick the check that fits the criterion type" wording without locking the test to one mechanism, then asserts the spec file now contains Status: Completed.

* docs(tests): document four new Layer-4 scenarios

Expand the Layer-4 scenarios table to cover the new implement-orchestrator-only, architecture-builds-coverage-table, tech-uses-parallel-reads-and-explore, and verify-runs-real-verification scenarios. Columns now include the target slash command and the contract type each one exercises so future authors have a reference for each pattern (discovery + output, negative + delegation, output + table semantics, parallel calls, observable verification).

Also restructure the "Adding a new scenario" section as a numbered recipe and call out the check() wrapping requirement explicitly.

* test(e2e): drop F5 verify scenario and verification checks from implement

F5 (verification-required behavior) was reverted on the audit branch. This removes the corresponding test coverage:

- Delete tests/e2e/scenarios/verify-runs-real-verification/ entirely. The contract it tested no longer exists; /awos:verify is back to textual reasoning over acceptance criteria.
- In implement-orchestrator-only/assert.js, drop the "Claude passed verification commands to the subagent" check. Keep the F8 <scope_discipline> + F9 <investigate_before_answering> guard check — those are separate audit findings still in force.
- tests/README.md table loses the verify row; the implement row is reworded to mention F8/F9 instead of F5.

The lint-level companion change (drop F5 substring check) landed in the same commit on tests/safety-net.

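The cwd-to-project-directory encoding fix described in the fix(e2e) entry above could look roughly like this. It is a sketch reconstructed from the commit text only: the function body is an assumption, and the real encodeCwd may handle more characters than `/` and `_`.

```javascript
const fs = require('node:fs');

// Sketch (assumed implementation): map a working directory to the directory
// name used under ~/.claude/projects/, per the two fixes in the commit text.
function encodeCwd(cwd) {
  let canonical = cwd;
  try {
    // Resolve symlinks, e.g. macOS /var/folders -> /private/var/folders.
    canonical = fs.realpathSync(cwd);
  } catch {
    // Non-existent (synthetic) path: fall back to the input as given.
  }
  // Both "/" and "_" become "-".
  return canonical.replace(/[/_]/g, '-');
}

// A path that does not exist on disk takes the fallback branch:
console.log(encodeCwd('/no-such-root/_x/foo')); // prints "-no-such-root--x-foo"
```

The double dash in the output comes from the `/` and the `_` in `/_x` each mapping to `-`, which is the case the old encoder got wrong.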
* ui(e2e): drop duplicated prepare output, merge claude+prompt step

prepare.js used to render INSTRUCTIONS.md and then re-print the workdir + a "Next steps" block + three verify-command variants — all of which the INSTRUCTIONS already cover. Drops everything after the rendered INSTRUCTIONS.

Across all four scenarios, the Steps section collapses from

  1. cd <workdir>
  2. claude
  3. Type: /awos:command args
  4. ... wait for completion ...
  5. bun run e2e:verify ...

into a single one-shot invocation that opens claude with the prompt preloaded:

  1. cd <workdir> && claude "/awos:command args"
  2. ... wait for completion ...
  3. bun run e2e:verify

Also switches the post-run command from `npm run e2e:verify <name> <dir>` to bare `bun run e2e:verify` (which resumes the last prepare via the state file added in 2392b75).

* test(e2e): loosen tech scenario assertions to match practical contract

Two strict assertions caused false-negative failures on real Claude runs where the practical contract was met:

1. "Reads issued in parallel" required both Reads to share an assistantUuid (same assistant turn). Real Claude consistently does back-to-back single-Read turns instead — same context outcome, no work between them. Now passes if the two target Reads are adjacent in the tool-call stream (index diff ≤ 1), which subsumes the strict same-turn case.
2. "Claude delegated to the Explore agent" required an Agent call with subagent_type matching /Explore/i. For tiny fixture codebases (2 files) Claude pragmatically inlines the Reads — the orchestrator-context-lean goal is preserved either way. Now passes if EITHER an Explore delegation happened OR at least one direct Read landed on a file under src/.

The strict versions were over-fitting to a single implementation path. The new wording asserts the actual outcome the prompt cares about; tests/README.md table updated to reflect that.

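The loosened "adjacent Reads" rule above (index diff ≤ 1 in the tool-call stream) can be illustrated with a small predicate. The tool-call shape used here, `{ name, input: { file_path } }`, is an assumption for illustration, not the session-reader's documented output.

```javascript
// Sketch: true when Reads of both target files sit next to each other in the
// ordered tool-call stream. The call shape is hypothetical.
function readsAreAdjacent(toolCalls, fileA, fileB) {
  const indexOfRead = (target) =>
    toolCalls.findIndex(
      (call) =>
        call.name === 'Read' &&
        String(call.input?.file_path ?? '').endsWith(target)
    );
  const a = indexOfRead(fileA);
  const b = indexOfRead(fileB);
  return a !== -1 && b !== -1 && Math.abs(a - b) <= 1;
}

const stream = [
  { name: 'Read', input: { file_path: 'context/spec/001/functional-spec.md' } },
  { name: 'Read', input: { file_path: 'context/product/architecture.md' } },
  { name: 'Glob', input: { pattern: '.claude/agents/*.md' } },
];
console.log(readsAreAdjacent(stream, 'functional-spec.md', 'architecture.md')); // prints true
```

Two Reads emitted in the same assistant turn are also neighbours in the stream, which is why this check subsumes the stricter same-turn version.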
* test(e2e): align architecture scenario with /awos:hire-owned coverage

Audit branch moved the durable coverage report from architecture.md to context/product/agents.md, owned by /awos:hire. /awos:architecture now only runs a verbal "coverage hint" in chat.

- Renames the scenario from architecture-builds-coverage-table to architecture-runs-coverage-hint.
- Replaces the table-content assertions with a boundary check: architecture.md MUST NOT contain a Technology|...|Status table (that report belongs to agents.md now).
- Loosens the lint test: architecture.md is a "light referencer" that mentions .claude/agents/ without parsing frontmatter; the strict frontmatter requirement still applies to tasks/tech/hire.
- Rewrites the fixture roadmap.md in canonical AWOS shape (### Phase headers, italic intros, nested bold checklists) per PR review feedback on PR #117.
- Updates tests/README.md table row and INSTRUCTIONS.md.

* docs(tests): drop audit-finding ID references in favor of descriptions

Lint tests, scenario asserts, INSTRUCTIONS, and tests/README.md referenced audit findings by ID (F3, F5, F8/F9, F11, F12, F18). Those IDs are anchored to docs/awos-alignment-audit.md, which is a snapshot document — readers of the test suite shouldn't need to cross-reference it to understand what each test pins down.

Each reference is replaced with a plain-language description of what the test actually verifies:

- F12 → "@-import vs legacy 'Refer to…' migration"
- F18 → "wrappers must mirror the root command's description"
- F3 → "produce durable specialist assignments / coverage report"
- F8 → "scope-discipline (don't over-engineer)"
- F9 → "investigate-before-answering (don't hallucinate)"
- F5 → silent — verification policy is intentionally out of scope
- F11 → "wrapper-frontmatter contract"

No behavioral changes. 38/38 tests still pass.

* test(e2e): extract pathAccessCalls helper; recognize Bash read commands

Three scenarios duplicated the agent-discovery union (Glob / Read / LS / Grep / Agent over .claude/agents/). Real Claude sessions also use `Bash ls` to list directories — that wasn't in any of the three copies, so the architecture-runs-coverage-hint scenario falsely failed.

Changes:

- Extract pathAccessCalls(toolCalls, pathRegex) into tests/e2e/expect.js so all scenarios stay in sync going forward.
- Add Bash to the union, but only when the command contains a read-like token (ls / cat / find / grep / head / tail / wc / tree / file / stat / awk / sed / jq / yq / less / more / etc.) AND the path. A bare mention via `echo` or destructive ops like `mkdir` / `rm` does NOT count — Bash is generic, so the heuristic has to filter for reads specifically.
- Replace the local discoveryHits in architecture, tasks, and tech scenarios with calls to the shared helper.

Verified against a real session where Claude used `ls .claude/agents/` instead of Glob — now correctly recognized as discovery. Suite stays at 38/38.

* test: extract session-log E2E into a sibling awos-qa repo

Behavioral end-to-end tests (real claude sessions, session-log parsing, per-command scenarios) had grown to outweigh the static suite. Moving them out keeps the AWOS repo focused on framework authoring, lets awos-qa grow other test types (evals, perf, integration) on its own cadence, and avoids prompt-author iteration loading the behavioral surface area into context.

Removed:

- bin/awos-e2e-{list,prepare,verify}.js
- tests/e2e/ (parser, expect DSL, fixtures, four scenarios)
- npm scripts: e2e, e2e:list, e2e:prepare, e2e:verify, test:e2e
- .gitignore entry for tests/e2e/.last-run.json
- Layer-4 sections in CLAUDE.md, tests/README.md (replaced with a one-line pointer at awos-qa for both)

CI suite drops from 38 to 32 tests (Layers 1–3 only); all green.

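The pathAccessCalls heuristic described above (read-style tools always count, Bash counts only when a read-like token and the path both appear) might be sketched like this. The signature matches the commit text, but the body, the token regex, and the `{ name, input }` call shape are assumptions, and the token list is a subset of the one the commit enumerates.

```javascript
// Tools that are reads by nature (per the commit text: Glob / Read / LS / Grep,
// plus an Agent delegation that mentions the path).
const READ_TOOLS = new Set(['Glob', 'Read', 'LS', 'Grep', 'Agent']);

// Read-like Bash tokens at a word boundary: start of command or after a
// space, ";", "&", "|" or "(".
const READ_LIKE_BASH =
  /(^|[\s;&|(])(ls|cat|find|grep|head|tail|wc|tree|file|stat|awk|sed|jq|yq|less|more)(\s|$)/;

// Sketch: return the tool calls that count as evidence of reading pathRegex.
// Pass a non-global regex; a /g regex would carry lastIndex between tests.
function pathAccessCalls(toolCalls, pathRegex) {
  return toolCalls.filter((call) => {
    const payload = JSON.stringify(call.input ?? {});
    if (!pathRegex.test(payload)) return false; // the path must be mentioned
    if (READ_TOOLS.has(call.name)) return true;
    if (call.name === 'Bash') {
      return READ_LIKE_BASH.test(String(call.input?.command ?? ''));
    }
    return false; // Edit, Write, etc. are not discovery
  });
}
```

With this filter, `ls .claude/agents/` counts as discovery while `echo .claude/agents/` and `mkdir -p .claude/agents/` do not, matching the behaviour the commit describes.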
* test(lint): cover INTERACTION, plugin agents, skills, hired-agents rename

Three commits landed on the base branch since this PR was opened:

- 0fe2618 moves the AskUserQuestion rule from each wrapper into a top-level # INTERACTION section in the corresponding core command, extends tech/hire/tasks subagent enumeration to plugin-provided agents (plugin-name: prefix in the Agent tool's description block), and renames context/product/agents.md -> hired-agents.md.
- da5ab5b adds a <use_available_skills> block to the implement.md delegation prompt, an Agent(subagent_type=...) invocation example in tech.md Step 2.4 and implement.md Step 3.1, and a skills-apply bullet in templates/agent-template.md.
- b119c03 drops the redundant CLAUDE.md preamble (no structural contract worth pinning).

Layer 1 lint additions:

- Every core commands/*.md declares its own # INTERACTION section that names AskUserQuestion; wrappers must not duplicate that rule.
- commands/tech.md, hire.md, tasks.md mention both the "plugin-name:" prefix and the Agent tool's description block (plugin-provided agent discovery path).
- commands/implement.md and tech.md show a literal Agent(subagent_type=..., ...) invocation example.
- commands/implement.md's XML-snippet check now also requires <use_available_skills> alongside <scope_discipline> and <investigate_before_answering>.
- templates/agent-template.md body cues the agent to apply its frontmatter skills:.
- context/product/hired-agents.md is referenced by at least one prompt; no prompt still uses the pre-rename context/product/agents.md.

tests/README.md updated to describe the new Layer 1 assertions.

---------

Co-authored-by: Aleksandr Makarov <amakarov@provectus.com>
1 parent 1e817e5 commit 2ca8539

25 files changed

Lines changed: 1853 additions & 2 deletions


.github/workflows/quality-check.yml

Lines changed: 18 additions & 0 deletions
@@ -20,3 +20,21 @@ jobs:

       - name: Run prettier check
         run: npx prettier . --check
+
+  # Test suite — non-blocking initially.
+  # Flip `continue-on-error: false` to make this a required gate once it has
+  # been green on two consecutive PRs (see plan PR 4).
+  test:
+    runs-on: ubuntu-latest
+    continue-on-error: true
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '22'
+
+      - name: Run tests
+        run: npm test

CLAUDE.md

Lines changed: 31 additions & 1 deletion
@@ -22,6 +22,16 @@ bunx prettier . --check
 npx prettier --write . # auto-format before committing
 bunx prettier --write .

+# Run the test suite (no npm deps; node --test built-in):
+npm test # all three layers
+npm run test:lint # Layer 1 — static prompt linter
+npm run test:installer # Layer 2 — installer unit tests
+npm run test:fixtures # Layer 3 — fixture-project end-to-end
+bun test tests/ # local cross-runtime sanity check (optional)
+
+# Behavioral / session-log E2E lives in the awos-qa repository
+# (sibling to this one). See its README for how to run.
+
 # Test installer against a separate project (pick one runner; $AWOS_REPO is the absolute path to this repo):
 cd ~/some-scratch-project
 npx $AWOS_REPO/index.js
@@ -30,7 +40,27 @@ bun $AWOS_REPO/index.js # direct exec also works
 npx $AWOS_REPO/index.js --dry-run # preview only
 ```

-There is no test suite. The installer runs on **Node 22+ or any recent Bun**. It uses only standard JS built-ins (`fs`, `path`) via CommonJS `require`, which both runtimes support — do not add npm dependencies or runtime-specific APIs without strong justification, as that would break cross-runtime compatibility.
+The installer runs on **Node 22+ or any recent Bun**. It uses only standard JS built-ins (`fs`, `path`) via CommonJS `require`, which both runtimes support — do not add npm dependencies or runtime-specific APIs without strong justification, as that would break cross-runtime compatibility.
+
+## Testing
+
+The repo has a three-layer test suite under `tests/`, all built on Node's `node:test` built-in — no npm dependencies. See `tests/README.md` for the detailed reference.
+
+1. **Static prompt linter** (`tests/lint-prompts.test.js`) — symmetry, frontmatter, marker presence, cross-references, dimension DAG, copy-table consistency, and grep-style checks for required substrings inside prompt bodies.
+2. **Installer unit tests** (`tests/installer/*.test.js`) — exercises the installer services against temp directories.
+3. **Fixture projects** (`tests/fixtures.test.js` + `tests/fixtures/<name>/`) — real installer runs against representative pre-install trees, with manifest-based assertions.
+
+All three layers run in CI (`npm test`).
+
+Behavioral end-to-end tests — the ones that run a real Claude Code session against a seeded scratch project and assert on the actual tool-call trace — live in the separate **`awos-qa`** repository (sibling to this one). See its README for how to run them.
+
+### Tests must narrate what they checked
+
+Output that says `N events found` or `M pass` tells you the suite ran, not what was validated. `assert.*` failure messages should name the contract being violated, not just dump a diff. Anyone reading the test output should understand which contracts were verified without opening the test source.
+
+### Adding tests for new contracts
+
+When a change introduces a structural contract — frontmatter key, marker pattern, migration, copy-table entry — its test ships in the same PR. Surface-area contracts (something a grep can catch) go to Layer 1. Mechanical contracts (installer behavior, migration idempotency) go to Layer 2 or 3. Behavioral contracts ("Claude must actually call X") belong in the `awos-qa` repository.

 ## Architecture: The Two-Folder Customization Model

CONTRIBUTING.md

Lines changed: 19 additions & 0 deletions
@@ -34,6 +34,25 @@ awos/
 └── src/ # AWOS installer source code
 ```

+## Running the Test Suite
+
+The repo ships with three test layers under `tests/`, all using Node's `node:test` built-in — no npm dependencies needed.
+
+```bash
+npm test # all layers (primary path; CI runs this on Node 22)
+npm run test:lint # static prompt linter
+npm run test:installer # installer unit tests
+npm run test:fixtures # fixture project tests
+
+bun test tests/ # optional local cross-runtime spot-check
+```
+
+Layer 1 (`tests/lint-prompts.test.js`) catches wrapper/root-command drift, dimension DAG breaks, and `setup-config.js` mismatches. Layers 2 and 3 (`tests/installer/`, `tests/fixtures.test.js`) exercise the installer against `fs.mkdtemp()` directories and commit-tracked fixture projects.
+
+When you add a new structural contract — a wrapper frontmatter key, an `agent-template.md` field, a migration, a marker pattern — add its lint rule, installer test, or fixture in the **same PR**. The safety net only works if coverage keeps pace.
+
+To add a fixture: drop a directory under `tests/fixtures/<name>/`, optionally with a `before/` subtree (gets copied to the temp project as the starting state) and an `expected-after.json` manifest listing the files to assert. See existing fixtures for examples.
+
 ## Testing Changes Locally

 ### We recommend testing in a Pet Project

package.json

Lines changed: 4 additions & 1 deletion
@@ -8,7 +8,10 @@
   },
   "main": "index.js",
   "scripts": {
-    "test": "echo \"Error: no test specified\" && exit 1"
+    "test": "node --test 'tests/**/*.test.js'",
+    "test:lint": "node --test tests/lint-prompts.test.js",
+    "test:installer": "node --test 'tests/installer/*.test.js'",
+    "test:fixtures": "node --test tests/fixtures.test.js"
   },
   "keywords": [],
   "author": "Provectus Inc.",

tests/README.md

Lines changed: 142 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,142 @@
1+
# AWOS test suite
2+
3+
A three-layer safety net that catches structural regressions in AWOS prompts and installer behavior at PR time. Built on Node's `node:test` built-in — **zero npm dependencies**. Runs identically under `node --test` (CI primary) and `bun test` (local cross-runtime sanity).
4+
5+
## Why this exists
6+
7+
AWOS distributes markdown prompts that users install into their projects. Prompts and the installer share several silent contracts — task markers, file paths, frontmatter fields, dimension DAGs, copy semantics — and a typo or rename in one prompt can break a downstream user's `/awos:implement` run a week later, with no PR-time signal. This suite asserts those contracts so future prompt edits get caught before they ship.
8+
9+
What it does **not** do: validate prompt _behavior_ (does the LLM actually do the right thing when run?). That requires an LLM in the loop and an API budget; we rely on the scratch-project smoke test described in the root `CLAUDE.md` for that.
10+
11+
## Running the suite
12+
13+
```sh
14+
# Primary path (Node 22+, what CI runs)
15+
npm test
16+
17+
# Per-layer
18+
npm run test:lint # Layer 1
19+
npm run test:installer # Layer 2
20+
npm run test:fixtures # Layer 3
21+
22+
# Local cross-runtime sanity check
23+
bun test tests/
24+
```
25+
26+
CI runs `npm test` under Node 22 in `.github/workflows/quality-check.yml` (non-blocking initially; flip to required after two consecutive green PR runs).

## Layout

```
tests/
├── README.md                     # this file
├── lint-prompts.test.js          # Layer 1: static prompt linter
├── config/
│   └── wrapper-schema.json       # which wrapper frontmatter fields are required
├── installer/                    # Layer 2: installer unit tests
│   ├── file-copier.test.js
│   ├── migration-runner.test.js
│   └── setup-orchestrator.test.js
├── fixtures.test.js              # Layer 3: harness for example projects
├── fixtures/                     # Layer 3: example projects
│   ├── fresh-project/
│   ├── existing-awos-v0/
│   ├── customized-wrapper/
│   ├── mid-workflow/
│   └── pre-migration-v1/
└── helpers/
    ├── frontmatter.js            # minimal YAML-frontmatter parser, no deps
    ├── manifest.js               # load + assert fixture manifests
    └── temp-project.js           # mkdtemp / copyTree / silenced helpers
```

Behavioral end-to-end tests (real `claude` sessions, session-log parsing) live in the separate **`awos-qa`** repository.

## Layer 1: Static prompt linter

`tests/lint-prompts.test.js`. Reads markdown across `commands/`, `claude/commands/`, `templates/`, and `plugins/awos/skills/ai-readiness-audit/dimensions/` and asserts:

- **Wrapper symmetry.** Every `claude/commands/<name>.md` has a matching `commands/<name>.md`.
- **Wrapper include line.** Each wrapper contains either `@.awos/commands/<name>.md` (preferred) or the legacy `Refer to the instructions located in this file: .awos/commands/<name>.md`. Output logs the count of each form so the migration to `@`-import is visible.
- **Wrapper frontmatter schema.** Required keys are defined in `tests/config/wrapper-schema.json`. Tighten this file (don't edit the test) when new wrapper-frontmatter contracts are added.
- **Wrapper description matches root.** Drift between a wrapper's `description` and the corresponding root command's `description` fails the suite. The slash-command palette shows the wrapper's text, so it has to stay in sync with the canonical one.
- **Agent marker preservation.** `commands/tasks.md` (writer) and `commands/implement.md` (reader) both contain the literal `**[Agent: ` marker token. This is how the orchestrator extracts each task's specialist assignment.
- **XML scope, investigate, and skills snippets.** `commands/implement.md` contains `<scope_discipline>` (don't over-engineer), `<investigate_before_answering>` (don't hallucinate), and `<use_available_skills>` (apply matching project/user/plugin skills), all passed through to the delegated subagent prompt.
- **`Agent()` invocation example.** `commands/implement.md` and `commands/tech.md` both show an explicit `Agent(subagent_type=..., ...)` call so the delegation step is concrete, not just described.
- **`INTERACTION` section in every core command.** Every `commands/*.md` declares its own `# INTERACTION` section that names `AskUserQuestion`. Wrappers must _not_ duplicate that rule: AWOS targets Claude Code only, so the tool is a framework default, not host-specific customization.
- **Subagent discovery (filesystem + plugins).** `commands/tasks.md`, `commands/tech.md`, and `commands/hire.md` reference both `.claude/agents/` (project-local, parsed via frontmatter) _and_ the `Agent` tool's description block (plugin-provided agents, recognized by the `plugin-name:` prefix on `subagent_type`).
- **`agent-template.md` cues skills application.** The body of `templates/agent-template.md` instructs spawned agents to apply the skills listed in their frontmatter. Without this, `/awos:hire`'s skill-attachment work is inert at run time.
- **`context/product/hired-agents.md` rename pinned.** The `/awos:hire`-owned coverage report is referenced at its post-rename path; no prompt still references the legacy `context/product/agents.md`.
- **Slash-command cross-references.** Every `/awos:<word>` mentioned in any prompt resolves to a real `commands/<word>.md` (or the plugin path for `/awos:ai-readiness-audit`).
- **Dimension DAG.** Every dimension under `plugins/awos/skills/ai-readiness-audit/dimensions/*.md` has required frontmatter, `name` matches its filename, severity is in the allowed set, `depends-on` entries resolve to real dimension names, and the graph topologically sorts (no cycles).
- **`context/...` path consistency.** Cross-prompt path references are mutually reachable: if two prompts read the same path, at least one writer of it must exist.
- **`setup-config.js` ↔ source-tree consistency.** Every `copyOperation.source` directory exists; every top-level source directory matching `^(commands|templates|scripts|claude)/` is referenced by exactly one `copyOperation`.
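
The dimension DAG item in the list above amounts to a reference check plus a topological sort over `depends-on` edges. The sketch below is only an illustration of that idea, not the code in `tests/lint-prompts.test.js`: the function name `checkDimensionDag` and the input shape (`{ name, dependsOn }`) are assumptions made for the example.

```javascript
// Hypothetical helper (not from the AWOS repo).
// Assumed input: [{ name: 'a', dependsOn: ['b'] }, ...]
function checkDimensionDag(dimensions) {
  const names = new Set(dimensions.map((d) => d.name));
  const errors = [];

  // 1. Every depends-on entry must resolve to a known dimension name.
  for (const d of dimensions) {
    for (const dep of d.dependsOn) {
      if (!names.has(dep)) errors.push(`${d.name}: unknown dependency "${dep}"`);
    }
  }
  if (errors.length > 0) return { ok: false, errors, order: [] };

  // 2. Kahn-style topological sort: repeatedly emit every node whose
  //    dependencies have all been emitted. If no node is ready while
  //    some remain, the remainder contains a cycle.
  const remaining = new Map(dimensions.map((d) => [d.name, new Set(d.dependsOn)]));
  const order = [];
  while (remaining.size > 0) {
    const ready = [...remaining.keys()]
      .filter((name) => remaining.get(name).size === 0)
      .sort();
    if (ready.length === 0) {
      const stuck = [...remaining.keys()].sort().join(', ');
      return { ok: false, errors: [`cycle among: ${stuck}`], order };
    }
    for (const name of ready) {
      order.push(name);
      remaining.delete(name);
    }
    for (const deps of remaining.values()) {
      for (const name of ready) deps.delete(name);
    }
  }
  return { ok: true, errors: [], order };
}
```

A lint test built on something like this would presumably assert that `ok` is true and print `errors` otherwise, so a cycle or a dangling `depends-on` entry names the offending dimensions directly.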

Cost: ~30 ms. Catches roughly 80% of structural regressions on its own.
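
Most Layer 1 checks are cheap because they are plain string scans over prompt text. As an illustration, the slash-command cross-reference check could look roughly like the sketch below. The function name `findUnresolvedCommands` and the exact pattern for a command name are assumptions for this example; the real linter may tokenize differently.

```javascript
// Hypothetical helper (not from the AWOS repo).
// Returns every /awos:<name> mentioned in `text` that is not in `knownCommands`.
// Assumption: command names are lowercase letters, digits, and hyphens.
function findUnresolvedCommands(text, knownCommands) {
  const known = new Set(knownCommands);
  const mentioned = new Set();
  for (const match of text.matchAll(/\/awos:([a-z][a-z0-9-]*)/g)) {
    mentioned.add(match[1]);
  }
  return [...mentioned].filter((name) => !known.has(name)).sort();
}
```

In a real test, `knownCommands` would be derived from the file names under `commands/` (plus the plugin-provided `ai-readiness-audit`), and the assertion would be that the returned list is empty.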

## Layer 2: Installer unit tests

`tests/installer/*.test.js`. Exercises `src/services/file-copier.js`, `src/migrations/runner.js`, and `src/core/setup-orchestrator.js` against `fs.mkdtemp()` temp directories. Uses only Node built-ins and the public exports of the installer modules; no monkey-patching.

- **`file-copier.test.js`**
  - Fresh install lands every source file at its declared destination.
  - Synthetic `commands/synth-test.md` is auto-discovered (validates "no `setup-config.js` edit needed when adding files inside an existing tree").
  - Wrapper overwrite behavior pinned to current code (`.claude/commands/awos/*.md` _is_ overwritten on update). Comments in the test point at the open §11 docs-vs-code question; flip the assertion once that question is intentionally resolved.
  - Dry-run honesty: `dryRun: true` produces zero filesystem changes.
- **`migration-runner.test.js`**
  - Migration 001 is idempotent (run twice, second run is a no-op).
  - `skip_if_any` triggers on already-migrated state and reports `already_applied`.
  - Migration version meta-test: every JSON under `src/migrations/` has a unique version, and the versions form a sequence with no gaps.
  - Dry-run does not touch disk.
- **`setup-orchestrator.test.js`**
  - End-to-end `runSetup({ workingDir, packageRoot })` against a temp dir completes without throwing.
  - Re-running on an existing install leaves the on-disk result unchanged (idempotent).
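
The dry-run and idempotency items above both reduce to "snapshot the tree, run the operation, snapshot again, compare". A minimal sketch of that pattern follows, using only Node built-ins. `snapshotTree` is a hypothetical helper written for this example, and the installer call itself is left as a comment because its exact signature is not shown here.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');

// Hypothetical helper: maps each file under `root` (by relative path)
// to the SHA-256 of its content. Empty directories are not recorded,
// so this sketch would not notice a run that only creates empty dirs.
function snapshotTree(root) {
  const snapshot = {};
  const walk = (dir) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else if (entry.isFile()) {
        const rel = path.relative(root, full).split(path.sep).join('/');
        snapshot[rel] = crypto.createHash('sha256').update(fs.readFileSync(full)).digest('hex');
      }
    }
  };
  walk(root);
  return snapshot;
}

// Usage sketch: snapshot, run the operation under test, snapshot again.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'awos-dryrun-'));
fs.writeFileSync(path.join(dir, 'existing.md'), 'user content\n');
const before = snapshotTree(dir);
// ... the installer would be invoked here with dryRun: true (omitted) ...
const after = snapshotTree(dir);
fs.rmSync(dir, { recursive: true, force: true });
```

Comparing `before` and `after` with a deep-equality assertion gives a failure message that names the changed path, which tends to be more useful than a bare file count.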

Cost: ~50 ms.
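
The migration version meta-test is the least obvious of the Layer 2 checks, so here is a rough sketch of what "unique, no gaps" can mean in code. It assumes versions are positive integers that start at 1, which is an assumption of this example rather than something the migration files are shown to guarantee; `checkMigrationVersions` is a hypothetical name.

```javascript
// Hypothetical helper (not from the AWOS repo).
// Assumption: versions are integers and the first migration is version 1.
// Returns a list of human-readable problems; empty means the sequence is valid.
function checkMigrationVersions(versions) {
  const sorted = [...versions].sort((a, b) => a - b);
  const problems = [];

  // Duplicates: equal neighbours in the sorted list.
  sorted.forEach((version, i) => {
    if (i > 0 && version === sorted[i - 1]) problems.push(`duplicate version ${version}`);
  });

  // Gaps: the i-th distinct version must equal i + 1.
  const unique = [...new Set(sorted)];
  unique.forEach((version, i) => {
    if (version !== i + 1) problems.push(`expected version ${i + 1}, found ${version}`);
  });

  return problems;
}
```

The order in which migration files are read does not matter here, since the check sorts its input first.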

## Layer 3: Example fixture projects

`tests/fixtures.test.js` is a harness that runs once per directory under `tests/fixtures/`. For each fixture:

1. Make a fresh `fs.mkdtemp()` temp dir.
2. If the fixture has a `before/` subtree, copy it into the temp dir.
3. Run the real installer (`runSetup({ workingDir, packageRoot: repoRoot })`).
4. Load `expected-after.json` and assert the resulting tree matches the manifest.
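
Steps 1 and 2 above can be sketched with Node built-ins alone. The helper name `prepareWorkingDir` is hypothetical, and the example uses `fs.cpSync` as a stand-in for whatever copy helper the harness really uses; steps 3 and 4 are left as comments because they depend on the installer and manifest helpers.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: creates a temp working dir for one fixture and
// seeds it from the fixture's optional `before/` subtree.
function prepareWorkingDir(fixtureDir) {
  const workingDir = fs.mkdtempSync(path.join(os.tmpdir(), 'awos-fixture-'));
  const before = path.join(fixtureDir, 'before');
  if (fs.existsSync(before)) {
    // Copies the contents of before/ into the (already existing) temp dir.
    fs.cpSync(before, workingDir, { recursive: true });
  }
  return workingDir;
}

// Step 3 would run the installer against `workingDir`.
// Step 4 would load expected-after.json and compare it to the tree.
```

Keeping the seeding step separate from the installer call makes it easy to inspect the temp dir when a fixture fails.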

Each `expected-after.json` lists files with one or more of: `{ exists, sha256, contains, unchanged }`. Files not listed are not asserted; fixtures are deliberately selective.
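
To make the manifest idea concrete, here is a minimal sketch of a checker for three of the four keys. The manifest shape used below (a `files` object keyed by relative path, with `contains` as a single string) is an assumption for the example, `checkManifest` is a hypothetical name, and `unchanged` is left out because its exact semantics are not described here.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const crypto = require('node:crypto');

// Hypothetical checker (not tests/helpers/manifest.js).
// Assumed shape: { files: { 'relative/path.md': { exists, sha256, contains } } }
// Returns a list of failure messages; empty means the tree matches.
function checkManifest(root, manifest) {
  const failures = [];
  for (const [rel, expected] of Object.entries(manifest.files)) {
    const full = path.join(root, rel);
    const present = fs.existsSync(full);

    if (expected.exists === false) {
      if (present) failures.push(`${rel}: expected to be absent`);
      continue;
    }
    if (!present) {
      failures.push(`${rel}: expected to exist`);
      continue;
    }

    const content = fs.readFileSync(full);
    if (expected.sha256 !== undefined) {
      const actual = crypto.createHash('sha256').update(content).digest('hex');
      if (actual !== expected.sha256) failures.push(`${rel}: sha256 mismatch`);
    }
    if (expected.contains !== undefined && !content.toString('utf8').includes(expected.contains)) {
      failures.push(`${rel}: missing text ${JSON.stringify(expected.contains)}`);
    }
  }
  return failures;
}
```

Because only listed files are visited, an installer that writes extra files passes this check, which matches the "deliberately selective" intent described above.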

Currently shipped fixtures:

| Fixture               | Scenario                                                    | What it pins down                                                                                        |
| --------------------- | ----------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `fresh-project/`      | Empty project                                               | Full install layout: `.awos/commands/`, `.claude/commands/awos/`, `context/`, `.awos/.migration-version` |
| `existing-awos-v0/`   | Stale `.awos/commands/architecture.md` from a prior install | Framework internals always get the latest content (overwritten)                                          |
| `customized-wrapper/` | User-customized `.claude/commands/awos/architecture.md`     | Pins the current always-overwrite behavior; see the §11 open question in the plan                        |
| `mid-workflow/`       | Populated `context/spec/001-test-feature/*.md`              | Installer never touches user spec work                                                                   |
| `pre-migration-v1/`   | `.claude/agents/python-expert.md` at the pre-v1 path        | Migrations 001 + 002 land cleanly and the version file reads `2`                                         |

Adding a new fixture: create `tests/fixtures/<name>/`, optionally with a `before/` subtree, plus an `expected-after.json` manifest. The harness picks it up automatically.

Cost: ~65 ms for all five.

## Behavioral end-to-end tests live in the `awos-qa` repo

Static lint catches "prompt mentions X"; only running the real LLM catches "Claude actually did X". That second class of test lives in the separate **`awos-qa`** repository, sibling to this one. It drives a Claude Code session against a seeded scratch project and parses the resulting session log to assert on the tool-call trace.

It's intentionally a separate repo so that prompt-author iteration here doesn't pull in the behavioral-test surface area, and so that `awos-qa` can grow other test types (perf, evals, integration) without coupling them to AWOS's release cycle.

## Adding tests for new contracts

The rule (also in the root `CLAUDE.md`): **any PR that introduces a new structural contract must ship its test in the same PR.**

- New wrapper frontmatter key → add it to `tests/config/wrapper-schema.json`.
- New required marker in a prompt → add a `test('marker preserved', …)` to `tests/lint-prompts.test.js`.
- New migration in `src/migrations/` → add an idempotency + skip-semantics test to `tests/installer/migration-runner.test.js`. If user wrappers or agents are rewritten, add a fixture under `tests/fixtures/` that exercises a representative pre-migration tree.
- New copy operation in `src/config/setup-config.js` → the consistency check in Layer 1 will fail unless the matching source directory exists; the fixture suite picks up the new destination automatically once any fixture asserts a file under it.
- New audit dimension → Layer 1's DAG check picks it up automatically; just make sure the frontmatter is complete.
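
For the "new required marker" case in the list above, the body of such a test can be very small. The sketch below shows one possible shape, using the `**[Agent: ` marker as the example. `filesMissingMarker` is a hypothetical helper, and loading file contents is left to the caller.

```javascript
// The literal marker token used as the example here.
const AGENT_MARKER = '**[Agent: ';

// Hypothetical helper (not from the AWOS repo).
// `files` maps a file name to its text content; returns the names of
// files whose content does not include the marker.
function filesMissingMarker(files, marker = AGENT_MARKER) {
  return Object.entries(files)
    .filter(([, content]) => !content.includes(marker))
    .map(([name]) => name)
    .sort();
}
```

Inside a `node:test` case, the returned list would be compared to an empty array, so a failure message names exactly which prompt lost the marker.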

## Constraints (don't break these)

- **No npm dependencies.** AWOS's installer is dep-free for cross-runtime portability. Tests inherit that constraint.
- **Cross-runtime compatible.** The same files must run under both `node --test` and `bun test`. Avoid Node-only APIs that Bun lacks.
- **Tests assert today's code as truth.** If a test fails after a code change whose effect you didn't intend, fix the code, not the test. If you intentionally changed a contract, update the test in the same commit and explain why in the commit message.

tests/config/wrapper-schema.json

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
  "_comment": "Required and optional frontmatter keys for claude/commands/*.md wrappers. Move keys from `optional` to `required` when a new wrapper-frontmatter contract is introduced.",
  "required": ["description"],
  "optional": [
    "argument-hint",
    "disable-model-invocation",
    "allowed-tools",
    "model"
  ]
}
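
As a rough illustration of how a schema like the one above could be applied, the sketch below pairs a tiny frontmatter reader with a validator. Both functions are hypothetical and written for this example: the reader handles only flat `key: value` pairs, and rejecting keys outside `required` and `optional` is an assumption, since the schema's comment does not say whether unknown keys are an error.

```javascript
// Hypothetical minimal reader: flat `key: value` pairs between `---` lines.
// Nested YAML, lists, and multi-line values are out of scope for this sketch.
function parseFrontmatter(markdown) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(markdown);
  if (!match) return null;
  const data = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(':');
    if (idx === -1 || line.trim().startsWith('#')) continue;
    const key = line.slice(0, idx).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const value = line.slice(idx + 1).trim().replace(/^(['"])(.*)\1$/, '$2');
    data[key] = value;
  }
  return data;
}

// Hypothetical validator: `schema` has the { required, optional } shape above.
// Assumption: keys listed in neither array are reported as unknown.
function validateWrapper(markdown, schema) {
  const data = parseFrontmatter(markdown);
  if (data === null) return ['missing frontmatter'];
  const allowed = new Set([...schema.required, ...schema.optional]);
  const problems = [];
  for (const key of schema.required) {
    if (!(key in data) || data[key] === '') problems.push(`missing required key "${key}"`);
  }
  for (const key of Object.keys(data)) {
    if (!allowed.has(key)) problems.push(`unknown key "${key}"`);
  }
  return problems;
}
```

Keeping the key lists in JSON rather than in the test body means a new contract is a one-line data change, which is the workflow the `_comment` field describes.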
