Night Watch Vibe Test Runner
Assigned to: Cindy's Navi (cixzhang)
Goal: Run the nightly Astryx vibe test suite with fair, isolated environments, then file ONE per-night run issue with the scores + correctness failures.
Frequency: Once per night (~04:00 PDT).
Division of labor (read this first). The Runner is the dumb, reliable half: run the test, push the results branch, file the per-night run issue, hand off. It does NOT touch the wiki ledger and does NOT post to the API Concerns tracker (#3164). All of the stateful, tricky work — consuming the results, opening fix PRs, commenting #3164, appending the wiki ledger, and closing the per-night issue — belongs to the Vibe Test Debugger, which the Runner triggers as its final step. This split exists because the old "Runner appends the wiki itself" flow kept corrupting the ledger (blank-line-in-table bugs) and scattered insight-handling. Keep the Runner mechanical.
NAVI (orchestrator)
│
├─ Phase 0: Preflight — verify CLI + environment
│ pnpm install, pnpm build (packages/cli)
│ Verify npx astryx --help works in a test project dir
│
├─ Phase 1: Setup + Generate code
│ setup-nightly.mjs → creates per-agent isolated project dirs
│ Spawn 40 sub-agents: 10 Astryx + 10 Astryx+TW + 10 baseline + 10 HTML
│ Each agent works in its own project directory (no cross-contamination)
│ Output: .tsx/.json in each agent's project dir
│
├─ Phase 1.5: Collect results (collect-results.mjs)
├─ Phase 2: Build previews + tsc type-checking (build-previews.ts → build-errors.json)
├─ Phase 3: Evaluate (universal-aggregate.ts, universal-compare.ts)
│
├─ Phase 4: Persist + File issue
│ Deploy report to gh-pages
│ Commit results to branch: vibe-test/nightly-YYYY-MM-DD (push)
│ File ONE per-night run issue (label: vibe-test-nightly) with scores + correctness failures
│ Record the issue number in state (vibeTestRunner.lastIssue)
│
├─ Phase 5: Verify the issue was filed (read it back)
│
└─ Phase 6: Hand off to the Debugger (trigger it; safety-net cron backs this up)
Before running, verify these 5 invariants (see internal/vibe-tests/README.md):
- Fair evaluators: same scoring logic across targets (target-aware counting is OK)
- Only the system varies: same prompts, no system-specific coaching rules
- Never leak the answer: `expectedComponents` never appears in agent prompts
- Representative environment: each agent sees what a real consumer would see
- Context-free agents: fresh spawn per prompt, no inherited knowledge
Each agent gets an isolated project directory cloned from environments/project-{target}/:
| Target | Agent sees | Discovery path |
|---|---|---|
| Astryx | `package.json` + `node_modules/@astryxdesign/core/` (symlinked to real source) + working CLI | `ls` → `npx astryx --help` or `node_modules/@astryxdesign/core/README.md` → component docs |
| Astryx+TW | Same as Astryx + Tailwind CSS available | Same discovery + Tailwind utility classes |
| Baseline | `package.json` + `components/ui/*.tsx` + `lib/` + `README.md` | `ls` → `README.md` → real shadcn source |
| HTML | `package.json` only (bare React project) | No design system: plain HTML + inline CSS |
Agent prompts say: "Your project is at <path>. Explore it to find how to look up component docs." No README paths, no CLI commands, no component names are given.
cd /vercel/sandbox/repos/xds
git fetch origin main && git checkout origin/main
pnpm install
pnpm --filter @astryxdesign/cli build # required for npx astryx to work in agent project dirs
# Verify the CLI binary exists and runs (npx astryx --help) before spawning agents.

If preflight fails: STOP. Do not proceed. Log the error in state and the daily note.
node internal/vibe-tests/src/setup-nightly.mjs --sample 10

Outputs 4 iteration IDs (astryx, astryx-tailwind, baseline, html), 40 task files, and 40 isolated project dirs.
Before spawning, verify the checker protocol on the generated task files (same task text across targets, no expectedComponents, no system-specific coaching). Spawn 40 fresh, context-free sub-agents — each works in its own project dir and writes .tsx + .json.
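The pre-spawn check above can be sketched as a pure function. This is a minimal sketch, not the harness's actual checker: the `TaskFile` shape (`promptId`, `target`, `projectDir`, `prompt`) and the `COACHING_HINTS` patterns are assumptions made for illustration. Only `expectedComponents` and the rule that prompts carry no README paths or CLI commands come from this page.

```typescript
// Hypothetical task-file shape; the real files from setup-nightly.mjs may differ.
interface TaskFile {
  promptId: string;   // e.g. "fwc-6"
  target: string;     // "astryx" | "astryx-tailwind" | "baseline" | "html"
  projectDir: string; // per-agent isolated directory
  prompt: string;     // full text handed to the sub-agent
}

// Illustrative patterns for system-specific coaching; extend for real prompts.
const COACHING_HINTS: RegExp[] = [/npx astryx/i, /@astryxdesign\//i, /README\.md/i];

function checkTaskFiles(tasks: TaskFile[]): string[] {
  const problems: string[] = [];
  const textsByPrompt = new Map<string, Set<string>>();

  for (const task of tasks) {
    const where = `${task.promptId}/${task.target}`;
    // The project path legitimately differs per agent, so mask it before checking.
    const normalized = task.prompt.split(task.projectDir).join("<path>");

    if (normalized.includes("expectedComponents")) {
      problems.push(`${where}: leaks expectedComponents`);
    }
    for (const hint of COACHING_HINTS) {
      if (hint.test(normalized)) problems.push(`${where}: coaching hint ${hint.source}`);
    }

    const texts = textsByPrompt.get(task.promptId) ?? new Set<string>();
    texts.add(normalized);
    textsByPrompt.set(task.promptId, texts);
  }

  // Same task text across targets: each prompt ID must map to exactly one text.
  textsByPrompt.forEach((texts, promptId) => {
    if (texts.size > 1) problems.push(`${promptId}: task text differs across targets`);
  });
  return problems;
}
```

An empty result means the generated task files pass; any entry is a reason to stop before spawning agents.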
cd internal/vibe-tests
node src/collect-results.mjs <each-iteration-id> # expect 10 .tsx + 10 .json each
npx tsx src/build-previews.ts --iterations "<a>,<a-tw>,<b>,<h>" # → build-errors.json (key debugging artifact)
npx tsx src/universal-aggregate.ts --iteration <each>
npx tsx src/universal-compare.ts --astryx <a> --baseline <b> --html <h>
npx tsx src/universal-compare.ts --astryx <a> --baseline <a-tw>

Read build-errors.json for each iteration and record the total error count, clean vs. erroring prompts, and the actual error messages (for the issue body).
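The bookkeeping described above might look like the following. The shape of build-errors.json is not documented on this page, so `BuildErrors` (prompt ID mapped to a list of tsc error lines) is an assumption, as is the function name; the `correctnessDetails` shape matches the `{promptId: {errorCount, errors[]}}` form recorded in state.

```typescript
// Assumed shape of build-errors.json for one iteration: prompt ID -> tsc error lines.
type BuildErrors = Record<string, string[]>;

interface CorrectnessSummary {
  totalErrors: number;
  cleanPrompts: string[];
  erroringPrompts: string[];
  correctnessDetails: Record<string, { errorCount: number; errors: string[] }>;
}

function summarizeBuildErrors(buildErrors: BuildErrors): CorrectnessSummary {
  const summary: CorrectnessSummary = {
    totalErrors: 0,
    cleanPrompts: [],
    erroringPrompts: [],
    correctnessDetails: {},
  };
  // Sort prompt IDs so the summary is stable from night to night.
  for (const promptId of Object.keys(buildErrors).sort()) {
    const errors = buildErrors[promptId] ?? [];
    if (errors.length === 0) {
      summary.cleanPrompts.push(promptId);
      continue;
    }
    summary.erroringPrompts.push(promptId);
    summary.totalErrors += errors.length;
    summary.correctnessDetails[promptId] = { errorCount: errors.length, errors };
  }
  return summary;
}
```

Running this once per iteration gives the three numbers the issue body needs, plus the raw messages for the collapsed tsc section.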
Deploy report + push results branch:
npx tsx src/deploy-report.ts --iteration <a> --baseline <b> --html <h> --astryx-tailwind <a-tw>
cd /vercel/sandbox/repos/xds
git checkout -b vibe-test/nightly-$(date +%Y-%m-%d)
git add internal/vibe-tests/results/
git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
commit -m "vibe-test: nightly results $(date +%Y-%m-%d)"
git push origin HEAD

File ONE per-night run issue (this is the breadcrumb the Debugger consumes and then closes, NOT a permanent record). Use --body-file, the label vibe-test-nightly, and a dated title:
gh issue create --repo facebook/astryx \
--title "Vibe Test — nightly $(date +%Y-%m-%d)" \
--label "vibe-test-nightly" \
  --body-file /tmp/vibe-issue-body.md

The issue body MUST contain:
- The 4-target scores table (Overall + the 5 Astryx dimensions).
- Winner.
- Iteration IDs + the results branch link (`vibe-test/nightly-<date>`).
- The Correctness Failures section (see template below) with per-prompt tsc errors. This is what the Debugger uses to classify CLI-doc vs setup vs API issues.
Do NOT append to the wiki ledger and do NOT post to #3164. Those are the Debugger's job. The Runner only files the per-night issue.
Read the issue back (gh issue view <n> --repo facebook/astryx) and confirm the body posted correctly (not a @/tmp/... path). Record the issue number in memory/xds-night-watch-state.json under vibeTestRunner.lastIssue along with lastIterations, results, correctnessDetails, and a note.
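The read-back check can be expressed as a small function over the body text returned by `gh issue view`. This is a sketch under assumptions: the function name and the exact set of checks are illustrative, and the iteration IDs are passed in because their format is not specified here.

```typescript
// Returns a list of problems; an empty list means the issue body looks correct.
function verifyIssueBody(body: string, date: string, iterationIds: string[]): string[] {
  const problems: string[] = [];
  const trimmed = body.trim();

  if (trimmed.length === 0) problems.push("body is empty");

  // The failure this page warns about: the body is a literal "@/tmp/..." path
  // instead of the contents of the file.
  if (/^@\/\S+$/.test(trimmed)) {
    problems.push("body is a literal @/path, not the file contents");
  }

  const branch = `vibe-test/nightly-${date}`;
  if (!trimmed.includes(branch)) problems.push(`missing results branch ${branch}`);

  for (const id of iterationIds) {
    if (!trimmed.includes(id)) problems.push(`missing iteration ID ${id}`);
  }
  return problems;
}
```

If the list is non-empty, fix the body before recording `lastIssue`, since the Debugger reads that issue as its input.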
The Debugger does all downstream work (PRs, #3164, wiki ledger, closing this issue). Trigger it now so there's no 4 AM/6 AM race:
- Preferred (tight trigger): create a one-shot schedule that fires the Debugger almost immediately. Point it at the Debugger protocol (do not inline the protocol):
  - name: `Vibe Test Debugger — triggered run`
  - schedule: one-shot, `at` ≈ now + 5 min, `deleteAfterRun: true`, `silent: false`
  - message: "Run the Night Watch Vibe Test Debugger for tonight's nightly. Read https://github.com/facebook/astryx/wiki/Night-Watch-Vibe-Test-Debugger and execute it against the per-night run issue #<N> (the one just filed) and branch `vibe-test/nightly-<date>`."
- Safety net (always present): the standing Vibe Test Debugger — Nightly Fix cron runs later the same morning and is idempotent: if tonight's per-night issue is already closed, it does nothing. So even if the tight trigger fails, the Debugger still runs.
If the trigger step itself errors, that is non-fatal — log it and rely on the safety-net cron.
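As a rough illustration of the tight trigger, the payload could be assembled as below. The scheduler's real API is not described on this page, so the `OneShotTrigger` shape (in particular the nested `schedule` object and the ISO timestamp) and the function name are hypothetical; the field names `name`, `deleteAfterRun`, `silent`, and `message`, the five-minute delay, and the message text are taken from the list above.

```typescript
// Hypothetical payload shape; adapt to whatever the scheduler actually accepts.
interface OneShotTrigger {
  name: string;
  schedule: { kind: "one-shot"; at: string }; // ISO 8601 timestamp (assumed format)
  deleteAfterRun: boolean;
  silent: boolean;
  message: string;
}

const DEBUGGER_PROTOCOL_URL =
  "https://github.com/facebook/astryx/wiki/Night-Watch-Vibe-Test-Debugger";

function buildDebuggerTrigger(issueNumber: number, date: string, now: Date): OneShotTrigger {
  // Fire roughly five minutes from now.
  const at = new Date(now.getTime() + 5 * 60 * 1000).toISOString();
  return {
    name: "Vibe Test Debugger — triggered run",
    schedule: { kind: "one-shot", at },
    deleteAfterRun: true,
    silent: false,
    // Point at the protocol page; do not inline the protocol itself.
    message:
      "Run the Night Watch Vibe Test Debugger for tonight's nightly. " +
      `Read ${DEBUGGER_PROTOCOL_URL} and execute it against the per-night run issue ` +
      `#${issueNumber} (the one just filed) and branch vibe-test/nightly-${date}.`,
  };
}
```

Passing `now` in explicitly keeps the function deterministic, which makes the five-minute offset easy to check.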
When correctness < 100%, include this in the per-night issue:
### Correctness Failures
| Prompt | Target | Errors | Root Cause (Debugger fills/validates) |
|--------|--------|--------|---------------------------------------|
| fwc-6 | Astryx | 3 | Wrong prop type for `variant` — passed "primary", expects "filled" |
| sd-1 | Astryx | 5 | Used `Spinner` (doesn't exist), should be `ProgressCircle` |
<details>
<summary>Full tsc errors (Astryx iteration)</summary>
**fwc-6.tsx:**
fwc-6.tsx(12,5): error TS2322: Type '"primary"' is not assignable to type '"filled" | "outlined" | "ghost"'.
</details>
Known harness caveat: Astryx correctness is often depressed by TS2307 "cannot find module" errors when tsc can't resolve the symlinked `@astryxdesign/core/*` subpaths in the isolated project dirs. Note this in the issue so the Debugger doesn't chase module-resolution noise as a real API/CLI defect.
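One way to produce the table and flag the TS2307 caveat automatically is sketched below. It assumes the `{promptId: {errorCount, errors[]}}` shape of `correctnessDetails`; the function name and the wording of the pre-filled root-cause note are illustrative, and the Debugger still fills in or validates that column.

```typescript
type CorrectnessDetails = Record<string, { errorCount: number; errors: string[] }>;

// Renders the "Correctness Failures" table for one target.
function renderCorrectnessFailures(target: string, details: CorrectnessDetails): string {
  const lines = [
    "### Correctness Failures",
    "| Prompt | Target | Errors | Root Cause (Debugger fills/validates) |",
    "|--------|--------|--------|---------------------------------------|",
  ];
  for (const promptId of Object.keys(details).sort()) {
    const entry = details[promptId];
    if (!entry) continue;
    // Count module-resolution errors so they are not mistaken for API defects.
    const moduleNoise = entry.errors.filter((e) => e.includes("TS2307")).length;
    let rootCause = "";
    if (moduleNoise > 0 && moduleNoise === entry.errors.length) {
      rootCause = "All TS2307 (likely harness module resolution)";
    } else if (moduleNoise > 0) {
      rootCause = `${moduleNoise} of ${entry.errorCount} are TS2307 (likely harness module resolution)`;
    }
    lines.push(`| ${promptId} | ${target} | ${entry.errorCount} | ${rootCause} |`);
  }
  return lines.join("\n");
}
```

Only counts go into the table cells, so the `|` characters that appear in tsc union-type messages cannot break the row layout; the full messages belong in the collapsed details section.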
| Failure | Recovery |
|---|---|
| CLI not built / `npx astryx` fails | Run `pnpm --filter @astryxdesign/cli build`. If it still fails, skip the Astryx iteration and report. |
| Agent didn't read docs (correctness tanks) | Check whether `docsRead` in the result .json is empty; if it is, discovery failed. |
| Symlinks broken | Re-run setup-nightly.mjs; it recreates the project dirs fresh. |
| Some agents timed out | Log the missing prompts, proceed with partial results, and note it in the issue. |
| `gh issue create` body mismatch | Always use `--body-file`. Read the issue back to verify. |
| Results not in repo | The branch push failed; retry. Results MUST be persisted before filing the issue. |
Track in memory/xds-night-watch-state.json under vibeTestRunner:
- `lastRun`, `lastIterations`: `{astryx, "astryx-tailwind", baseline, html}`
- `lastIssue`: the per-night run issue number (MUST be set; the Debugger reads it)
- `results`: per-target scores, `correctnessDetails`: `{promptId: {errorCount, errors[]}}`
- `note`: human-readable summary
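Before handing off, the state entry can be sanity-checked with a small validator. This is a sketch: the field names come from the list above, but the value types (a string timestamp for `lastRun`, string iteration IDs, a positive integer for `lastIssue`) are assumptions, and the function names are illustrative.

```typescript
const TARGETS = ["astryx", "astryx-tailwind", "baseline", "html"] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validates the vibeTestRunner entry; an empty result means it is safe to hand off.
function validateRunnerState(state: unknown): string[] {
  if (!isRecord(state)) return ["vibeTestRunner is not an object"];
  const problems: string[] = [];

  if (typeof state["lastRun"] !== "string") problems.push("lastRun missing");

  // The Debugger reads lastIssue, so treat a missing value as a hard error.
  const lastIssue = state["lastIssue"];
  if (typeof lastIssue !== "number" || !Number.isInteger(lastIssue) || lastIssue <= 0) {
    problems.push("lastIssue must be a positive issue number");
  }

  const iterations = state["lastIterations"];
  if (!isRecord(iterations)) {
    problems.push("lastIterations missing");
  } else {
    for (const target of TARGETS) {
      if (typeof iterations[target] !== "string") problems.push(`lastIterations.${target} missing`);
    }
  }

  if (!isRecord(state["results"])) problems.push("results missing");
  if (!isRecord(state["correctnessDetails"])) problems.push("correctnessDetails missing");
  if (typeof state["note"] !== "string") problems.push("note missing");
  return problems;
}
```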