Night Watch Vibe Test Runner

Cindy Zhang edited this page Jun 30, 2026 · 4 revisions

Night Watch: Vibe Test Runner Role

Assigned to: Cindy's Navi (cixzhang)

Goal: Run the nightly Astryx vibe test suite with fair, isolated environments, then file ONE per-night run issue with the scores + correctness failures.

Frequency: Once per night (~04:00 PDT).

Division of labor (read this first). The Runner is the dumb, reliable half: run the test, push the results branch, file the per-night run issue, hand off. It does NOT touch the wiki ledger and does NOT post to the API Concerns tracker (#3164). All of the stateful, tricky work — consuming the results, opening fix PRs, commenting #3164, appending the wiki ledger, and closing the per-night issue — belongs to the Vibe Test Debugger, which the Runner triggers as its final step. This split exists because the old "Runner appends the wiki itself" flow kept corrupting the ledger (blank-line-in-table bugs) and scattered insight-handling. Keep the Runner mechanical.


Architecture

NAVI (orchestrator)
  │
  ├─ Phase 0: Preflight — verify CLI + environment
  │   pnpm install, pnpm build (packages/cli)
  │   Verify npx astryx --help works in a test project dir
  │
  ├─ Phase 1: Setup + Generate code
  │   setup-nightly.mjs → creates per-agent isolated project dirs
  │   Spawn 40 sub-agents: 10 Astryx + 10 Astryx+TW + 10 baseline + 10 HTML
  │   Each agent works in its own project directory (no cross-contamination)
  │   Output: .tsx/.json in each agent's project dir
  │
  ├─ Phase 1.5: Collect results (collect-results.mjs)
  ├─ Phase 2: Build previews + tsc type-checking (build-previews.ts → build-errors.json)
  ├─ Phase 3: Evaluate (universal-aggregate.ts, universal-compare.ts)
  │
  ├─ Phase 4: Persist + File issue
  │   Deploy report to gh-pages
  │   Commit results to branch: vibe-test/nightly-YYYY-MM-DD (push)
  │   File ONE per-night run issue (label: vibe-test-nightly) with scores + correctness failures
  │   Record the issue number in state (vibeTestRunner.lastIssue)
  │
  ├─ Phase 5: Verify the issue was filed (read it back)
  │
  └─ Phase 6: Hand off to the Debugger (trigger it; safety-net cron backs this up)
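
The phase ordering above, together with the rules stated later on this page (a preflight failure stops the run, a failed hand-off trigger is only logged), could be sketched as a small sequential driver. This is an illustration only, not the Navi orchestrator: `Phase`, `RunLog`, `runPhases`, and the `fatal` flag are hypothetical names introduced here.

```typescript
// Hypothetical sketch: none of these names come from the Astryx repo.
type Phase = {
  name: string;
  fatal: boolean; // true: abort the run on failure; false: log and continue
  run: () => void;
};

type RunLog = { completed: string[]; errors: string[]; aborted: boolean };

function runPhases(phases: Phase[]): RunLog {
  const log: RunLog = { completed: [], errors: [], aborted: false };
  for (const phase of phases) {
    try {
      phase.run();
      log.completed.push(phase.name);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      log.errors.push(`${phase.name}: ${message}`);
      if (phase.fatal) {
        log.aborted = true; // e.g. preflight failed: STOP, do not proceed
        break;
      }
    }
  }
  return log;
}
```

Under this sketch, Phase 0 would be registered with `fatal: true` and Phase 6 with `fatal: false`, so a trigger error ends up in `errors` without marking the run as aborted.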

Checker Protocol

Before running, verify these 5 invariants (see internal/vibe-tests/README.md):

  1. Fair evaluators — same scoring logic across targets (target-aware counting is OK)
  2. Only the system varies — same prompts, no system-specific coaching rules
  3. Never leak the answer: expectedComponents never appears in agent prompts
  4. Representative environment — each agent sees what a real consumer would see
  5. Context-free agents — fresh spawn per prompt, no inherited knowledge

Agent Environments

Each agent gets an isolated project directory cloned from environments/project-{target}/:

| Target | Agent sees | Discovery path |
|--------|------------|----------------|
| Astryx | package.json + node_modules/@astryxdesign/core/ (symlinked to real source) + working CLI | ls → npx astryx --help or node_modules/@astryxdesign/core/README.md → component docs |
| Astryx+TW | Same as Astryx + Tailwind CSS available | Same discovery + Tailwind utility classes |
| Baseline | package.json + components/ui/*.tsx + lib/ + README.md | ls → README.md → real shadcn source |
| HTML | package.json only (bare React project) | No design system; plain HTML + inline CSS |

Agent prompts say: "Your project is at <path>. Explore it to find how to look up component docs." No README paths, no CLI commands, no component names are given.
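
To make the "no hints" rule concrete, here is a minimal sketch of a prompt builder plus a guard that flags discovery hints. Both helpers are hypothetical: the real prompt template lives in the harness and may differ, and the list of forbidden hints is whatever the caller chooses to pass in.

```typescript
// Hypothetical helper: builds the neutral prompt wording quoted above.
function buildAgentPrompt(projectPath: string, taskText: string): string {
  return [
    `Your project is at ${projectPath}. Explore it to find how to look up component docs.`,
    "",
    taskText,
  ].join("\n");
}

// Returns every hint that appears in the prompt (case-insensitive substring match).
// An empty result means none of the listed hints leaked.
function findDiscoveryHints(prompt: string, hints: string[]): string[] {
  const haystack = prompt.toLowerCase();
  return hints.filter((hint) => haystack.includes(hint.toLowerCase()));
}
```

A caller might pass README paths, CLI commands, and component names as `hints` and refuse to spawn any agent whose prompt returns a non-empty list.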


Nightly Checklist

Phase 0: Preflight (CRITICAL — do not skip)

cd /vercel/sandbox/repos/xds
git fetch origin main && git checkout origin/main
pnpm install
pnpm --filter @astryxdesign/cli build   # required for npx astryx to work in agent project dirs
# Verify the CLI binary exists and runs (npx astryx --help) before spawning agents.

If preflight fails: STOP. Do not proceed. Log the error in state and the daily note.

Phase 1: Setup + Generate Code

node internal/vibe-tests/src/setup-nightly.mjs --sample 10

Outputs 4 iteration IDs (astryx, astryx-tailwind, baseline, html), 40 task files, 40 isolated project dirs.

Before spawning, verify the checker protocol on the generated task files (same task text across targets, no expectedComponents, no system-specific coaching). Spawn 40 fresh, context-free sub-agents — each works in its own project dir and writes .tsx + .json.
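
The pre-spawn check described above can be sketched as a single pass over the generated task files. The `TaskFile` shape (`target`, `promptId`, `taskText`) is an assumption made for illustration; the files written by setup-nightly.mjs may use different field names.

```typescript
// Assumed task-file shape; extra fields are allowed so leaks can be detected.
type TaskFile = {
  target: string;
  promptId: string;
  taskText: string;
  [extra: string]: unknown;
};

function checkTaskFiles(tasks: TaskFile[]): string[] {
  const violations: string[] = [];
  const firstTextByPrompt = new Map<string, string>();
  for (const task of tasks) {
    // Invariant 3: the answer key must not reach the agent in any form.
    if (JSON.stringify(task).includes("expectedComponents")) {
      violations.push(`${task.target}/${task.promptId}: contains expectedComponents`);
    }
    // Invariant 2: every target gets the same task text for a given prompt.
    const first = firstTextByPrompt.get(task.promptId);
    if (first === undefined) {
      firstTextByPrompt.set(task.promptId, task.taskText);
    } else if (first !== task.taskText) {
      violations.push(`${task.target}/${task.promptId}: task text differs across targets`);
    }
  }
  return violations;
}
```

An empty return value means both checks passed. System-specific coaching is harder to detect mechanically and would still need a manual read of the task text.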

Phase 1.5 / 2 / 3: Collect, Build, Evaluate

cd internal/vibe-tests
node src/collect-results.mjs <each-iteration-id>            # expect 10 .tsx + 10 .json each
npx tsx src/build-previews.ts --iterations "<a>,<a-tw>,<b>,<h>"   # → build-errors.json (key debugging artifact)
npx tsx src/universal-aggregate.ts --iteration <each>
npx tsx src/universal-compare.ts --astryx <a> --baseline <b> --html <h>
npx tsx src/universal-compare.ts --astryx <a> --baseline <a-tw>

Read build-errors.json for each iteration and record total error count, clean vs erroring prompts, and the actual error messages (for the issue body).
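
As a sketch of that bookkeeping, the function below reduces one iteration's build errors to the three numbers the issue needs. The input shape (prompt ID mapped to a list of error messages, empty list meaning clean) is an assumption; the actual layout of build-errors.json is not documented on this page.

```typescript
// Assumed shape: prompt ID -> list of tsc error messages (empty list = clean).
type BuildErrors = Record<string, string[]>;

type BuildSummary = {
  totalErrors: number;
  cleanPrompts: string[];
  erroringPrompts: string[];
};

function summarizeBuildErrors(errors: BuildErrors): BuildSummary {
  const summary: BuildSummary = { totalErrors: 0, cleanPrompts: [], erroringPrompts: [] };
  for (const [promptId, messages] of Object.entries(errors)) {
    summary.totalErrors += messages.length;
    if (messages.length === 0) {
      summary.cleanPrompts.push(promptId);
    } else {
      summary.erroringPrompts.push(promptId);
    }
  }
  return summary;
}
```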

Phase 4: Persist + File the per-night run issue

Deploy report + push results branch:

npx tsx src/deploy-report.ts --iteration <a> --baseline <b> --html <h> --astryx-tailwind <a-tw>
cd /vercel/sandbox/repos/xds
git checkout -b vibe-test/nightly-$(date +%Y-%m-%d)
git add internal/vibe-tests/results/
git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
  commit -m "vibe-test: nightly results $(date +%Y-%m-%d)"
git push origin HEAD
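
The commands above evaluate `date` separately for the branch name and the commit message. If the run is ever scripted, one option is to derive both strings from a single timestamp so they cannot disagree. A minimal sketch, where `nightlyNames` is a hypothetical helper and not part of the harness:

```typescript
// Hypothetical helper: derives the dated names from one timestamp.
function nightlyNames(now: Date): { date: string; branch: string; commitMessage: string } {
  // Local calendar date, zero-padded, in the same form as `date +%Y-%m-%d`.
  const pad = (n: number): string => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}-${pad(now.getMonth() + 1)}-${pad(now.getDate())}`;
  return {
    date,
    branch: `vibe-test/nightly-${date}`,
    commitMessage: `vibe-test: nightly results ${date}`,
  };
}
```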

File ONE per-night run issue (this is the breadcrumb the Debugger consumes and then closes — NOT a permanent record). Use --body-file, label vibe-test-nightly, and a dated title:

gh issue create --repo facebook/astryx \
  --title "Vibe Test — nightly $(date +%Y-%m-%d)" \
  --label "vibe-test-nightly" \
  --body-file /tmp/vibe-issue-body.md

The issue body MUST contain:

  • The 4-target scores table (Overall + the 5 Astryx dimensions).
  • Winner.
  • Iteration IDs + the results branch link (vibe-test/nightly-<date>).
  • The Correctness Failures section (see template below) with per-prompt tsc errors — this is what the Debugger uses to classify CLI-doc vs setup vs API issues.
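
One way to assemble a body with those four parts is sketched below. The input types and field names are assumptions for illustration, and the dimension names are passed in by the caller because this page does not list them. The Root Cause column is left empty for the Debugger.

```typescript
// Assumed input shapes; field names are illustrative, not the harness's schema.
type TargetScore = { target: string; overall: number; dimensions: number[] };
type Failure = { promptId: string; target: string; errors: string[] };
type IssueInput = {
  scores: TargetScore[];
  dimensionNames: string[];
  winner: string;
  iterationIds: Record<string, string>;
  branch: string;
  failures: Failure[];
};

function renderIssueBody(input: IssueInput): string {
  const row = (cells: Array<string | number>): string => `| ${cells.join(" | ")} |`;
  const header = ["Target", "Overall", ...input.dimensionNames];
  const iterations = Object.entries(input.iterationIds)
    .map(([target, id]) => `${target}=${id}`)
    .join(", ");
  const lines: string[] = [
    "### Scores",
    "",
    row(header),
    row(header.map(() => "---")),
    ...input.scores.map((s) => row([s.target, s.overall, ...s.dimensions])),
    "",
    `**Winner:** ${input.winner}`,
    "",
    `**Iterations:** ${iterations}`,
    `**Results branch:** ${input.branch}`,
    "",
    "### Correctness Failures",
    "",
    row(["Prompt", "Target", "Errors", "Root Cause (Debugger fills/validates)"]),
    row(["---", "---", "---", "---"]),
    // Root cause stays blank here: the Debugger fills and validates it.
    ...input.failures.map((f) => row([f.promptId, f.target, f.errors.length, ""])),
  ];
  return lines.join("\n") + "\n";
}
```

The returned string would be written to the file passed to `--body-file`. The full per-prompt tsc output still needs to be appended in the details block shown in the template.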

Do NOT append to the wiki ledger and do NOT post to #3164. Those are the Debugger's job. The Runner only files the per-night issue.

Phase 5: Verify

Read the issue back (gh issue view <n> --repo facebook/astryx) and confirm the body posted correctly (not a @/tmp/... path). Record the issue number in memory/xds-night-watch-state.json under vibeTestRunner.lastIssue along with lastIterations, results, correctnessDetails, and a note.
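
The read-back check can be made mechanical. The sketch below assumes the body text has already been fetched (for example from the output of `gh issue view`) and that the caller supplies the snippets it expects to find; `checkPostedBody` is a hypothetical helper.

```typescript
// Hypothetical helper: returns a list of problems; an empty list means the body looks right.
function checkPostedBody(body: string, requiredSnippets: string[]): string[] {
  const problems: string[] = [];
  const trimmed = body.trim();
  // A body that is only "@/some/path" means the file reference was posted literally.
  if (/^@\S+$/.test(trimmed)) {
    problems.push(`body is a literal file reference: ${trimmed}`);
  }
  for (const snippet of requiredSnippets) {
    if (!body.includes(snippet)) {
      problems.push(`missing: ${snippet}`);
    }
  }
  return problems;
}
```

Reasonable snippets to require are the tonight's results branch name and the "Correctness Failures" heading, since both must appear in every per-night issue.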

Phase 6: Hand off to the Debugger

The Debugger does all downstream work (PRs, #3164, wiki ledger, closing this issue). Trigger it now so there's no 4 AM/6 AM race:

  1. Preferred (tight trigger): create a one-shot schedule that fires the Debugger ~immediately. Point it at the Debugger protocol (do not inline the protocol):
    • name: Vibe Test Debugger — triggered run
    • schedule: one-shot at ≈ now + 5 min, deleteAfterRun: true, silent: false
    • message: "Run the Night Watch Vibe Test Debugger for tonight's nightly. Read https://github.com/facebook/astryx/wiki/Night-Watch-Vibe-Test-Debugger and execute it against the per-night run issue #<N> (the one just filed) and branch vibe-test/nightly-<date>."
  2. Safety net (always present): the standing Vibe Test Debugger — Nightly Fix cron runs later the same morning and is idempotent — if tonight's per-night issue is already closed, it does nothing. So even if the tight trigger fails, the Debugger still runs.

If the trigger step itself errors, that is non-fatal — log it and rely on the safety-net cron.
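
The one-shot trigger described in step 1 can be sketched as a plain payload builder. The `OneShotSchedule` shape and the `runAt` field are assumptions: this page lists the field values but not the scheduler's actual API, so treat the type as a placeholder.

```typescript
// Hypothetical payload shape mirroring the fields listed in step 1.
type OneShotSchedule = {
  name: string;
  runAt: string; // ISO 8601 timestamp; the field name is an assumption
  deleteAfterRun: boolean;
  silent: boolean;
  message: string;
};

function buildDebuggerTrigger(issueNumber: number, date: string, now: Date): OneShotSchedule {
  const runAt = new Date(now.getTime() + 5 * 60 * 1000); // about now + 5 min
  return {
    name: "Vibe Test Debugger — triggered run",
    runAt: runAt.toISOString(),
    deleteAfterRun: true,
    silent: false,
    // Points at the Debugger protocol instead of inlining it.
    message: [
      "Run the Night Watch Vibe Test Debugger for tonight's nightly.",
      "Read https://github.com/facebook/astryx/wiki/Night-Watch-Vibe-Test-Debugger",
      `and execute it against the per-night run issue #${issueNumber} (the one just filed)`,
      `and branch vibe-test/nightly-${date}.`,
    ].join(" "),
  };
}
```

Because a failed trigger is non-fatal, a caller would wrap the scheduling call in a try/catch, log the error, and let the safety-net cron pick up the issue.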


Correctness Debugging Section (Issue Template)

When correctness < 100%, include this in the per-night issue:

### Correctness Failures

| Prompt | Target | Errors | Root Cause (Debugger fills/validates) |
|--------|--------|--------|---------------------------------------|
| fwc-6 | Astryx | 3 | Wrong prop type for `variant` — passed "primary", expects "filled" |
| sd-1 | Astryx | 5 | Used `Spinner` (doesn't exist), should be `ProgressCircle` |

<details>
<summary>Full tsc errors (Astryx iteration)</summary>

**fwc-6.tsx:**

fwc-6.tsx(12,5): error TS2322: Type '"primary"' is not assignable to type '"filled" | "outlined" | "ghost"'.

</details>

Known harness caveat: Astryx correctness is often depressed by TS2307 "cannot find module" errors when tsc can't resolve the symlinked @astryxdesign/core/* subpaths in the isolated project dirs. Note this in the issue so the Debugger doesn't chase module-resolution noise as a real API/CLI defect.
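
To report that caveat with numbers, the tsc output can be split into TS2307 module-resolution noise and everything else. The sketch below parses tsc's single-line error format as it appears in the template above; the function and type names are hypothetical.

```typescript
type TscError = { file: string; line: number; column: number; code: string; message: string };

// Matches tsc's single-line format: file(line,col): error TS1234: message
const TSC_LINE = /^(.+?)\((\d+),(\d+)\): error (TS\d+): (.*)$/;

function parseTscLine(text: string): TscError | null {
  const m = TSC_LINE.exec(text.trim());
  if (m === null) return null; // continuation lines and summaries are skipped
  return { file: m[1], line: Number(m[2]), column: Number(m[3]), code: m[4], message: m[5] };
}

// Separates TS2307 ("cannot find module") from errors worth the Debugger's time.
function splitResolutionNoise(lines: string[]): { noise: TscError[]; real: TscError[] } {
  const result: { noise: TscError[]; real: TscError[] } = { noise: [], real: [] };
  for (const text of lines) {
    const parsed = parseTscLine(text);
    if (parsed === null) continue;
    (parsed.code === "TS2307" ? result.noise : result.real).push(parsed);
  }
  return result;
}
```

Reporting the two counts side by side in the issue lets the Debugger see at a glance how much of a low correctness score is harness noise. Note that not every TS2307 is noise: an agent importing a module that truly does not exist produces the same code, so the split is a triage aid, not a verdict.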


Failure Modes & Recovery

| Failure | Recovery |
|---------|----------|
| CLI not built / npx astryx fails | pnpm --filter @astryxdesign/cli build. If it still fails, skip the Astryx iteration and report. |
| Agent didn't read docs (correctness tanks) | Check whether docsRead in the result .json is empty; if so, it is a discovery failure. |
| Symlinks broken | Re-run setup-nightly.mjs, which recreates the project dirs fresh. |
| Some agents timed out | Log missing prompts, proceed with partial results, note in the issue. |
| gh issue create body mismatch | Always use --body-file. Read the issue back to verify. |
| Results not in repo | Branch push failed: retry. Results MUST be persisted before filing the issue. |
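
For the timed-out-agents case, the missing prompts can be computed from the collected file names. This sketch assumes each prompt produces `<promptId>.tsx` and `<promptId>.json`, which matches the file name shown in the issue template but is not confirmed elsewhere on this page.

```typescript
// Hypothetical helper: a prompt counts as complete only when both files exist.
function findMissingPrompts(expectedIds: string[], fileNames: string[]): string[] {
  const present = new Set(fileNames);
  return expectedIds.filter(
    (id) => !present.has(`${id}.tsx`) || !present.has(`${id}.json`),
  );
}
```

The returned IDs are what would be logged and noted in the issue before proceeding with partial results.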

State

Track in memory/xds-night-watch-state.json under vibeTestRunner:

  • lastRun, lastIterations {astryx, "astryx-tailwind", baseline, html}
  • lastIssue: the per-night run issue number (MUST be set — the Debugger reads it)
  • results: per-target scores, correctnessDetails: {promptId: {errorCount, errors[]}}
  • note: human-readable summary
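
The state entry can be written down as a type, with a guard for the one field the Debugger depends on. The field types below (an ISO string for `lastRun`, an open record for `results`) are assumptions; only the field names come from the list above.

```typescript
// Field names follow the list above; the value types are assumptions.
type VibeTestRunnerState = {
  lastRun: string; // assumed ISO 8601 timestamp
  lastIterations: { astryx: string; "astryx-tailwind": string; baseline: string; html: string };
  lastIssue: number;
  results: Record<string, unknown>; // per-target scores
  correctnessDetails: Record<string, { errorCount: number; errors: string[] }>;
  note: string;
};

// Returns a copy of the state file contents with vibeTestRunner replaced.
function withRunnerState(
  state: Record<string, unknown>,
  runner: VibeTestRunnerState,
): Record<string, unknown> {
  if (!Number.isInteger(runner.lastIssue) || runner.lastIssue <= 0) {
    throw new Error("lastIssue MUST be set: the Debugger reads it");
  }
  return { ...state, vibeTestRunner: runner };
}
```

Other keys in the state file are preserved untouched, so the Runner does not clobber state owned by other Night Watch roles.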
