Night Watch Vibe Test Debugger

Cindy Zhang edited this page Jun 29, 2026 · 1 revision

Night Watch: Vibe Test Debugger Role

Assigned to: Cindy's Navi (cixzhang)

Goal: Consume the nightly vibe-test run, fix what's fixable, record the durable history.

Frequency: Once per night, AFTER the Vibe Test Runner files its per-night issue. Triggered by the Runner (Phase 6); a standing safety-net cron backs it up.

The Debugger is the brain of the pipeline. The Runner is mechanical: it runs the test, pushes the results branch, and files ONE per-night run issue (label vibe-test-nightly). The Debugger owns everything stateful and tricky:

  1. Analyze the per-night issue + results branch.
  2. Fix what's fixable (CLI docs / setup) via a PR — gated by backwards validation.
  3. Comment API concerns on the rolling aggregator #3164 (never a new issue).
  4. Append the scores to the wiki ledger Vibe-Test-Scores — correctly formatted.
  5. Close the per-night run issue once its content is absorbed into the wiki + #3164.

The per-night issue is a transient breadcrumb. The wiki ledger (numbers) and #3164 (API discussion) are the permanent record.


Core principle: every fix is gated by backwards validation

You must vibe-test BEFORE the change to prove the failure reproduces, then vibe-test AFTER the change to prove it's resolved. A fix that didn't reproduce beforehand, or didn't resolve afterward, does NOT ship. See the Backwards-Validation Protocol below — it is mandatory.


Phase 1: Locate tonight's run

  1. cd /vercel/sandbox/repos/xds && git fetch origin main && git checkout origin/main --detach
  2. Find the per-night run issue: gh issue list --repo facebook/astryx --label vibe-test-nightly --state open --limit 5.
    • The newest open one dated today is tonight's run. Read it: gh issue view <N> --repo facebook/astryx.
    • Cross-check memory/xds-night-watch-state.json (keys vibeTestRunner.lastIssue / lastIterations).
  3. Gate: If there is no open vibe-test-nightly issue dated today (e.g. the issue is already closed, or the Runner didn't run), the work is already done or not ready → NO_REPLY. This is what makes the safety-net cron idempotent: a closed issue means "already processed."
  4. Fetch the results branch the Runner pushed: git fetch origin vibe-test/nightly-<date> and read build-errors.json + universal.json for the Astryx and Astryx+TW iterations (per-prompt tsc errors + correctness failures). These are also summarized in the issue body.
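The gate in step 3 could be sketched as a small shell helper. This is a minimal sketch, not the pipeline's actual code: pick_tonight_issue is a hypothetical name, and it assumes that "dated today" can be read from the issue's createdAt timestamp in UTC, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo).
# Reads "number<TAB>createdAt" lines on stdin and prints the first issue
# number whose createdAt starts with the given YYYY-MM-DD date.
# Returns 1 and prints nothing when no line matches.
pick_tonight_issue() {
  local today="$1" number created
  while IFS=$'\t' read -r number created || [[ -n "$number" ]]; do
    if [[ "$created" == "$today"* ]]; then
      printf '%s\n' "$number"
      return 0
    fi
  done
  return 1
}

# Intended use (assumption: createdAt in UTC is what "dated today" means):
#   issue=$(gh issue list --repo facebook/astryx --label vibe-test-nightly \
#             --state open --limit 5 --json number,createdAt \
#             --jq '.[] | [.number, .createdAt] | @tsv' \
#           | pick_tonight_issue "$(date -u +%Y-%m-%d)") || echo "NO_REPLY"
```

An empty result maps to the NO_REPLY branch, so a later safety-net run finds nothing to do once the issue is closed.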

Phase 2: Classify each failure

For each correctness failure, classify as:

A) CLI Docs Issue — agent used wrong import paths/prop names because the CLI gave misleading info.

  • Verify: run npx astryx component <name> in the project dir; would the output mislead an agent?
  • Fix: edit the CLI doc source (packages/core/src/<Component>/<Component>.doc.mjs or packages/cli/src/).

B) Setup Fairness Issue — the test environment disadvantages one target.

  • Verify: compare internal/vibe-tests/environments/project-*/README.md across targets.
  • Fix: update the environment README / setup scripts so ALL targets get equivalent guidance.

C) API Issue — the component's actual TS types don't match reasonable developer expectations.

  • DO NOT fix code/types. Instead comment on #3164 (Phase 5) after backwards-validation reproduce.

Ignore harness noise: TS2307 "cannot find module" for symlinked @astryxdesign/core/* subpaths is a tsc resolution artifact in the isolated project dirs, not a real defect. Don't chase it as a CLI/API issue.
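One way to keep that noise out of the classification pass is to filter it before reading the errors. The sketch below assumes the errors are available as plain tsc-style text lines; the layout of build-errors.json is not shown on this page, and filter_harness_noise is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop TS2307 lines that mention an
# @astryxdesign/core/* subpath and pass every other line through unchanged.
# "|| true" keeps the exit status at 0 when every line was filtered out.
filter_harness_noise() {
  grep -Ev "TS2307.*@astryxdesign/core/" || true
}

# Example (the file name is illustrative):
#   filter_harness_noise < tsc-output.txt
```

TS2307 errors for any other module are kept, since those may point at a real wrong import path.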

Phase 3: Verify CLI issues, then prove before fixing

For each CLI-docs candidate: navigate to internal/vibe-tests/results/<iter>/projects/<prompt-id>/, run npx astryx component <Name>, and confirm the output is genuinely wrong/misleading (trace the .doc.mjs source). If the CLI output is actually correct, the agent error is variance — no fix.

Phase 3.5: Backwards Validation — REPRODUCE (required gate)

Before ANY change, prove the failure reproduces against clean (unfixed) origin/main. Run the Backwards-Validation Protocol in REPRODUCE mode for the exact failing prompt ID(s), persona naive, N=3 fresh context-free runs.

  • Reproduced (same error class in ≥2/3 runs) → real & fixable; capture before-evidence; go to Phase 4.
  • Not reproduced (≤1/3) → model variance; do NOT change anything; log it and move on.
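The reproduce gate above reduces to a single threshold on the number of runs (out of 3) that showed the same error class. A minimal sketch, with reproduce_verdict as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the number of runs (0-3) that showed the target
# error class to the Phase 3.5 outcome.
reproduce_verdict() {
  local hits="$1"
  if (( hits >= 2 )); then
    echo "reproduced"   # real and fixable: capture before-evidence, go to Phase 4
  else
    echo "variance"     # model variance: change nothing, log it, move on
  fi
}

# Example:
#   reproduce_verdict 2
```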

Phase 4: Make the fix (only if reproduced)

WORKTREE="/vercel/sandbox/repos/xds-worktrees/vibe-debugger-$(date +%Y%m%d)"
mkdir -p /vercel/sandbox/repos/xds-worktrees
cd /vercel/sandbox/repos/xds
git worktree add "$WORKTREE" origin/main --detach
cd "$WORKTREE"
git checkout -b navi/fix/vibe-test-$(date +%Y%m%d)

Apply CLI-doc or setup fixes only (never component source / types). Then pnpm install && pnpm --filter @astryxdesign/cli build so the rebuilt CLI is what the AFTER validation resolves, and run relevant tests.

Phase 4.5: Backwards Validation — VALIDATE (required gate)

From inside the fixed+rebuilt worktree, run the protocol in VALIDATE mode: SAME prompt ID(s), persona, N=3. Only your fix differs from the reproduce run.

  • Resolved (error class gone in 3/3) → ship.
  • Still appears → fix insufficient; iterate or back out and re-classify (may be an API issue → Phase 5). Never push a fix that fails this gate.

Phase 4.6: Commit, push, PR

export GIT_AUTHOR_NAME="Cindy Zhang" GIT_AUTHOR_EMAIL="cixzhang@users.noreply.github.com" \
       GIT_COMMITTER_NAME="Cindy Zhang" GIT_COMMITTER_EMAIL="cixzhang@users.noreply.github.com"
git add -A
git commit -m "fix(cli): correct <component> docs for agent discoverability"
git push -u origin navi/fix/vibe-test-$(date +%Y%m%d)
gh pr create --repo facebook/astryx --label documentation \
  --title "fix(cli): correct docs from vibe test $(date +%Y-%m-%d)" \
  --body-file /tmp/vibe-pr-body.md

PR body MUST include the backwards-validation evidence:

  • Before (reproduce, unfixed origin/main): reproduced in X/3 runs, plus the verbatim error and the before-iteration IDs.
  • After (validate, fixed worktree): resolved 3/3, plus the after-iteration IDs.
  • Confirmation that prompt IDs, persona, and N were the same in both runs, and only the fix changed.
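The required evidence could be written to /tmp/vibe-pr-body.md with a small template function. This is only a sketch: write_pr_evidence, its argument order, and the heading names inside the body are illustrative assumptions, not a format the repo prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the backwards-validation evidence to a file.
# Args: out_file repro_hits n prompt_ids before_iters after_iters verbatim_error
write_pr_evidence() {
  local out="$1" hits="$2" n="$3" prompts="$4" before="$5" after="$6" error="$7"
  cat > "$out" <<EOF
## Backwards validation

Prompt IDs: ${prompts} (persona naive, N=${n}; only the fix changed between runs)

### Before (reproduce, unfixed origin/main)
Reproduced in ${hits}/${n} runs. Iterations: ${before}

    ${error}

### After (validate, fixed worktree)
Resolved ${n}/${n}. Iterations: ${after}
EOF
}

# Example (all values are placeholders):
#   write_pr_evidence /tmp/vibe-pr-body.md 3 3 "<id1,id2>" "<before-iters>" "<after-iters>" "<verbatim error>"
```

Call it only after the VALIDATE gate has passed, since the "Resolved" line assumes the error was gone in every run.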

Phase 5: API concerns → comment the rolling tracker #3164

API concerns get backwards-validation REPRODUCE only (N=3, naive) — no VALIDATE (you change nothing). Read the existing #3164 thread + tracking table first; if the concern is already tracked, note the NEW recurrence (which nightly, repro count) so the 2+-nightly signal accumulates in one place. Then gh issue comment 3164 --body-file ... with: component, the wrong-vs-expected API, classification (API candidate — not fixing in CLI docs), reproduction (X/3, prompt id, iterations), recurrence (list of nightly dates), and a validation checklist. If reproduce is ≤1/3, comment "API test fail — not reproducible" and do not escalate. Never open a new issue; never change API/types based on a single nightly.

Phase 6: Append the wiki ledger (the Debugger owns this now)

Append tonight's scores to Vibe-Test-Scores.

TOKEN=$(gh auth token)
rm -rf /tmp/astryx-wiki  # clear any stale clone left by a previous night
git clone "https://x-access-token:${TOKEN}@github.com/facebook/astryx.wiki.git" /tmp/astryx-wiki
cd /tmp/astryx-wiki

Append ONE row to the ## Overall table and ONE row to the ## Astryx dimension breakdown table.

⚠️ Formatting rule — do NOT break the table (this is the bug that caused the rework)

GitHub-flavored Markdown ends a table at the first blank line. A new row must be appended immediately after the last existing row, with NO blank line between rows. Likewise, never leave a row directly adjacent to a following ## heading without a blank line BEFORE the heading. Concretely:

  • The new row goes on the line directly below the current last data row (no gap).
  • There must be exactly ONE blank line between the last row of a table and the next ## section.
  • For any target not run, use the same placeholder that the existing rows use. Keep the 0–100 scale. Do not reformat existing rows.
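Appending by hand is where the blank-line mistake tends to creep in, so the placement rules above could be applied mechanically. The sketch below inserts a row directly after the last "|" row under a given heading and leaves every other line untouched. append_row is a hypothetical helper, and the row contents in the example are placeholders, since the ledger's real column layout is not shown on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append ROW directly below the last table row that
# follows HEADING (an exact line such as "## Overall") in FILE.
# Returns 1 and leaves FILE unchanged if no table row is found there.
# Note: awk -v interprets backslash escapes, so ROW should not contain "\".
append_row() {
  local file="$1" heading="$2" row="$3" tmp rc
  tmp=$(mktemp)
  if awk -v heading="$heading" -v row="$row" '
       { lines[NR] = $0 }
       $0 == heading        { in_section = 1; next }
       in_section && /^## / { in_section = 0 }
       in_section && /^\|/  { last = NR }
       END {
         for (i = 1; i <= NR; i++) {
           print lines[i]
           if (i == last) print row
         }
         exit (last ? 0 : 1)
       }
     ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place, keeping the file mode
    rc=0
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}

# Example (placeholder row):
#   append_row Vibe-Test-Scores.md "## Overall" "| <date> | <scores...> | [results](<url>) |"
```

Because the new row is printed immediately after the last existing row, the blank line that already separates the table from the next ## section stays where it is.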

After editing, sanity-check: grep -n "^$" Vibe-Test-Scores.md should show NO blank line sitting between two | ... | rows. Render-verify by re-reading the file: every dated row must sit under a header+separator with no intervening blank line.
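The plain grep lists every blank line, which still leaves the "is it between two rows?" judgment to the reader. A narrower check, sketched here with the hypothetical helper find_table_gaps, prints only the blank lines that sit between two lines starting with "|". It assumes tables in the ledger are always separated by a ## heading; two tables separated by blank lines alone would be reported as a gap.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the line number of the first blank line in
# every run of blank lines that is sandwiched between two "|" rows.
# No output means no table in FILE is split by a blank line.
find_table_gaps() {
  awk '
    /^\|/            { if (after_row && gap) print gap; after_row = 1; gap = 0; next }
    /^[[:space:]]*$/ { if (after_row && !gap) gap = NR; next }
                     { after_row = 0; gap = 0 }
  ' "$1"
}

# Example:
#   gaps=$(find_table_gaps Vibe-Test-Scores.md)
#   [ -z "$gaps" ] || echo "table broken at line(s): $gaps"
```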

Link the row's "Run" cell to the results branch: [results](https://github.com/facebook/astryx/tree/vibe-test/nightly-<date>).

git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
  commit -am "vibe-test: scores $(date +%Y-%m-%d)"
git push origin master
# Verify: curl -s -o /dev/null -w "%{http_code}" https://github.com/facebook/astryx/wiki/Vibe-Test-Scores  → 200

Phase 7: Close the per-night run issue

Once the scores are on the wiki and any API concerns are on #3164, the per-night issue has served its purpose. Close it with a comment that links where its content went:

gh issue comment <N> --repo facebook/astryx --body "Processed by the Vibe Test Debugger.
- Scores → wiki ledger: https://github.com/facebook/astryx/wiki/Vibe-Test-Scores
- API concerns (if any) → #3164
- Fix PR (if any): #<PR>
Closing — the wiki ledger and #3164 are the permanent record."
gh issue close <N> --repo facebook/astryx

Closing the issue is also the idempotency signal: if the safety-net cron fires later and finds tonight's vibe-test-nightly issue already closed, it stops (NO_REPLY). Never close an issue before its scores are on the wiki.
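The "scores first, then close" ordering could be enforced with a guard in front of the close command. This sketch assumes that each ledger row contains the run date as YYYY-MM-DD (for example inside the results link); ledger_has_row is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if some table row ("|" line) in FILE
# contains DATE as a literal substring.
ledger_has_row() {
  awk -v d="$2" '/^\|/ && index($0, d) { found = 1 } END { exit !found }' "$1"
}

# Intended use (<N> is tonight's per-night issue number):
#   ledger_has_row /tmp/astryx-wiki/Vibe-Test-Scores.md "$(date +%Y-%m-%d)" \
#     && gh issue close <N> --repo facebook/astryx
```

If the guard fails, the issue stays open, so the safety-net cron can still pick the run up later.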

Phase 8: Report

  • Fixes made → announce the PR number, what was fixed, before/after counts (e.g. "reproduced 3/3, resolved 3/3").
  • API comments only → what was commented on #3164 + reproduction count.
  • Investigated but not reproducible → note as model variance (no PR).
  • Always note: wiki ledger updated + per-night issue # closed.
  • If the per-night issue was already closed / not present → NO_REPLY.

Backwards-Validation Protocol (MUST FOLLOW)

This is the focused, reproducible vibe test you run twice: once to prove the bug (REPRODUCE, before the fix) and once to prove the fix (VALIDATE, after the fix). Goal: MAXIMUM CONFIDENCE that the change, and only the change, fixed the failure.

The one rule that makes this trustworthy: reproduce and validate must be IDENTICAL in every way except the fix. That means the same prompt IDs, same target, same persona (naive), same N, same fresh-agent procedure, and same scoring. The before run uses clean, unfixed origin/main; the after run uses the fixed+rebuilt worktree. Change two things at once and the result is noise.

Steps (on the xds node):

  1. Pick the EXACT failing prompt ID(s) from build-errors.json/universal.json. Same IDs both phases. Never substitute a "similar" prompt.
  2. Generate a fresh focused iteration for those prompts, persona naive:
    # REPRODUCE: from clean origin/main.  VALIDATE: from the fixed worktree.
    pnpm -F @astryxdesign/vibe-tests interactive --target astryx --persona naive --prompts <id1,id2> --label "<repro|validate>-<date>"
  3. Run N=3 independent reps (recommended: three separate iterations so scoring stays clean).
  4. Each rep is a FRESH, CONTEXT-FREE spawn_agent per (prompt × rep). The sub-agent gets ONLY the generated task prompt — nothing about the bug, the fix, the expected components, or which prop is "right." Never reuse an agent; never leak this session's reasoning. (Invariants #3 and #5.)
  5. Score each rep the same way: pnpm -F @astryxdesign/vibe-tests universal --iteration <iter>, then match the SAME error signature (same TS code, same missing/extra prop, same wrong import) — not just "did it fail."
  6. Thresholds: REPRODUCE pass = target error in ≥2/3; VALIDATE pass = target error in 0/3.
  7. Record evidence: before iteration IDs + verbatim error; after iteration IDs + confirmation gone.
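Steps 5 and 6 hinge on counting how many reps show the same error signature, not how many reps failed for any reason. A minimal sketch of that count, assuming each rep's scoring output has been saved to its own text file (the file names and the helper name count_signature_hits are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many of the given rep files contain
# SIGNATURE as a fixed string (no regex interpretation).
count_signature_hits() {
  local signature="$1" hits=0 file
  shift
  for file in "$@"; do
    if grep -qF -- "$signature" "$file"; then
      hits=$((hits + 1))
    fi
  done
  echo "$hits"
}

# Example (placeholder signature and paths):
#   count_signature_hits "<verbatim error signature>" rep1.txt rep2.txt rep3.txt
```

Matching on a fixed string keeps an unrelated failure in one rep from being counted as a reproduction of the target error.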

Confidence checklist (all true before you trust a result):

  • Same prompt ID(s), persona (naive), target in both phases
  • N=3 fresh, context-free agents per phase (no reuse, no leaked answer)
  • Before on unfixed origin/main; after on fixed+rebuilt worktree
  • Only the fix differs between phases
  • Error matched by specific signature, not generic pass/fail
  • Reproduced ≥2/3 before; resolved 0/3 after
  • Iteration IDs + verbatim errors captured as evidence

Checker Protocol (MUST FOLLOW)

Re-read internal/vibe-tests/README.md "Checker Protocol" before changes. Never leak expected components into agent-visible docs (reproduce AND validate runs). Setup changes must keep fairness across ALL targets. CLI doc fixes must reflect ACTUAL TypeScript types, not aspirational APIs. Reproduce/validate sub-agents must be fresh and context-free.

Hard Rules

  • NEVER modify component source (packages/core/src/*) or TypeScript types.
  • ONLY fix: CLI .doc.mjs files, environment READMEs, or setup scripts.
  • API concerns: COMMENT on #3164 — don't fix, don't open a new issue.
  • Scores live on the wiki ledger — never file a per-night SCORES issue (the Runner's per-night RUN issue is a different, transient breadcrumb that you CLOSE after processing).
  • Append the wiki ledger WITHOUT breaking table formatting (no blank line between rows; see Phase 6).
  • NEVER ship a fix that didn't pass backwards validation: reproduce ≥2/3 BEFORE, resolve 0/3 AFTER.
  • Public repo: No AI attribution, no co-authored-by, no navibot URLs. Commits use Cindy's git author.
  • If no per-night issue is open for today (already closed / not run), do nothing (NO_REPLY).
