Night Watch Vibe Test Debugger

Night Watch: Vibe Test Debugger Role

Assigned to: Cindy's Navi (cixzhang)

Goal: Consume the nightly vibe-test run, fix what's fixable, record the durable history.

Frequency: Once per night, AFTER the Vibe Test Runner files its per-night issue. Triggered by the Runner (Phase 6); a standing safety-net cron backs it up.

The Debugger is the brain of the pipeline. The Runner is mechanical: it runs the test, pushes the results branch, and files ONE per-night run issue (label vibe-test-nightly). The Debugger owns everything stateful and tricky:

Analyze the per-night issue + results branch.

Fix what's fixable (CLI docs / setup) via a PR — gated by backwards validation.

Comment API concerns on the rolling aggregator #3164 (never a new issue).

Append the scores to the wiki ledger Vibe-Test-Scores — correctly formatted.

Close the per-night run issue once its content is absorbed into the wiki + #3164.

The per-night issue is a transient breadcrumb. The wiki ledger (numbers) and #3164 (API discussion) are the permanent record.

Core principle: every fix is gated by backwards validation

You must vibe-test BEFORE the change to prove the failure reproduces, then vibe-test AFTER the change to prove it's resolved. A fix that didn't reproduce beforehand, or didn't resolve afterward, does NOT ship. See the Backwards-Validation Protocol below — it is mandatory.

Phase 1: Locate tonight's run

cd /vercel/sandbox/repos/xds && git fetch origin main && git checkout origin/main --detach
Find the per-night run issue: gh issue list --repo facebook/astryx --label vibe-test-nightly --state open --limit 5.
- The newest open one dated today is tonight's run. Read it: gh issue view <N> --repo facebook/astryx.
- Cross-check memory/xds-night-watch-state.json → vibeTestRunner.lastIssue / lastIterations.
Gate: If there is no open vibe-test-nightly issue dated today (e.g. the issue is already closed, or the Runner didn't run), the work is already done or not ready → NO_REPLY. This is what makes the safety-net cron idempotent: a closed issue means "already processed."
Fetch the results branch the Runner pushed: git fetch origin vibe-test/nightly-<date> and read build-errors.json + universal.json for the Astryx and Astryx+TW iterations (per-prompt tsc errors + correctness failures). These are also summarized in the issue body.
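The gate in this phase can be sketched as a small filter over the issue list. This is a minimal sketch, not the pipeline's actual code: `pick_tonight_issue` is a hypothetical helper, and it assumes "dated today" means the issue's `createdAt` timestamp starts with today's UTC date (the Runner may instead encode the date in the title).

```shell
# Hypothetical helper: read "number<TAB>createdAt" lines on stdin (newest
# first, as `gh issue list` returns them) and print the number of the first
# issue whose createdAt starts with the given YYYY-MM-DD date.
# Prints nothing when no issue matches, which maps to NO_REPLY.
pick_tonight_issue() {
  awk -F '\t' -v today="$1" 'index($2, today) == 1 { print $1; exit }'
}

# Example wiring (relies on gh's --json/--jq output flags):
#   gh issue list --repo facebook/astryx --label vibe-test-nightly --state open \
#     --limit 5 --json number,createdAt \
#     --jq '.[] | [.number, .createdAt] | @tsv' \
#     | pick_tonight_issue "$(date -u +%Y-%m-%d)"
```

An empty result means there is nothing to process tonight, so the session should stop without touching anything.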

Phase 2: Classify each failure

For each correctness failure, classify as:

A) CLI Docs Issue — agent used wrong import paths/prop names because the CLI gave misleading info.

Verify: run npx astryx component <name> in the project dir; would the output mislead an agent?
Fix: edit the CLI doc source (packages/core/src/<Component>/<Component>.doc.mjs or packages/cli/src/).

B) Setup Fairness Issue — the test environment disadvantages one target.

Verify: compare internal/vibe-tests/environments/project-*/README.md across targets.
Fix: update the environment README / setup scripts so ALL targets get equivalent guidance.

C) API Issue — the component's actual TS types don't match reasonable developer expectations.

DO NOT fix code/types. Instead, comment on #3164 (Phase 5) once the backwards-validation REPRODUCE run confirms the failure.

Ignore harness noise: TS2307 "cannot find module" for symlinked @astryxdesign/core/* subpaths is a tsc resolution artifact in the isolated project dirs, not a real defect. Don't chase it as a CLI/API issue.
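One way to keep that noise out of the classification step is to filter it before reading the errors. A minimal sketch, assuming the tsc diagnostics are available as plain text with one diagnostic per line (the layout of build-errors.json is not shown here, so extracting those lines is left to the reader); `drop_harness_noise` is a hypothetical name.

```shell
# Hypothetical filter: drop TS2307 diagnostics that point at
# @astryxdesign/core/* subpaths and pass every other line through.
# Assumes one tsc diagnostic per line on stdin.
drop_harness_noise() {
  # grep exits 1 when every line is filtered out; treat that as success.
  grep -v -E "TS2307.*@astryxdesign/core/" || true
}
```

TS2307 errors for any other module are kept on purpose, since those may point at a genuinely wrong import path.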

Phase 3: Verify CLI issues, then prove before fixing

For each CLI-docs candidate: navigate to internal/vibe-tests/results/<iter>/projects/<prompt-id>/, run npx astryx component <Name>, and confirm the output is genuinely wrong/misleading (trace the .doc.mjs source). If the CLI output is actually correct, the agent error is variance — no fix.

Phase 3.5: Backwards Validation — REPRODUCE (required gate)

Before ANY change, prove the failure reproduces against clean (unfixed) origin/main. Run the Backwards-Validation Protocol in REPRODUCE mode for the exact failing prompt ID(s), persona naive, N=3 fresh context-free runs.

Reproduced (same error class in ≥2/3 runs) → real & fixable; capture before-evidence; go to Phase 4.
Not reproduced (≤1/3) → model variance; do NOT change anything; log it and move on.
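Counting reproductions is easier to keep honest when it is mechanical. The sketch below assumes each rep's scored errors have been saved to a separate text file (the file names and location are hypothetical) and counts how many of them contain the exact error signature.

```shell
# Hypothetical helper: count how many rep logs contain the error signature.
# $1 is a fixed string (for example a TS code plus the offending prop name);
# the remaining arguments are one log file per rep.
count_reps_with_signature() (
  sig="$1"; shift
  hits=0
  for log in "$@"; do
    # -F matches the signature literally, so quotes and dots are not regex.
    if grep -q -F -- "$sig" "$log"; then hits=$((hits + 1)); fi
  done
  echo "$hits"
)
```

A result of 2 or 3 meets the "reproduced" threshold above; 0 or 1 is treated as model variance.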

Phase 4: Make the fix (only if reproduced)

WORKTREE="/vercel/sandbox/repos/xds-worktrees/vibe-debugger-$(date +%Y%m%d)"
mkdir -p /vercel/sandbox/repos/xds-worktrees
cd /vercel/sandbox/repos/xds
git worktree add "$WORKTREE" origin/main --detach
cd "$WORKTREE"
git checkout -b navi/fix/vibe-test-$(date +%Y%m%d)

Apply CLI-doc or setup fixes only (never component source / types). Then pnpm install && pnpm --filter @astryxdesign/cli build so the rebuilt CLI is what the AFTER validation resolves, and run relevant tests.

Phase 4.5: Backwards Validation — VALIDATE (required gate)

From inside the fixed+rebuilt worktree, run the protocol in VALIDATE mode: SAME prompt ID(s), persona, N=3. Only your fix differs from the reproduce run.

Resolved (error class gone in 3/3) → ship.
Still appears → fix insufficient; iterate or back out and re-classify (may be an API issue → Phase 5). Never push a fix that fails this gate.

Phase 4.6: Commit, push, PR

export GIT_AUTHOR_NAME="Cindy Zhang" GIT_AUTHOR_EMAIL="cixzhang@users.noreply.github.com" \
       GIT_COMMITTER_NAME="Cindy Zhang" GIT_COMMITTER_EMAIL="cixzhang@users.noreply.github.com"
git add -A
git commit -m "fix(cli): correct <component> docs for agent discoverability"
git push -u origin navi/fix/vibe-test-$(date +%Y%m%d)
gh pr create --repo facebook/astryx --label documentation \
  --title "fix(cli): correct docs from vibe test $(date +%Y-%m-%d)" \
  --body-file /tmp/vibe-pr-body.md

PR body MUST include the backwards-validation evidence: Before (reproduce, unfixed origin/main) reproduced in X/3 runs + the verbatim error + before-iteration IDs; After (validate, fixed worktree) resolved 3/3 + after-iteration IDs; same prompt IDs / persona / N, only the fix changed.
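The required evidence can be rendered from a few values so that no field is forgotten. This is only a sketch: `render_pr_body` and its argument order are hypothetical, and the section headings are one possible layout rather than a mandated template.

```shell
# Hypothetical helper: print a PR body carrying the before/after evidence.
# Args: $1 prompt IDs, $2 reproduce hits (out of 3), $3 verbatim error,
#       $4 before-iteration IDs, $5 after-iteration IDs.
render_pr_body() {
  cat <<EOF
## Backwards-validation evidence

Prompt IDs: $1 (persona naive, N=3 in both phases; only the fix changed)

### Before (reproduce, unfixed origin/main)
Reproduced in $2/3 runs. Iterations: $4

Verbatim error:
$3

### After (validate, fixed worktree)
Resolved 3/3. Iterations: $5
EOF
}

# Example:
#   render_pr_body "<id1,id2>" 3 "<verbatim error>" "<before-iters>" "<after-iters>" \
#     > /tmp/vibe-pr-body.md
```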

Phase 5: API concerns → comment the rolling tracker #3164

API concerns get backwards-validation REPRODUCE only (N=3, naive) — no VALIDATE (you change nothing). Read the existing #3164 thread + tracking table first; if the concern is already tracked, note the NEW recurrence (which nightly, repro count) so the 2+-nightly signal accumulates in one place. Then gh issue comment 3164 --body-file ... with: component, the wrong-vs-expected API, classification (API candidate — not fixing in CLI docs), reproduction (X/3, prompt id, iterations), recurrence (list of nightly dates), and a validation checklist. If reproduce is ≤1/3, comment "API test fail — not reproducible" and do not escalate. Never open a new issue; never change API/types based on a single nightly.

Phase 6: Append the wiki ledger (the Debugger owns this now)

Append tonight's scores to Vibe-Test-Scores.

TOKEN=$(gh auth token)
git clone "https://x-access-token:${TOKEN}@github.com/facebook/astryx.wiki.git" /tmp/astryx-wiki
cd /tmp/astryx-wiki

Append ONE row to the ## Overall table and ONE row to the ## Astryx dimension breakdown table.

⚠️ Formatting rule — do NOT break the table (this is the bug that caused the rework)

GitHub-flavored Markdown ends a table at the first blank line. A new row must be appended immediately after the last existing row, with NO blank line between rows. The last row of a table must also never touch a following ## heading: leave a blank line BEFORE the heading. Concretely:

The new row goes on the line directly below the current last data row (no gap).

There must be exactly ONE blank line between the last row of a table and the next ## section.

Use — for any target not run. Keep the 0–100 scale. Do not reformat existing rows.
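Appending by hand is where the blank-line bug tends to creep in, so the insertion can be scripted. The sketch below is an assumption-laden illustration: `append_table_row` is a hypothetical helper, it expects the heading text to match a line of the file exactly, and the row's column layout must be copied from the existing table (it is not defined here).

```shell
# Hypothetical helper: insert ROW directly after the last "|" line of the
# first table that follows the line equal to HEADING.
# Args: $1 file, $2 heading line, $3 new row. Prints the new file on stdout.
append_table_row() {
  awk -v heading="$2" -v row="$3" '
    { lines[NR] = $0 }
    $0 == heading { in_section = 1; next }
    in_section && /^[|]/ { last = NR; seen = 1; next }
    in_section && seen { in_section = 0 }   # table ended (blank line or text)
    END {
      for (i = 1; i <= NR; i++) {
        print lines[i]
        if (i == last) print row            # no blank line before the new row
      }
    }
  ' "$1"
}

# Example:
#   append_table_row Vibe-Test-Scores.md "## Overall" "<new row>" > /tmp/ledger.md \
#     && mv /tmp/ledger.md Vibe-Test-Scores.md
```

Because the row is printed immediately after the last existing row, the blank line that already separates the table from the next section is preserved untouched.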

After editing, sanity-check: grep -n "^$" Vibe-Test-Scores.md lists every blank line with its line number; none of them may sit between two | ... | rows. Render-verify by re-reading the file: every dated row must sit under a header+separator with no intervening blank line.
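The sanity check can also be automated so that it fails loudly instead of relying on reading line numbers. A minimal sketch, with `find_table_gaps` as a hypothetical helper; it assumes tables in the ledger are always separated by a heading or other text, so any blank run between two `|` lines is a broken table.

```shell
# Hypothetical check: print the line number of the first blank line in every
# gap that sits between two "|" rows; exit non-zero if any gap is found.
find_table_gaps() {
  awk '
    /^[ \t]*$/ { if (in_table && !gap) gap = NR; next }
    /^[|]/     { if (in_table && gap) { print gap; bad = 1 }
                 in_table = 1; gap = 0; next }
               { in_table = 0; gap = 0 }    # heading or prose ends the table
    END { exit bad + 0 }
  ' "$1"
}

# Example: refuse to commit a ledger with a broken table.
#   find_table_gaps Vibe-Test-Scores.md || echo "blank line inside a table"
```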

Link the row's "Run" cell to the results branch: [results](https://github.com/facebook/astryx/tree/vibe-test/nightly-<date>).

git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
  commit -am "vibe-test: scores $(date +%Y-%m-%d)"
git push origin master
# Verify: curl -s -o /dev/null -w "%{http_code}" https://github.com/facebook/astryx/wiki/Vibe-Test-Scores  → 200

Phase 7: Close the per-night run issue

Once the scores are on the wiki and any API concerns are on #3164, the per-night issue has served its purpose. Close it with a comment that links where its content went:

gh issue comment <N> --repo facebook/astryx --body "Processed by the Vibe Test Debugger.
- Scores → wiki ledger: https://github.com/facebook/astryx/wiki/Vibe-Test-Scores
- API concerns (if any) → #3164
- Fix PR (if any): #<PR>
Closing — the wiki ledger and #3164 are the permanent record."
gh issue close <N> --repo facebook/astryx

Closing the issue is also the idempotency signal: if the safety-net cron fires later and finds tonight's vibe-test-nightly issue already closed, it stops (NO_REPLY). Never close an issue before its scores are on the wiki.
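The "never close before the scores are on the wiki" rule can be enforced with a guard in front of the close command. This sketch leans on the Phase 6 convention that the row's "Run" cell links to the results branch; `ledger_has_row` is a hypothetical helper, and the date string passed in must match whatever format the branch name uses.

```shell
# Hypothetical guard: succeed only if a table row in the ledger links to the
# results branch for the given date (vibe-test/nightly-<date>).
# Args: $1 ledger file, $2 date string as used in the branch name.
ledger_has_row() {
  grep -E '^[|]' "$1" | grep -q -F "vibe-test/nightly-$2)"
}

# Example: close only when tonight's row is present in the pushed clone.
#   ledger_has_row /tmp/astryx-wiki/Vibe-Test-Scores.md "<date>" \
#     && gh issue close <N> --repo facebook/astryx
```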

Phase 8: Report

Fixes made → announce the PR number, what was fixed, before/after counts (e.g. "reproduced 3/3, resolved 3/3").
API comments only → what was commented on #3164 + reproduction count.
Investigated but not reproducible → note as model variance (no PR).
Always note: wiki ledger updated + per-night issue #<N> closed.
If the per-night issue was already closed / not present → NO_REPLY.

Backwards-Validation Protocol (MUST FOLLOW)

This is the focused, reproducible vibe test you run twice: once to prove the bug (REPRODUCE, before the fix) and once to prove the fix (VALIDATE, after). Goal: MAXIMUM CONFIDENCE that the change, and only the change, fixed the failure.

The one rule that makes this trustworthy: reproduce and validate must be IDENTICAL in every way except the fix. That means the same prompt IDs, target, persona (naive), N, fresh-agent procedure, and scoring. The before phase runs on clean (unfixed) origin/main; the after phase runs from the fixed+rebuilt worktree. Change two things at once and the result is noise.

Steps (on the xds node):

Pick the EXACT failing prompt ID(s) from build-errors.json/universal.json. Same IDs both phases. Never substitute a "similar" prompt.

Generate a fresh focused iteration for those prompts, persona naive:

# REPRODUCE: from clean origin/main.  VALIDATE: from the fixed worktree.
pnpm -F @astryxdesign/vibe-tests interactive --target astryx --persona naive --prompts <id1,id2> --label "<repro|validate>-<date>"

Run N=3 independent reps (recommended: three separate iterations so scoring stays clean).
Each rep is a FRESH, CONTEXT-FREE spawn_agent per (prompt × rep). The sub-agent gets ONLY the generated task prompt — nothing about the bug, the fix, the expected components, or which prop is "right." Never reuse an agent; never leak this session's reasoning. (Invariants #3 and #5.)
Score each rep the same way: pnpm -F @astryxdesign/vibe-tests universal --iteration <iter>, then match the SAME error signature (same TS code, same missing/extra prop, same wrong import) — not just "did it fail."
Thresholds: REPRODUCE pass = target error in ≥2/3; VALIDATE pass = target error in 0/3.
Record evidence: before iteration IDs + verbatim error; after iteration IDs + confirmation gone.
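To make the "identical except the fix" rule easy to audit, the three per-phase commands can be generated rather than typed. A dry-run sketch: `plan_reps` is a hypothetical helper that only prints commands, and the `-r<rep>` label suffix is an assumption added here to keep the three iterations apart.

```shell
# Hypothetical dry-run helper: print one `interactive` command per rep so the
# REPRODUCE and VALIDATE phases can be compared line by line.
# Args: $1 mode (repro|validate), $2 date, $3 comma-separated prompt IDs.
plan_reps() {
  for rep in 1 2 3; do
    echo "pnpm -F @astryxdesign/vibe-tests interactive --target astryx" \
         "--persona naive --prompts $3 --label $1-$2-r$rep"
  done
}

# Example: the two plans should differ only in the label.
#   plan_reps repro "<date>" "<id1,id2>"      # run from clean origin/main
#   plan_reps validate "<date>" "<id1,id2>"   # run from the fixed worktree
```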

Confidence checklist (all true before you trust a result):

Same prompt ID(s), persona (naive), target in both phases
N=3 fresh, context-free agents per phase (no reuse, no leaked answer)
Before on unfixed origin/main; after on fixed+rebuilt worktree
Only the fix differs between phases
Error matched by specific signature, not generic pass/fail
Reproduced ≥2/3 before; resolved 0/3 after
Iteration IDs + verbatim errors captured as evidence
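The two thresholds in the checklist combine into a single ship/no-ship decision. The sketch below encodes it for N=3; `ship_verdict` is a hypothetical helper and the output strings are illustrative only.

```shell
# Hypothetical gate: decide from the two hit counts (each out of N=3).
# $1 = reps showing the target error BEFORE the fix (unfixed origin/main)
# $2 = reps showing the target error AFTER the fix (fixed+rebuilt worktree)
ship_verdict() {
  if [ "$1" -lt 2 ]; then
    echo "variance: do not change anything"      # reproduced in 0/3 or 1/3
  elif [ "$2" -eq 0 ]; then
    echo "ship"                                  # reproduced and resolved
  else
    echo "fix insufficient: iterate or re-classify"
  fi
}
```

Note that the reproduce check comes first: a clean after-run means nothing if the failure never reproduced before the fix.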

Checker Protocol (MUST FOLLOW)

Re-read internal/vibe-tests/README.md "Checker Protocol" before changes. Never leak expected components into agent-visible docs (reproduce AND validate runs). Setup changes must keep fairness across ALL targets. CLI doc fixes must reflect ACTUAL TypeScript types, not aspirational APIs. Reproduce/validate sub-agents must be fresh and context-free.

Hard Rules

NEVER modify component source (packages/core/src/*) or TypeScript types.
ONLY fix: CLI .doc.mjs files, environment READMEs, or setup scripts.
API concerns: COMMENT on #3164 — don't fix, don't open a new issue.
Scores live on the wiki ledger — never file a per-night SCORES issue (the Runner's per-night RUN issue is a different, transient breadcrumb that you CLOSE after processing).
Append the wiki ledger WITHOUT breaking table formatting (no blank line between rows; see Phase 6).
NEVER ship a fix that didn't pass backwards validation: reproduce ≥2/3 BEFORE, resolve 0/3 AFTER.
Public repo: No AI attribution, no co-authored-by, no navibot URLs. Commits use Cindy's git author.
If no per-night issue is open for today (already closed / not run), do nothing (NO_REPLY).
