Night Watch Vibe Test Debugger
Assigned to: Cindy's Navi (cixzhang)
Goal: Consume the nightly vibe-test run, fix what's fixable, record the durable history.
Frequency: Once per night, AFTER the Vibe Test Runner files its per-night issue. Triggered by the Runner (Phase 6); a standing safety-net cron backs it up.
The Debugger is the brain of the pipeline. The Runner is mechanical: it runs the test, pushes the results branch, and files ONE per-night run issue (label
vibe-test-nightly). The Debugger owns everything stateful and tricky:
- Analyze the per-night issue + results branch.
- Fix what's fixable (CLI docs / setup) via a PR — gated by backwards validation.
- Comment API concerns on the rolling aggregator #3164 (never a new issue).
- Append the scores to the wiki ledger Vibe-Test-Scores — correctly formatted.
- Close the per-night run issue once its content is absorbed into the wiki + #3164.
The per-night issue is a transient breadcrumb. The wiki ledger (numbers) and #3164 (API discussion) are the permanent record.
You must vibe-test BEFORE the change to prove the failure reproduces, then vibe-test AFTER the change to prove it's resolved. A fix that didn't reproduce beforehand, or didn't resolve afterward, does NOT ship. See the Backwards-Validation Protocol below — it is mandatory.
cd /vercel/sandbox/repos/xds && git fetch origin main && git checkout origin/main --detach
- Find the per-night run issue:
    gh issue list --repo facebook/astryx --label vibe-test-nightly --state open --limit 5
  - The newest open one dated today is tonight's run. Read it:
      gh issue view <N> --repo facebook/astryx
  - Cross-check memory/xds-night-watch-state.json → vibeTestRunner.lastIssue/lastIterations.
- Gate: If there is no open vibe-test-nightly issue dated today (e.g. the issue is already closed, or the Runner didn't run), the work is already done or not ready → NO_REPLY. This is what makes the safety-net cron idempotent: a closed issue means "already processed."
- Fetch the results branch the Runner pushed:
    git fetch origin vibe-test/nightly-<date>
  and read build-errors.json + universal.json for the Astryx and Astryx+TW iterations (per-prompt tsc errors + correctness failures). These are also summarized in the issue body.
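The gate above can be sketched as a small filter. This is a minimal sketch, not the pipeline's actual implementation: `pick_tonight_issue` is a hypothetical helper, the input is assumed to be tab-separated `number` and `createdAt` columns, and "dated today" is assumed to mean the issue's creation date (the source does not say whether the date comes from the title or the timestamp). The issue numbers and dates in the demo are made-up sample data.

```shell
# Hypothetical helper: reads "number<TAB>createdAt" lines on stdin and prints
# the newest (highest-numbered) issue created on the given day, or nothing.
pick_tonight_issue() {
  awk -F'\t' -v day="$1" '
    substr($2, 1, 10) == day && $1 + 0 > best { best = $1 + 0 }
    END { if (best > 0) print best }
  '
}

# Assumed way to produce the input (not run here):
#   gh issue list --repo facebook/astryx --label vibe-test-nightly --state open \
#     --limit 5 --json number,createdAt --jq '.[] | [.number, .createdAt] | @tsv'
issue=$(printf '3301\t2025-01-02T09:00:00Z\n3307\t2025-01-03T09:00:00Z\n' |
  pick_tonight_issue 2025-01-03)

if [ -z "$issue" ]; then
  echo "NO_REPLY"            # nothing open for today: already processed or not ready
else
  echo "process #$issue"     # prints "process #3307" for the sample input
fi
```

Keeping the decision in a pure filter means the safety-net cron and the Runner-triggered path can share one gate.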
For each correctness failure, classify as:
A) CLI Docs Issue — agent used wrong import paths/prop names because the CLI gave misleading info.
- Verify: run npx astryx component <name> in the project dir; would the output mislead an agent?
- Fix: edit the CLI doc source (packages/core/src/<Component>/<Component>.doc.mjs or packages/cli/src/).
B) Setup Fairness Issue — the test environment disadvantages one target.
- Verify: compare internal/vibe-tests/environments/project-*/README.md across targets.
- Fix: update the environment README / setup scripts so ALL targets get equivalent guidance.
C) API Issue — the component's actual TS types don't match reasonable developer expectations.
- DO NOT fix code/types. Instead, comment on #3164 (Phase 5) after the backwards-validation REPRODUCE run.
Ignore harness noise: TS2307 "cannot find module" for symlinked @astryxdesign/core/* subpaths is a tsc resolution artifact in the isolated project dirs, not a real defect. Don't chase it as a CLI/API issue.
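One way to keep the TS2307 artifact out of the classification step is to filter it before reading the errors. This is a sketch under assumptions: `drop_harness_noise` is a hypothetical helper, and it expects plain tsc-style error lines on stdin; the actual layout of build-errors.json is not shown here, so extracting those lines from it is left out. The sample errors are illustrative, not real run output.

```shell
# Hypothetical helper: drop the TS2307 resolution artifact for
# @astryxdesign/core/* subpaths from tsc-style error lines on stdin.
# "|| true" keeps the exit status clean when every line is filtered out.
drop_harness_noise() {
  grep -v "error TS2307: Cannot find module '@astryxdesign/core/" || true
}

# Sample input: one harness artifact, one real type error.
printf '%s\n' \
  "src/App.tsx(2,24): error TS2307: Cannot find module '@astryxdesign/core/Button' or its corresponding type declarations." \
  "src/App.tsx(9,13): error TS2322: Type 'string' is not assignable to type 'number'." |
  drop_harness_noise
# Only the TS2322 line is printed.
```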
For each CLI-docs candidate: navigate to internal/vibe-tests/results/<iter>/projects/<prompt-id>/,
run npx astryx component <Name>, and confirm the output is genuinely wrong/misleading (trace the
.doc.mjs source). If the CLI output is actually correct, the agent error is variance — no fix.
Before ANY change, prove the failure reproduces against clean (unfixed) origin/main. Run the
Backwards-Validation Protocol in REPRODUCE mode for the exact failing prompt ID(s), persona
naive, N=3 fresh context-free runs.
- Reproduced (same error class in ≥2/3 runs) → real & fixable; capture before-evidence; go to Phase 4.
- Not reproduced (≤1/3) → model variance; do NOT change anything; log it and move on.
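The reproduce threshold above reduces to a two-way decision on the hit count. A minimal sketch, assuming you have already counted how many of the 3 runs showed the same error class; `reproduce_verdict` is a hypothetical helper name.

```shell
# Hypothetical helper: map "runs that showed the same error class" (out of 3)
# to the reproduce decision.
reproduce_verdict() {
  case "$1" in
    2|3) echo "reproduced" ;;   # real and fixable: capture before-evidence, go to Phase 4
    0|1) echo "variance" ;;     # model variance: change nothing, log it, move on
    *)   echo "invalid hit count: $1" >&2; return 2 ;;
  esac
}

reproduce_verdict 2   # prints "reproduced"
reproduce_verdict 1   # prints "variance"
```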
WORKTREE="/vercel/sandbox/repos/xds-worktrees/vibe-debugger-$(date +%Y%m%d)"
mkdir -p /vercel/sandbox/repos/xds-worktrees
cd /vercel/sandbox/repos/xds
git worktree add "$WORKTREE" origin/main --detach
cd "$WORKTREE"
git checkout -b navi/fix/vibe-test-$(date +%Y%m%d)
Apply CLI-doc or setup fixes only (never component source / types). Then run
pnpm install && pnpm --filter @astryxdesign/cli build
so that the rebuilt CLI is what the AFTER validation resolves, and run the relevant tests.
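The worktree path and the branch name both embed $(date +%Y%m%d), and the branch name is computed again at push time. If a run crosses midnight, those calls can disagree. A possible hardening, offered as a sketch rather than the current procedure, is to compute the stamp once and derive both names from it; `worktree_for` and `branch_for` are hypothetical helper names.

```shell
# Hypothetical helpers: build both names from a single date stamp.
worktree_for() { echo "/vercel/sandbox/repos/xds-worktrees/vibe-debugger-$1"; }
branch_for()   { echo "navi/fix/vibe-test-$1"; }

STAMP=$(date +%Y%m%d)              # computed once per run
WORKTREE=$(worktree_for "$STAMP")
BRANCH=$(branch_for "$STAMP")
echo "worktree: $WORKTREE"
echo "branch:   $BRANCH"

# Later steps would then reuse "$BRANCH" instead of calling date again, e.g.
#   git checkout -b "$BRANCH"   and   git push -u origin "$BRANCH"
```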
From inside the fixed+rebuilt worktree, run the protocol in VALIDATE mode: SAME prompt ID(s), persona, N=3. Only your fix differs from the reproduce run.
- Resolved (error class gone in 3/3) → ship.
- Still appears → fix insufficient; iterate or back out and re-classify (may be an API issue → Phase 5). Never push a fix that fails this gate.
export GIT_AUTHOR_NAME="Cindy Zhang" GIT_AUTHOR_EMAIL="cixzhang@users.noreply.github.com" \
GIT_COMMITTER_NAME="Cindy Zhang" GIT_COMMITTER_EMAIL="cixzhang@users.noreply.github.com"
git add -A
git commit -m "fix(cli): correct <component> docs for agent discoverability"
git push -u origin navi/fix/vibe-test-$(date +%Y%m%d)
gh pr create --repo facebook/astryx --label documentation \
--title "fix(cli): correct docs from vibe test $(date +%Y-%m-%d)" \
  --body-file /tmp/vibe-pr-body.md
PR body MUST include the backwards-validation evidence:
- Before (reproduce, unfixed origin/main): reproduced in X/3 runs + the verbatim error + before-iteration IDs.
- After (validate, fixed worktree): resolved 3/3 + after-iteration IDs.
- Same prompt IDs / persona / N; only the fix changed.
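The required evidence can be written to the body file from a template so that no field is forgotten. This is a sketch only: `write_pr_body`, its argument order, and the section headings are assumptions, not an existing script or a mandated layout. "Resolved 3/3" is hard-coded because a fix that did not resolve 3/3 never reaches this step.

```shell
# Hypothetical helper: write the PR body with before/after evidence.
# Args: out_file prompt_ids before_hits before_iters after_iters error_text
write_pr_body() {
  out="$1"; prompts="$2"; before_hits="$3"
  before_iters="$4"; after_iters="$5"; error="$6"
  cat > "$out" <<EOF
## Backwards-validation evidence

Prompt IDs: ${prompts} (persona naive, N=3, only the fix changed)

### Before (reproduce, unfixed origin/main)
Reproduced in ${before_hits}/3 runs. Iterations: ${before_iters}

    ${error}

### After (validate, fixed worktree)
Resolved 3/3. Iterations: ${after_iters}
EOF
}

# Demo with placeholders; a real run would target /tmp/vibe-pr-body.md.
body=$(mktemp)
write_pr_body "$body" "<id1,id2>" 3 "<before-iteration-ids>" "<after-iteration-ids>" "<verbatim error>"
cat "$body"
```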
API concerns get backwards-validation REPRODUCE only (N=3, naive); there is no VALIDATE because you change nothing. Read the existing #3164 thread + tracking table first; if the concern is already tracked, note the NEW recurrence (which nightly, repro count) so the 2+-nightly signal accumulates in one place. Then gh issue comment 3164 --body-file ... with:
- component
- the wrong-vs-expected API
- classification (API candidate, not fixing in CLI docs)
- reproduction (X/3, prompt id, iterations)
- recurrence (list of nightly dates)
- a validation checklist
If reproduce is ≤1/3, comment "API test fail — not reproducible" and do not escalate. Never open a new issue; never change API/types based on a single nightly.
Append tonight's scores to Vibe-Test-Scores.
TOKEN=$(gh auth token)
git clone "https://x-access-token:${TOKEN}@github.com/facebook/astryx.wiki.git" /tmp/astryx-wiki
cd /tmp/astryx-wiki
Append ONE row to the ## Overall table and ONE row to the ## Astryx dimension breakdown table.
GitHub-flavored Markdown ends a table at the first blank line. A new row must be appended immediately after the last existing row, with NO blank line between rows. Likewise, never leave a row directly adjacent to a following ## heading without a blank line BEFORE the heading. Concretely:
- The new row goes on the line directly below the current last data row (no gap).
- There must be exactly ONE blank line between the last row of a table and the next ## section.
- Use — for any target not run. Keep the 0–100 scale. Do not reformat existing rows.
After editing, sanity-check: grep -n "^$" Vibe-Test-Scores.md should show NO blank line sitting between two | ... | rows. Render-verify by re-reading the file: every dated row must sit under a header+separator with no intervening blank line.
Link the row's "Run" cell to the results branch: [results](https://github.com/facebook/astryx/tree/vibe-test/nightly-<date>).
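The blank-line rule for the ledger tables can be checked mechanically before committing. A minimal sketch, assuming table rows are lines that start with `|`; `find_table_gaps` is a hypothetical helper, and the column names in the demo are sample data, not the ledger's real columns.

```shell
# Hypothetical helper: print the line number of each blank line that sits
# directly between two table rows. No output means the tables are intact.
find_table_gaps() {
  awk '
    /^[[:space:]]*$/ { if (prev_row) pending = NR; prev_row = 0; next }
    /^\|/            { if (pending) print pending; pending = 0; prev_row = 1; next }
                     { pending = 0; prev_row = 0 }
  ' "$1"
}

# Demo: a table broken by a blank line before the newly appended row.
sample=$(mktemp)
printf '%s\n' '| Date | Score |' '|---|---|' '| 2025-01-02 | 80 |' '' '| 2025-01-03 | 82 |' > "$sample"
find_table_gaps "$sample"   # prints 4
```

A blank line followed by a `##` heading is not reported, so the required gap before the next section still passes.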
git -c user.name="Cindy Zhang" -c user.email="cixzhang@users.noreply.github.com" \
commit -am "vibe-test: scores $(date +%Y-%m-%d)"
git push origin master
# Verify: curl -s -o /dev/null -w "%{http_code}" https://github.com/facebook/astryx/wiki/Vibe-Test-Scores → 200
Once the scores are on the wiki and any API concerns are on #3164, the per-night issue has served its purpose. Close it with a comment that links where its content went:
gh issue comment <N> --repo facebook/astryx --body "Processed by the Vibe Test Debugger.
- Scores → wiki ledger: https://github.com/facebook/astryx/wiki/Vibe-Test-Scores
- API concerns (if any) → #3164
- Fix PR (if any): #<PR>
Closing — the wiki ledger and #3164 are the permanent record."
gh issue close <N> --repo facebook/astryx
Closing the issue is also the idempotency signal: if the safety-net cron fires later and finds tonight's vibe-test-nightly issue already closed, it stops (NO_REPLY). Never close an issue before its scores are on the wiki.
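The "never close before the scores are on the wiki" rule can be enforced with a guard in front of the close command. This is a sketch under assumptions: `scores_recorded` is a hypothetical helper, it relies on the row's "Run" cell linking to the results branch as described for the ledger, and the YYYY-MM-DD date format in the demo is assumed because the format of `<date>` in the branch name is not specified.

```shell
# Hypothetical guard: succeed only if the ledger has a table row that links
# the results branch for the given date.
scores_recorded() {
  grep -q "^|.*tree/vibe-test/nightly-$2)" "$1"
}

# Demo ledger with one sample row.
ledger=$(mktemp)
printf '%s\n' '| Date | Run |' '|---|---|' \
  '| 2025-01-03 | [results](https://github.com/facebook/astryx/tree/vibe-test/nightly-2025-01-03) |' > "$ledger"

if scores_recorded "$ledger" 2025-01-03; then
  echo "safe to close"       # printed for the sample ledger
else
  echo "do not close yet"
fi
# In a real run the "safe" branch would call:
#   gh issue close <N> --repo facebook/astryx
```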
- Fixes made → announce the PR number, what was fixed, before/after counts (e.g. "reproduced 3/3, resolved 3/3").
- API comments only → what was commented on #3164 + reproduction count.
- Investigated but not reproducible → note as model variance (no PR).
- Always note: wiki ledger updated + per-night issue # closed.
- If the per-night issue was already closed / not present → NO_REPLY.
The Backwards-Validation Protocol is the focused, reproducible vibe test you run twice: once to prove the bug (REPRODUCE, before the fix) and once to prove the fix (VALIDATE, after). Goal: MAXIMUM CONFIDENCE that the change, and only the change, fixed the failure.
The one rule that makes this trustworthy: reproduce and validate must be IDENTICAL in every
way except the fix — same prompt IDs, same target, same persona (naive), same N, same fresh-agent
procedure, same scoring. Before runs on clean origin/main (unfixed); after runs from the
fixed+rebuilt worktree. Change two things at once and the result is noise.
Steps (on the xds node):
- Pick the EXACT failing prompt ID(s) from build-errors.json / universal.json. Same IDs both phases. Never substitute a "similar" prompt.
- Generate a fresh focused iteration for those prompts, persona naive:
    # REPRODUCE: from clean origin/main. VALIDATE: from the fixed worktree.
    pnpm -F @astryxdesign/vibe-tests interactive --target astryx --persona naive --prompts <id1,id2> --label "<repro|validate>-<date>"
- Run N=3 independent reps (recommended: three separate iterations so scoring stays clean).
- Each rep is a FRESH, CONTEXT-FREE spawn_agent per (prompt × rep). The sub-agent gets ONLY the generated task prompt: nothing about the bug, the fix, the expected components, or which prop is "right." Never reuse an agent; never leak this session's reasoning. (Invariants #3 and #5.)
- Score each rep the same way: pnpm -F @astryxdesign/vibe-tests universal --iteration <iter>, then match the SAME error signature (same TS code, same missing/extra prop, same wrong import), not just "did it fail."
- Thresholds: REPRODUCE pass = target error in ≥2/3; VALIDATE pass = target error in 0/3.
- Record evidence: before iteration IDs + verbatim error; after iteration IDs + confirmation gone.
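Signature matching and the VALIDATE threshold from the steps above can be sketched as follows. Assumptions: each rep's scoring output has been saved to its own text file (where the `universal` command writes its results is not stated here), and `count_signature_hits` and `validate_verdict` are hypothetical helpers. The error text in the demo is sample data.

```shell
# Hypothetical helper: count how many rep logs contain the exact error
# signature. grep -F treats the signature as a literal string, so TS codes
# and prop names are matched verbatim rather than as a regex.
count_signature_hits() {
  signature="$1"
  shift
  hits=0
  for log in "$@"; do
    if grep -F -q -- "$signature" "$log"; then
      hits=$((hits + 1))
    fi
  done
  echo "$hits"
}

# VALIDATE passes only when the signature is gone from every rep.
validate_verdict() {
  if [ "$1" -eq 0 ]; then echo "resolved"; else echo "still-failing"; fi
}

# Demo: the signature appears in two of three rep logs.
dir=$(mktemp -d)
echo "error TS2322: Type 'string' is not assignable to type 'number'." > "$dir/rep1.log"
echo "error TS2322: Type 'string' is not assignable to type 'number'." > "$dir/rep2.log"
echo "no errors" > "$dir/rep3.log"
hits=$(count_signature_hits "error TS2322: Type 'string'" "$dir"/rep*.log)
echo "$hits/3"               # prints 2/3
validate_verdict "$hits"     # prints still-failing
```

Matching on the literal signature rather than on pass/fail is what keeps an unrelated failure in one rep from counting as a reproduction.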
Confidence checklist (all true before you trust a result):
- Same prompt ID(s), persona (naive), target in both phases
- N=3 fresh, context-free agents per phase (no reuse, no leaked answer)
- Before on unfixed origin/main; after on fixed+rebuilt worktree
- Only the fix differs between phases
- Error matched by specific signature, not generic pass/fail
- Reproduced ≥2/3 before; resolved 0/3 after
- Iteration IDs + verbatim errors captured as evidence
Re-read internal/vibe-tests/README.md "Checker Protocol" before changes. Never leak expected
components into agent-visible docs (reproduce AND validate runs). Setup changes must keep fairness
across ALL targets. CLI doc fixes must reflect ACTUAL TypeScript types, not aspirational APIs.
Reproduce/validate sub-agents must be fresh and context-free.
- NEVER modify component source (packages/core/src/*) or TypeScript types.
- ONLY fix: CLI .doc.mjs files, environment READMEs, or setup scripts.
- API concerns: COMMENT on #3164; don't fix, don't open a new issue.
- Scores live on the wiki ledger — never file a per-night SCORES issue (the Runner's per-night RUN issue is a different, transient breadcrumb that you CLOSE after processing).
- Append the wiki ledger WITHOUT breaking table formatting (no blank line between rows; see Phase 6).
- NEVER ship a fix that didn't pass backwards validation: reproduce ≥2/3 BEFORE, resolve 0/3 AFTER.
- Public repo: No AI attribution, no co-authored-by, no navibot URLs. Commits use Cindy's git author.
- If no per-night issue is open for today (already closed / not run), do nothing (NO_REPLY).
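The first two rules lend themselves to a pre-push check on the changed file list. A minimal sketch, assuming the list comes from something like `git diff --name-only origin/main`; `off_limits_changes` is a hypothetical helper. It only covers component source under packages/core/src/, so type changes elsewhere still need a manual look. The file names in the demo are sample paths.

```shell
# Hypothetical guard: read changed file paths on stdin and print any that are
# off-limits, i.e. files under packages/core/src/ other than *.doc.mjs.
# "|| true" keeps the exit status clean when nothing is flagged.
off_limits_changes() {
  grep '^packages/core/src/' | grep -v '\.doc\.mjs$' || true
}

# Demo: one allowed doc fix, one forbidden source edit, one README change.
printf '%s\n' \
  'packages/core/src/Button/Button.doc.mjs' \
  'packages/core/src/Button/Button.tsx' \
  'internal/vibe-tests/environments/project-x/README.md' |
  off_limits_changes
# Only packages/core/src/Button/Button.tsx is printed.
```

Any output means the change set should be backed out before the fix goes through validation.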