Authoritative brief for agents working on Phase 2. Each swarm agent starts cold — everything they need is here.
The repo lives at /Users/jvadala/BettyAgent /parcc1. The parent directory BettyAgent has a literal trailing space. Always quote Bash paths. Dedicated tools (Read, Glob, Grep, Edit, Write) handle it fine.
- Next.js 15 + React 19 chat GUI, Claude Agent SDK backend
- Entry points: betty-ai-web/src/app/api/chat/route.ts → betty-ai-web/src/agent/server.ts
- Tools exposed (under betty-ai-web/src/agent/tools/): `wiki_search`, `wiki_read`, `wiki_write`, `gpu_calculate`, `cluster_run`, `cluster_submit`, `cluster_status`, `slurm_availability`, `slurm_check`, `slurm_diagnose`, `slurm_recommend`
- System prompt: betty-ai-web/src/agent/system-prompt.ts
- Wiki: 15 concepts, 11 entities, 5 models, 7 sources; schema at wiki/SCHEMA.md
- Python brain: betty-ai/ — model registry, GPU calculator, Slurm/DeepSpeed/training templates
- Fix the prompt/code mismatch where the agent is told it can "file experiments" but has no write tool.
- Add real cluster execution (read cluster state, submit sbatch jobs, poll status) over SSH with human-in-the-loop confirmation.
- Dogfood the whole system end-to-end on one realistic user journey.
| # | Decision | Value |
|---|---|---|
| D1 | SSH credential strategy | Shell out to ssh CLI, inherit user's Kerberos cache (kinit on host) |
| D2 | SSH connection handling | Pool one connection per server process, 30s keepalive, auto-reconnect on failure |
| D3 | Server topology | Runs on user's laptop, Penn VPN required |
| D4 | canUseTool tiers | Tier 0 auto-approve: wiki reads, gpu_calculate, append-to-wiki/log.md. Tier 1 prompt once per session: read-only cluster commands in whitelist, wiki updates. Tier 2 always prompt: cluster_submit, wiki creates outside experiments/ |
| D5 | Job output streaming | Defer live streaming. Phase 2 = poll sacct + tail .out/.err |
| D6 | Experiment page ownership | Marker-delimited regions: agent owns ## Status / ## Runtime between <!-- betty:auto-start --> / <!-- betty:auto-end --> markers. User owns ## Goal / ## Lessons |
| D7 | Cluster-read whitelist | Start minimal: squeue, sinfo, parcc_*, ls /vast/..., cat of jvadala's .out/.err/.log files. Expand from dogfood findings. scancel deferred |
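
The D4 tiers can be read as a pure classification step that runs before any approval prompt. The sketch below is one possible reading, not the project's implementation: the helper name `classifyTier`, the input shape, and the convention that `page` is relative to the wiki root are all assumptions. Cases D4 does not spell out (unknown tools, appends to pages other than `log.md`) fall through to Tier 2 so the sketch fails closed.

```typescript
// Hypothetical sketch of the D4 tiers. Tool names come from the plan;
// classifyTier and WikiWriteInput are illustrative assumptions.
type Tier = 0 | 1 | 2;

interface WikiWriteInput {
  page: string; // assumed to be relative to the wiki root, e.g. "log.md"
  mode: "create" | "update" | "append";
}

function classifyTier(tool: string, input?: WikiWriteInput): Tier {
  switch (tool) {
    case "wiki_search":
    case "wiki_read":
    case "gpu_calculate":
      return 0; // reads and pure calculation auto-approve
    case "cluster_run":
      return 1; // whitelisted read-only commands: prompt once per session
    case "cluster_submit":
      return 2; // always prompt
    case "wiki_write": {
      if (!input) return 2; // fail closed when the input is missing
      if (input.mode === "append" && input.page === "log.md") return 0;
      if (input.mode === "update") return 1;
      if (input.mode === "create" && input.page.startsWith("experiments/")) return 1;
      return 2; // creates outside experiments/ and anything D4 does not cover
    }
    default:
      return 2; // tools not named in D4 get the strictest tier (assumption)
  }
}
```

A `canUseTool` callback could call this first, then consult a per-session set of already-approved Tier 1 tools before prompting.
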
Each deliverable must include automated tests. If the repo has no test runner yet, set one up; vitest is the right choice for this Next.js + TS stack. Add:
- betty-ai-web/vitest.config.ts
- `"test": "vitest run"` and `"test:watch": "vitest"` scripts in betty-ai-web/package.json
- Colocated `*.test.ts` files next to the unit under test (e.g., `wiki-write.test.ts`)
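
As a starting point, the config file might look like the fragment below. The `include` glob and the `node` environment are assumptions about how this repo will be laid out, not settled choices.

```typescript
// betty-ai-web/vitest.config.ts (sketch; glob and environment are assumptions)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Node is enough for the tool and whitelist tests.
    environment: "node",
    // Pick up colocated tests, including .tsx component tests.
    include: ["src/**/*.test.{ts,tsx}"],
  },
});
```

Component tests need a DOM environment such as jsdom; vitest lets a single file opt in with a `// @vitest-environment jsdom` comment at the top, which avoids switching the whole suite.
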
Tests must cover:
- Happy path — the expected input/output contract
- Security boundaries — path traversal, whitelist bypass attempts, malformed inputs
- Error paths — what happens when the filesystem, SSH, or network fails
Before reporting done, every agent runs npm run typecheck and npm run test and confirms both pass.
Files:
- NEW betty-ai-web/src/agent/tools/wiki-write.ts: mirror the traversal guard style in betty-ai-web/src/agent/tools/wiki-read.ts. Input: `{ page: string, body: string, mode: 'create' | 'update' | 'append' }`. Enforce: path resolves under `paths.wiki`, `.md` forced, frontmatter required on `create`, append-only for `wiki/log.md`. For `update`, preserve user-owned sections; agent-owned content lives between `<!-- betty:auto-start -->` and `<!-- betty:auto-end -->` markers (D6).
- NEW betty-ai-web/src/agent/tools/wiki-write.test.ts: vitest unit tests covering the happy path (create, append, update), path traversal rejection (`../`, absolute, symlinked), missing frontmatter rejection, and marker-region preservation.
- MODIFY betty-ai-web/src/agent/server.ts: import and register `wikiWriteTool`, add it to `allowedTools`. Add a `canUseTool` callback per D4. Export a `writeWikiPage()` helper so Track C can auto-log experiments.
- MODIFY betty-ai-web/src/app/api/chat/route.ts: extend SSE frame types to carry a `tool_permission` event (a JSON frame the client can render as an Approve/Deny card). Keep backwards compatibility with existing `text` frames.
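
The traversal guard for `wiki_write` could follow the shape below. The existing guard in wiki-read.ts is not reproduced in this brief, so this is an independent sketch rather than a copy of it; `resolveWikiPage` and its parameters are hypothetical names. The check is lexical only: rejecting symlinks that point outside the wiki would additionally need a realpath comparison on the resolved file or its parent directory.

```typescript
import * as path from "node:path";

// Hypothetical helper: map a page name to an absolute path under wikiRoot,
// or throw. Lexical check only; symlink escapes need a separate realpath check.
function resolveWikiPage(wikiRoot: string, page: string): string {
  if (page.includes("\0")) throw new Error("invalid page name");
  if (path.isAbsolute(page)) throw new Error("absolute paths are not allowed");

  // Force the .md extension.
  const withExt = page.endsWith(".md") ? page : `${page}.md`;

  const root = path.resolve(wikiRoot);
  const target = path.resolve(root, withExt);
  const rel = path.relative(root, target);

  // A relative path that climbs out of root starts with a ".." segment.
  if (rel === ".." || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel)) {
    throw new Error("path escapes the wiki root");
  }
  return target;
}
```

Comparing via `path.relative` instead of a string prefix avoids the classic bug where `/repo/wiki-evil` passes a `startsWith("/repo/wiki")` test.
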
Acceptance:
- Writes outside `wiki/` rejected by tests covering `../`, absolute paths, and symlinks.
- Schema-lint rejects `create` without required frontmatter (`name`, `description`, `type`).
- Marker-region preservation verified by round-trip test.
- Tier 2 writes (creates outside `experiments/`) fire `canUseTool`; Tier 1 writes (updates, experiments creates) fire `canUseTool` once per session; appends to `wiki/log.md` auto-approve.
- Exports `writeWikiPage(path, body, mode)` for server-side use by Track C.
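
The marker-region rule from D6 and the round-trip acceptance test can be illustrated with a small pure function. This is a sketch under a simplifying assumption: the page holds exactly one agent-owned region. If `## Status` and `## Runtime` end up in separate marker pairs, the helper would need to address regions individually. `replaceAutoRegion` is a hypothetical name.

```typescript
const AUTO_START = "<!-- betty:auto-start -->";
const AUTO_END = "<!-- betty:auto-end -->";

// Hypothetical helper: swap the content of the first agent-owned region and
// leave every byte outside the markers untouched.
function replaceAutoRegion(page: string, newBody: string): string {
  const start = page.indexOf(AUTO_START);
  const end = start === -1 ? -1 : page.indexOf(AUTO_END, start + AUTO_START.length);
  if (start === -1 || end === -1) {
    throw new Error("page has no complete betty:auto region");
  }
  const before = page.slice(0, start + AUTO_START.length); // includes start marker
  const after = page.slice(end); // includes end marker
  return `${before}\n${newBody.trim()}\n${after}`;
}
```

Throwing when the markers are missing keeps an `update` from silently overwriting user-owned sections on a page that was never set up for machine writes.
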
Files:
- NEW wiki/experiments/TEMPLATE.md: frontmatter per wiki/SCHEMA.md, sections: `## Goal` (user), `## Status` (agent, marker-delimited), `## Runtime` (agent, marker-delimited), `## Lessons` (user).
- NEW wiki/experiments/.gitkeep
- MODIFY wiki/index.md: add "Experiments" section.
- NEW/MODIFY chat UI component under betty-ai-web/src/components/: render a `tool_permission` SSE frame as an Approve/Deny card. Post the result back via a new `POST /api/chat/permission` endpoint (add it to betty-ai-web/src/app/api/chat/route.ts or a new sibling route).
- NEW betty-ai-web/src/components/*.test.tsx: React Testing Library tests: card renders with tool name + args summary, Approve dispatches correct payload, Deny dispatches correct payload, disconnect during pending request fails closed.
Acceptance:
- `ls "wiki/experiments"` shows TEMPLATE.md.
- Manual browser test: asking the agent to file an experiment produces an approval card; clicking Approve or Deny triggers the matching callback.
- UI unit tests pass.
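
To make the contract between Track A (server) and Track B (UI) concrete, the frames could be modelled as a discriminated union. Only the frame type names `text` and `tool_permission` come from this plan; every field name below (`requestId`, `tool`, `argsSummary`, `decision`) is an assumption for illustration.

```typescript
// Frame shapes are assumptions; the plan only names the two frame types.
type ChatFrame =
  | { type: "text"; text: string }
  | { type: "tool_permission"; requestId: string; tool: string; argsSummary: string };

interface PermissionDecision {
  requestId: string;
  decision: "approve" | "deny";
}

// Serialize one frame as a server-sent event.
function encodeFrame(frame: ChatFrame): string {
  return `data: ${JSON.stringify(frame)}\n\n`;
}

// Parse one event; unknown frame types are ignored so older clients keep working.
function decodeFrame(raw: string): ChatFrame | null {
  const line = raw.trim();
  if (!line.startsWith("data: ")) return null;
  try {
    const parsed: unknown = JSON.parse(line.slice("data: ".length));
    const kind = (parsed as { type?: unknown } | null)?.type;
    return kind === "text" || kind === "tool_permission" ? (parsed as ChatFrame) : null;
  } catch (_e) {
    return null;
  }
}

// Fail closed: a missing reply (for example after a disconnect) counts as deny.
function resolveDecision(reply: PermissionDecision | undefined): "approve" | "deny" {
  return reply?.decision === "approve" ? "approve" : "deny";
}
```

Treating anything other than an explicit approve as a deny matches the "disconnect during pending request fails closed" test listed for the UI component.
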
Files:
- NEW betty-ai-web/src/agent/cluster/ssh.ts: shell out to the `ssh` CLI (D1). Pooled single connection via `ssh -M -S <control-socket>` ControlMaster (D2). Inherits the Kerberos ticket from the `kinit` cache on host. Auto-reconnect on socket failure. Export `runRemote(command): Promise<{stdout, stderr, exit}>` and `uploadFile(localPath | buffer, remotePath)`.
- NEW betty-ai-web/src/agent/cluster/whitelist.ts: regex allowlist (D7). Export `isSafeReadCommand(cmd): boolean`. Also export the regex list so betty-ai-web/src/agent/system-prompt.ts can render it inline (single source of truth).
- NEW betty-ai-web/src/agent/cluster/*.test.ts: unit tests for whitelist (positive + adversarial cases: `; rm -rf /`, command injection via backticks, unicode lookalikes).
- NEW betty-ai-web/src/agent/tools/cluster-run.ts: MCP tool. Input: `{ command: string }`. Rejects non-whitelisted commands. Returns `{stdout, stderr, exit}`. `readOnlyHint: true`. Tier 1 per D4.
- NEW betty-ai-web/src/agent/tools/cluster-submit.ts: Input: `{ script_body: string, sbatch_args?: string[], experiment_slug: string }`. Uploads script to `/vast/home/j/jvadala/.betty-ai/scripts/<slug>.sbatch`, runs `sbatch`, parses JobID. Tier 2 per D4. On success: calls Track A's `writeWikiPage()` to create `wiki/experiments/YYYY-MM-DD-<slug>.md` with the script inlined and `job_id` in frontmatter; appends a line to `wiki/log.md`. Atomic: if sbatch fails, no wiki write; if wiki write fails, error surfaces JobID for recovery.
- NEW betty-ai-web/src/agent/tools/cluster-status.ts: Input: `{ job_id: string }`. Runs `sacct -j <id> --format=JobID,State,Elapsed,ExitCode` or `squeue -j <id>`. Updates the matching experiment page's marker-delimited `## Status` / `## Runtime` sections via Track A's helper.
- NEW betty-ai-web/src/agent/tools/cluster-*.test.ts: mock SSH transport. Test: happy submit, rejected-command path, atomicity (sbatch fails → no wiki write), status update preserves user sections.
- MODIFY betty-ai-web/src/agent/server.ts: register the three tools in `allowedTools` and `bettyTools`. Extend `canUseTool` tiers per D4.
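
One way to structure the allowlist is a strict character filter followed by anchored per-command patterns. The patterns below are illustrative assumptions derived from D7, not the final list; the `cat` rule for `.out`/`.err`/`.log` files is left out, and the accepted flag shapes are guesses.

```typescript
// Illustrative patterns only; the real list is decided in whitelist.ts.
const READ_COMMAND_PATTERNS: readonly RegExp[] = [
  /^squeue( -[a-zA-Z] [\w.-]+)*$/,
  /^sinfo( -[a-zA-Z]( [\w.,-]+)?)*$/,
  /^parcc_[a-z_]+$/,
  /^ls( -[a-zA-Z]+)? \/vast\/[\w./-]+$/,
];

// Plain ASCII subset with no shell metacharacters. This rejects ; | & ` $ ( )
// quotes, newlines, and non-ASCII lookalike letters before any pattern runs.
const SAFE_CHARS = /^[A-Za-z0-9 _.\/,=-]+$/;

function isSafeReadCommand(cmd: string): boolean {
  if (!SAFE_CHARS.test(cmd)) return false;
  if (cmd.includes("..")) return false; // no parent-directory hops in paths
  return READ_COMMAND_PATTERNS.some((re) => re.test(cmd));
}
```

In whitelist.ts both the function and the pattern list would be exported, so the system prompt can render the same patterns the tool enforces. Running the character filter first means a new pattern cannot accidentally open an injection path, at the cost of rejecting some legitimate commands that need quoting.
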
Acceptance:
- `cluster_run "squeue -u jvadala"` returns live output within 3s over a warm connection (integration test, skipped in CI unless `BETTY_SSH_OK=1`).
- `cluster_run "rm -rf /"` rejected at whitelist, never reaches SSH.
- Submit atomicity verified by test.
- Status update preserves user-owned sections.
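
The ordering behind the submit atomicity rule can be sketched with injected dependencies, which is also how a mocked SSH transport would plug in for tests. `SubmitDeps`, `submitAndLog`, and `parseJobId` are hypothetical names, and the real signatures of `runRemote` and `writeWikiPage()` may differ. The regex relies on sbatch's usual `Submitted batch job <id>` success line.

```typescript
// Hypothetical dependency-injected sketch of the submit ordering.
interface SubmitDeps {
  sbatch: (script: string) => Promise<{ stdout: string; exit: number }>;
  writeExperiment: (jobId: string) => Promise<void>;
}

// Extract the numeric job ID from sbatch's success line, or null if absent.
function parseJobId(stdout: string): string | null {
  const match = /Submitted batch job (\d+)/.exec(stdout);
  return match?.[1] ?? null;
}

async function submitAndLog(script: string, deps: SubmitDeps): Promise<string> {
  const res = await deps.sbatch(script);
  const jobId = res.exit === 0 ? parseJobId(res.stdout) : null;
  if (jobId === null) {
    // sbatch failed or printed something unexpected: stop before any wiki write.
    throw new Error("sbatch failed; nothing written to the wiki");
  }
  try {
    await deps.writeExperiment(jobId);
  } catch (err) {
    // The job is already queued, so surface its ID for manual recovery.
    throw new Error(`job ${jobId} submitted but wiki write failed: ${String(err)}`);
  }
  return jobId;
}
```

The two failure branches are deliberately asymmetric: a failed submit leaves no trace in the wiki, while a failed wiki write cannot undo the submission and therefore reports the JobID instead.
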
Files:
- MODIFY betty-ai-web/src/agent/system-prompt.ts: remove "Phase 1: can only TALK about commands". Document the three cluster tools with tier info (D4). Import whitelist from Track C and render it inline. Add good/bad `cluster_run` examples. Document auto-logging of submissions.
- MODIFY wiki/SCHEMA.md: add "Machine-written experiment pages" section explaining marker regions (D6) and what frontmatter fields Betty AI sets automatically.
- MAYBE MODIFY .claude/agents/betty-ai.md: only if that file is still the source of truth for some path. Otherwise leave.
- NEW betty-ai-web/src/agent/system-prompt.test.ts: snapshot test that `buildSystemPrompt()` includes tool names from `allowedTools` and the whitelist patterns.
Acceptance:
- Prompt contains `cluster_run`, `cluster_submit`, `cluster_status`, `wiki_write`.
- Prompt no longer says "in this phase you can only TALK".
- Snapshot test passes.
Process:
- Fresh browser session. User prompt: "Help me fine-tune Llama 3 8B with LoRA on a 500-example test dataset — start from zero."
- Walk full journey: cluster state check → partition recommendation → `gpu_calculate` → draft sbatch → confirm → submit → poll → capture logs → file results.
- At each step, capture: (a) hallucinations, (b) missing citations, (c) missing tools, (d) UX friction.
- Output: `wiki/experiments/2026-04-18-dogfood-llama3-8b-lora.md` + `raw/dogfood/2026-04-18-notes.md` with ranked gap list (≥5 items, severity labeled).
Acceptance: End-to-end submission reaches a SLURM JobID; auto-generated wiki page exists; gap list produced.
Wave 1 (parallel):
Track A ── wiki_write tool + server.ts registration + SSE permission frame
Track B ── wiki seed + UI confirmation card (touches UI files only)
Track C.1 ─ SSH transport (ssh.ts + whitelist.ts, NO server.ts touch yet)
Sync S1 ── Wave 1 lands; A's writeWikiPage() helper exported; SSE protocol stable
Wave 2 (parallel):
Track C.2 ─ cluster-run/submit/status tools + server.ts registration
Track D ── system-prompt.ts + SCHEMA.md (references real tool names from A+C)
Sync S2 ── All tools live; integration tests green
Wave 3:
Track E ── interactive dogfood, produces backlog for Phase 2b
Conflict avoidance: Only Track A modifies server.ts in Wave 1. Track C.1 in Wave 1 only creates new files. Track C.2 in Wave 2 re-enters server.ts after A has landed. Track D in Wave 2 reads the final tool names.
- D1–D7 locked
- PLAN.md committed
- Wave 1 agents launched
- Wave 1 merged & tests green
- Wave 2 agents launched
- Wave 2 merged & tests green
- Dogfood journey complete