A minimalist, Cursor-style workspace for autonomous materials-science R&D campaigns.
A campaign is the repo. Candidate materials are files. Experiment plans are PRs. Simulations / lab runs are CI. The agent panel is Cursor's agent mode. The scientist reviews and approves experiment batches; agents quietly advance the work in the background.
This is a real working v1, not a mockup. Real Postgres-style persistence (SQLite v1), a real durable background worker, ten typed multi-agent specialists, swappable LLM providers (OpenAI / Anthropic / deterministic local fallback), pluggable adapter interfaces for literature / patent / simulation / lab / instrument / notification, and a real Experiment-PR human-approval flow.
Above: a live campaign ("Find a cobalt-free cathode with 15% lower cost than NMC811, >200 mAh/g, >1000 cycles, slurry-coating compatible"), driven end-to-end by claude-sonnet-4-5. Eight real cathode candidates were generated, scored, safety-checked, IP-checked, packaged into an Experiment PR, approved by a human, simulated, re-ranked, and summarized, with a next action recommended. Every row is persisted. Every agent step is auditable. The right pane is the agent timeline (orchestrator → simulations → results-analysis → report → next plan), all real LLM calls.
npm install
npm run db:push # creates SQLite schema (file: ./dev.db)
npm run db:seed # seeds 4 sample campaigns
npm run dev:all # web (3000) + worker concurrently
Open http://localhost:3000. Pick a seeded campaign, type a command into the agent panel ("Generate the next 4 candidates and draft an experiment PR"), watch the orchestrator plan → search literature → generate candidates → score / safety / IP → create an Experiment PR. Open the PR, click Approve, watch simulations queue, results land, and candidate rankings refresh.
If you prefer separate terminals:
npm run dev # Next.js on :3000
npm run worker # background worker in another shell
Other scripts:
npm test # vitest (unit + API): scoring, simulation, schemas, orchestrator
npm run test:e2e # playwright smoke (start dev + worker first)
npm run typecheck # tsc --noEmit
npm run build # production build
npm run db:reset # nuke + reseed the local DB
matter-agent/
  app/                     Next.js 14 App Router pages + API routes
    api/                   campaigns, candidates, experiments, results, evidence, prs,
                           runs (incl. SSE /runs/[id]/stream), settings
    campaigns/[id]         3-pane workspace (sidebar / center / agent panel)
    campaigns/[id]/prs     Experiment-PR review screen
    settings               provider / model / adapter / queue status
  agents/                  Typed multi-agent system
    types.ts               zod schemas for every agent IO
    runner.ts              AgentRunner: persists tasks + events, retries, LLM accounting
    specialists.ts         10 agent definitions (system prompts + schemas)
    orchestrator.ts        long-horizon orchestration: plan → lit → gen → score → safety
                           → IP → design → Experiment PR (waiting_approval)
                           + post-approval finalization
    simulation.ts          experiment runner that calls the simulation adapter
    scoring.ts             deterministic overall-score function + sub-score derivation
  llm/                     swappable LLM provider layer
    provider.ts            interface + factory + LLMCall accounting
    openai.ts              OpenAI Chat Completions JSON-mode wrapper
    anthropic.ts           Anthropic Messages JSON wrapper
    deterministic.ts       seeded local stub that produces schema-valid output from
                           request hash, sampling from per-domain tables. Lets the
                           app run end-to-end with no API keys.
  adapters/                pluggable real-world interfaces
    types.ts               LiteratureSourceAdapter, PatentSearchAdapter, SimulationAdapter,
                           LabAdapter, InstrumentResultAdapter, NotificationAdapter
    literature/            local-corpus literature search
    patents/               local-corpus patent search
    simulation/            deterministic property-prediction simulator (seeded, reproducible)
    lab/                   manual-lab placeholder (creates queued manual tasks)
    notifications/         console notifier
    registry.ts            the single place to swap real adapters in
  lib/                     db, queue (DB-backed durable), config, api helpers, serializers
  worker/                  separate Node process (BullMQ-style API, DB-backed)
    index.ts               workerId, tick loop, claim → process → finish/fail
    processors/            run-orchestrator, run-experiment, campaign-loop
  prisma/
    schema.prisma          all domain tables (campaigns, candidates, experiment_prs,
                           experiments, results, evidence, agent_runs, agent_tasks,
                           agent_events, llm_calls, settings, job)
    seed.ts                4 sample campaigns (cathode, electrolyte, ammonia catalyst,
                           anti-corrosion coating)
  data/corpus/             local literature + patent JSON corpora
  components/              AppShell, CampaignSidebar, CampaignHeader, ArtifactTabs,
                           AgentPanel, CandidateTable, CandidateDetail, ExperimentQueue,
                           ResultTable, EvidenceList, ObjectiveView, ExperimentPRReview,
                           SettingsPanel, StatusBadge
  tests/
    unit/                  scoring, simulation, schemas, orchestrator (state-machine)
    api/                   campaigns route
    e2e/                   Playwright smoke: create campaign → command → approve PR
                           → see results
User command
  → POST /api/runs (AgentRun row created, status=queued)
  → enqueue 'orchestrator' job
worker.tick()
  → orchestrate()
    → OrchestratorAgent.plan
    → LiteratureAgent (local corpus)
    → CandidateGenerationAgent → Candidate rows
    → ScoringAgent + SafetyAgent + IPNoveltyAgent → updates Candidate rows
    → ExperimentDesignAgent → ExperimentPR (approval_status=pending)
    → AgentRun.status = waiting_approval (BLOCKS for human approval)
User approves PR
  → orchestratePostApproval() creates Experiment rows + enqueues 'experiment' jobs
worker.tick()
  → runExperiment() per Experiment
    → SimulationAdapter.simulate() → Result row + Evidence row
    → updates Candidate.predictedProperties + sub-scores + overallScore
  → when all PR experiments complete
    → ResultsAnalysisAgent → candidate updates + next-action recommendation
    → ReportAgent → AgentRun.outputSummary
    → Campaign.status = active (loop continues)
Every agent IO is validated with zod. Failed validation retries once, then the
task is marked failed and surfaces in the agent panel as a red event. Every LLM
call writes an LLMCall row with provider, model, prompt hash, input/output
summaries, and latency. Raw model output never directly mutates the DB.
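The validate-then-retry rule above can be sketched roughly as follows. This is a minimal illustration, not the project's `AgentRunner`: `runValidated`, `Parser`, and `TaskOutcome` are hypothetical names, and the `parse` argument stands in for a zod schema's `parse` method so the sketch has no dependencies.

```typescript
// Hypothetical sketch; none of these names come from the repo.
// A Parser throws on invalid input, the way a zod schema's parse() does.
type Parser<T> = (raw: unknown) => T;

interface TaskOutcome<T> {
  status: "succeeded" | "failed";
  attempts: number;
  output?: T;
  error?: string;
}

// Calls the model, validates the raw output, and retries once on failure.
// Only validated output is ever returned to the caller.
async function runValidated<T>(
  callModel: () => Promise<unknown>,
  parse: Parser<T>,
  maxAttempts = 2, // one initial attempt plus one retry
): Promise<TaskOutcome<T>> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const raw = await callModel();
      const output = parse(raw); // throws if the shape is wrong
      return { status: "succeeded", attempts: attempt, output };
    } catch (err) {
      // Assumption: a thrown model call is treated like a validation failure.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { status: "failed", attempts: maxAttempts, error: lastError };
}
```

A caller would persist the outcome (and, per the text above, an LLMCall row per attempt) and write to domain tables only from `output`, never from the raw response.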
A real GitHub-PR-style review surface: goal, proposed experiments per candidate, cost / runtime / risk, information-gain estimate, the evidence the agent used, and Approve / Edit / Reject / Ask Agent. Approval is the only thing that lets the worker spend real (simulated) money — every expensive action is gated on a human.
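One way to picture the approval gate is a guard that refuses to create experiments unless the PR has been approved. The sketch below is an assumption-laden illustration: the type shapes and the function name `experimentsForApprovedPR` are hypothetical, and of the status values only `pending` and `queued` appear in this README (`approved` and `rejected` are assumed).

```typescript
// Hypothetical shapes for illustration; the real Prisma models may differ.
type ApprovalStatus = "pending" | "approved" | "rejected";

interface ProposedExperiment {
  candidateId: string;
  estimatedCostUsd: number;
}

interface ExperimentPR {
  id: string;
  approvalStatus: ApprovalStatus;
  proposed: ProposedExperiment[];
}

interface ExperimentRow {
  prId: string;
  candidateId: string;
  status: "queued";
}

// Builds the Experiment rows to insert, but only for an approved PR.
// Anything else throws, so no expensive work can be queued by accident.
function experimentsForApprovedPR(pr: ExperimentPR): ExperimentRow[] {
  if (pr.approvalStatus !== "approved") {
    throw new Error(`PR ${pr.id} is ${pr.approvalStatus}; refusing to create experiments`);
  }
  return pr.proposed.map((p) => ({
    prId: pr.id,
    candidateId: p.candidateId,
    status: "queued" as const,
  }));
}
```

Keeping the check inside the function that produces the rows, rather than only in the UI, means a stray job or API call cannot bypass the human decision.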
When experiments finish, raw measurements (specific capacity, cycle life, cost
per kWh, thermal runaway, manufacturability, cobalt fraction) are persisted as
Result rows with Evidence provenance, and propagated back into per-candidate
sub-scores and the overall ranking.
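As a rough illustration of how sub-scores might roll up into a ranking, the sketch below uses a weighted average over whatever sub-scores are available. The actual formula in `scoring.ts` is not shown in this README, so the weighting scheme, the [0, 1] range, and the names `overallScore` and `rankCandidates` are all assumptions.

```typescript
// Assumption: each sub-score is normalised to [0, 1]; names are hypothetical.
type SubScores = Record<string, number>;

// Weighted average over the sub-scores that are present.
// Missing sub-scores are skipped so unmeasured properties do not count as zero.
function overallScore(sub: SubScores, weights: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [name, weight] of Object.entries(weights)) {
    const value = sub[name];
    if (value === undefined) continue;
    total += weight * Math.min(1, Math.max(0, value)); // clamp to [0, 1]
    weightSum += weight;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

// Sorts best-first; ties are broken by name so the order is deterministic.
function rankCandidates<T extends { name: string; score: number }>(candidates: T[]): T[] {
  return [...candidates].sort((a, b) => b.score - a.score || a.name.localeCompare(b.name));
}
```

Because the function is pure and has no randomness, re-running it on the same Result rows always reproduces the same ranking.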
A campaign-loop repeatable job ticks every 30s. For each active/running
campaign with no pending PR and no in-flight experiments, it auto-enqueues an
orchestrator run to propose the next experiment PR. The loop stops at every
approval boundary — risky / expensive actions always require a human.
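The eligibility rule for the campaign loop can be expressed as a small pure filter. This is a sketch under assumptions: `CampaignSnapshot` and `campaignsToAdvance` are hypothetical names, and the counts would come from database queries in the real worker. The candidate-count condition reflects the limitation noted later in this README that bootstrap requires a human command.

```typescript
// Hypothetical snapshot of what the loop needs to know about a campaign.
interface CampaignSnapshot {
  id: string;
  status: string; // the text names "active" and "running" as eligible
  pendingPRs: number;
  inFlightExperiments: number;
  candidateCount: number;
}

// Returns the ids of campaigns for which an orchestrator run should be enqueued.
function campaignsToAdvance(campaigns: CampaignSnapshot[]): string[] {
  return campaigns
    .filter(
      (c) =>
        (c.status === "active" || c.status === "running") &&
        c.pendingPRs === 0 && // never pile up PRs behind an approval boundary
        c.inFlightExperiments === 0 && // wait for running work to finish
        c.candidateCount > 0, // bootstrap is left to a human command
    )
    .map((c) => c.id);
}
```

A 30-second tick would call this once per pass and enqueue one `orchestrator` job per returned id.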
All settings have safe defaults. Copy .env.example to .env and adjust:
| Variable | Purpose | Default |
|---|---|---|
| DATABASE_URL | Prisma datasource | file:./dev.db |
| DEFAULT_LLM_PROVIDER | openai / anthropic / deterministic | deterministic |
| OPENAI_API_KEY | Optional; switch provider in /settings | empty |
| OPENAI_MODEL | Model id string (no hard-coded versions) | gpt-4o-mini |
| ANTHROPIC_API_KEY | Optional | empty |
| ANTHROPIC_MODEL | Model id string | claude-3-5-sonnet-latest |
| APP_URL | Used for SSR URLs / Playwright | http://localhost:3000 |
| WORKER_TICK_MS | Worker polling interval | 1500 |
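A config loader that applies these defaults might look like the sketch below. It is illustrative only: `loadConfig` and `AppConfig` are hypothetical names (the real loader lives somewhere under `lib/` and may differ), and treating an empty string as "unset" is an assumption.

```typescript
type LlmProvider = "openai" | "anthropic" | "deterministic";

interface AppConfig {
  databaseUrl: string;
  defaultLlmProvider: LlmProvider;
  openaiApiKey: string;
  openaiModel: string;
  anthropicApiKey: string;
  anthropicModel: string;
  appUrl: string;
  workerTickMs: number;
}

type Env = Record<string, string | undefined>;

function isProvider(value: string): value is LlmProvider {
  return value === "openai" || value === "anthropic" || value === "deterministic";
}

// Reads the variables from the table above, falling back to the listed defaults.
// Assumption: empty strings count as unset, hence `||` rather than `??`.
function loadConfig(env: Env): AppConfig {
  const provider = env["DEFAULT_LLM_PROVIDER"] || "deterministic";
  if (!isProvider(provider)) {
    throw new Error(`Unknown DEFAULT_LLM_PROVIDER: ${provider}`);
  }
  const workerTickMs = Number(env["WORKER_TICK_MS"] || "1500");
  if (!Number.isInteger(workerTickMs) || workerTickMs <= 0) {
    throw new Error("WORKER_TICK_MS must be a positive integer");
  }
  return {
    databaseUrl: env["DATABASE_URL"] || "file:./dev.db",
    defaultLlmProvider: provider,
    openaiApiKey: env["OPENAI_API_KEY"] || "",
    openaiModel: env["OPENAI_MODEL"] || "gpt-4o-mini",
    anthropicApiKey: env["ANTHROPIC_API_KEY"] || "",
    anthropicModel: env["ANTHROPIC_MODEL"] || "claude-3-5-sonnet-latest",
    appUrl: env["APP_URL"] || "http://localhost:3000",
    workerTickMs,
  };
}
```

In a Node process this would typically be called once at startup as `loadConfig(process.env)`; passing the environment in as an argument keeps the function easy to unit test.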
When no API keys are present the system runs in Local Deterministic Mode: the deterministic provider produces structured, plausible outputs by hashing the prompt and sampling per-domain tables. The UI surfaces this clearly in the agent panel and the Settings page. Without keys the app is fully functional end to end.
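The hash-then-sample idea can be sketched as follows. This is not the code in `deterministic.ts`: the hash (FNV-1a over UTF-16 code units), the PRNG (mulberry32), the table contents, and the function names are all assumptions chosen to make the example self-contained.

```typescript
// FNV-1a (32-bit) over UTF-16 code units: turns a prompt into a stable seed.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical per-domain table; the real tables are not shown in this README.
const DOMAIN_TABLES: Record<string, string[]> = {
  cathode: ["LFP", "LMFP", "LMO", "LNMO"],
};

// Picks `count` distinct entries; the same prompt always yields the same picks.
function deterministicCandidates(prompt: string, domain: string, count: number): string[] {
  const table = DOMAIN_TABLES[domain];
  if (!table) throw new Error(`No table for domain: ${domain}`);
  const rand = mulberry32(fnv1a(`${domain}:${prompt}`));
  const pool = [...table];
  const picked: string[] = [];
  while (picked.length < count && pool.length > 0) {
    const [item] = pool.splice(Math.floor(rand() * pool.length), 1);
    if (item !== undefined) picked.push(item);
  }
  return picked;
}
```

Seeding from the request itself, rather than from a clock or global counter, is what makes runs reproducible across restarts and machines.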
For practical local-first runnability the v1 swaps two infrastructure choices:
- SQLite via Prisma instead of Postgres. The schema is identical; switching to Postgres only requires changing datasource.provider and the DATABASE_URL.
- DB-backed durable queue (the Job table + lib/queue.ts) instead of BullMQ + Redis. The surface (enqueue / claimNextJob / finishJob / failJob) is intentionally BullMQ-shaped so swapping is mechanical. The worker is still a fully separate Node process; state survives restarts.
Both choices preserve every architectural property the spec required (real persistence, separate worker process, durable jobs) with zero external infra. Postgres + Redis + BullMQ is the recommended production swap.
Everything else — multi-agent orchestration, LLM provider abstraction, adapter interfaces, Experiment-PR approval gating, long-horizon loop, real persisted state, Cursor-style UX — is implemented as specified.
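To make the queue surface mentioned above concrete, here is an in-memory sketch of the four operations. It is only an illustration of the shape: the real `lib/queue.ts` is backed by the Job table, its signatures and status values are not shown in this README, and everything below (including `InMemoryQueue` and the `JobStatus` values) is assumed.

```typescript
// Assumed status values; the Job table's actual columns are not shown here.
type JobStatus = "queued" | "running" | "done" | "failed";

interface Job {
  id: number;
  type: string;
  payload: unknown;
  status: JobStatus;
  attempts: number;
  claimedBy?: string;
  error?: string;
}

class InMemoryQueue {
  private jobs: Job[] = [];
  private nextId = 1;

  enqueue(type: string, payload: unknown): Job {
    const job: Job = { id: this.nextId++, type, payload, status: "queued", attempts: 0 };
    this.jobs.push(job);
    return job;
  }

  // Claims the oldest queued job for a worker, or returns undefined if idle.
  // A DB-backed version would need to make this claim atomic.
  claimNextJob(workerId: string): Job | undefined {
    const job = this.jobs.find((j) => j.status === "queued");
    if (!job) return undefined;
    job.status = "running";
    job.claimedBy = workerId;
    job.attempts += 1;
    return job;
  }

  finishJob(id: number): void {
    this.get(id).status = "done";
  }

  failJob(id: number, error: string): void {
    const job = this.get(id);
    job.status = "failed";
    job.error = error;
  }

  private get(id: number): Job {
    const job = this.jobs.find((j) => j.id === id);
    if (!job) throw new Error(`Unknown job ${id}`);
    return job;
  }
}
```

A worker tick then reduces to: claim a job, dispatch on `job.type`, and call `finishJob` or `failJob` depending on whether the processor threw.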
- Single-user, no auth. Local-first per spec.
- Deterministic simulation adapter produces plausible but not scientifically validated property predictions. The UI labels results as "simulated, local adapter".
- Real lab / instrument / external literature / external patent integrations are stub interfaces only — never present their output as real wet-lab data.
- Safety / IP / novelty outputs are advisory, not legal or hazard advice.
- The autonomous loop only auto-advances campaigns that have at least one candidate already; bootstrap requires a human command.
npm test
> vitest run
✓ tests/unit/scoring.test.ts (6 tests)
✓ tests/unit/simulation.test.ts (4 tests)
✓ tests/unit/schemas.test.ts (5 tests)
✓ tests/unit/orchestrator.test.ts (2 tests, ~100ms total)
✓ tests/api/campaigns.test.ts (2 tests)
Test Files 5 passed (5)
Tests 19 passed (19)
npm run test:e2e
✓ tests/e2e/smoke.spec.ts (create campaign → command → PR → approve → results)
1 passed (~8s)
The E2E smoke test captures the full flow locally as screenshots under
tests/e2e/screenshots/01..05-*.png (workspace with pending PR, PR review,
results, final workspace, settings). These are gitignored; the curated set
under docs/ is what ships with the repo.
Also exercised end-to-end against the real claude-sonnet-4-5 (resolves to
claude-sonnet-4-5-20250929):
campaign: "Live cobalt-free cathode" (15% lower cost than NMC811, >200 mAh/g, >1000 cycles)
command:  "Generate 4 diverse cobalt-free cathode candidates and draft an
          Experiment PR under $20,000 focused on cycle life."
orchestrator       anthropic/claude-sonnet-4-5    46.2s   7-step plan
literature         anthropic/claude-sonnet-4-5    44.4s   5 takeaways persisted as Evidence
candidate-gen      anthropic/claude-sonnet-4-5   145.0s   8 real candidates: LFP-High-Purity,
                                                          LMFP-Mn-Doped, LFMP-Carbon-Coated,
                                                          LNM-Layered, NCA-Cobalt-Free,
                                                          NM-Disordered-Rock-Salt, LMO-Spinel,
                                                          LNMO-High-Voltage (real chemistries +
                                                          real rationale)
scoring            anthropic/claude-sonnet-4-5    39.0s
safety             anthropic/claude-sonnet-4-5    33.8s
ip-novelty         anthropic/claude-sonnet-4-5    60.0s
experiment-design  anthropic/claude-sonnet-4-5    36.1s   ExperimentPR "High-Performance LFP/LMFP
                                                          Validation and Optimization Batch",
                                                          5 experiments, $60k, low risk
report             anthropic/claude-sonnet-4-5     6.2s   campaign summary
↓ human approves PR ↓
simulation x5      deterministic local adapter      <3s   real per-candidate measurements:
                                                          LFP-High-Purity 180.4 mAh/g, 1116 cycles, $86.3/kWh
                                                          LMFP-Mn-Doped 211.3 mAh/g, 834 cycles, $94.6/kWh
                                                          LFMP-Carbon-C. 173.6 mAh/g, 1006 cycles, $81.2/kWh
                                                          (all cobalt-free, candidate sub-scores
                                                          refreshed, overall scores re-ranked)
results-analysis   anthropic/claude-sonnet-4-5    26.2s   candidate updates + next-action rec
report             anthropic/claude-sonnet-4-5     5.2s   campaign summary:
                                                          "...successfully identified LFP-High-Purity
                                                          as the top-performing candidate,
                                                          demonstrating superior electrochemical
                                                          performance metrics. LFMP-Carbon-Coated
                                                          emerged as a strong secondary candidate,
                                                          warranting direct comparison studies..."
orchestrator → "Initiate scale-up feasibility study for LFP-High-Purity (top performer)
               and conduct head-to-head comparison..."
end-to-end wall clock: ~7 min orchestration + <3s simulation + ~31s analysis = ~8 min
The four screenshots in this README (docs/hero-workspace.png, docs/pr-review.png,
docs/results.png, docs/settings.png) are all captured from this exact live run.
- Open the app — campaign sidebar populated, agent panel ready.
- Create a campaign (name, objective, domain) from the sidebar.
- Type a command (e.g. "Generate the next 4 candidates and draft an experiment PR under $20,000") — an AgentRun appears in the right panel.
- Background agents run; the timeline updates: orchestrator plan → literature search → candidate generation → scoring → safety → IP novelty → experiment design.
- An Experiment PR appears; the workspace shows a banner; the agent run transitions to waiting_approval.
- Open the PR (Cursor-diff style review with goal, proposed experiments, cost / runtime / risk, evidence, Approve / Edit / Reject / Ask Agent).
- Approve. Experiments are created with status queued. The worker picks them up, the simulation adapter runs, Result rows are persisted, Candidate scores refresh.
- Results-analysis + Report agents summarize what changed and recommend the next action. Campaign goes back to active. The autonomous loop will propose the next PR.
All state survives restarts. Run npm run db:reset to start over.



