arnavbathla/cursor-for-material-R-D

Matter Agent

A minimalist, Cursor-style workspace for autonomous materials-science R&D campaigns.

A campaign is the repo. Candidate materials are files. Experiment plans are PRs. Simulations / lab runs are CI. The agent panel is Cursor's agent mode. The scientist reviews and approves experiment batches; agents quietly advance the work in the background.

This is a working v1, not a mockup. It includes Postgres-style persistence (SQLite via Prisma in v1), a durable background worker that runs as a separate process, ten typed agent specialists, swappable LLM providers (OpenAI / Anthropic / deterministic local fallback), pluggable adapter interfaces for literature / patent / simulation / lab / instrument / notification, and an Experiment-PR human-approval flow.

Matter Agent workspace, driven end-to-end by Claude Sonnet 4.5 against a live cobalt-free cathode campaign

Above: a live campaign ("Find a cobalt-free cathode with 15% lower cost than NMC811, >200 mAh/g, >1000 cycles, slurry-coating compatible"), driven end-to-end by claude-sonnet-4-5. Eight real cathode candidates were generated, scored, safety-checked, and IP-checked, then packaged into an Experiment PR. After a human approved the PR, the experiments were simulated, the candidates were re-ranked, and the run was summarized with a recommended next action. Every row is persisted and every agent step is auditable. The right pane is the agent timeline (orchestrator → simulations → results-analysis → report → next plan); the agent steps are all real LLM calls.


Quick start

npm install
npm run db:push      # creates SQLite schema (file: ./dev.db)
npm run db:seed      # seeds 4 sample campaigns
npm run dev:all      # web (3000) + worker concurrently

Open http://localhost:3000. Pick a seeded campaign, type a command into the agent panel ("Generate the next 4 candidates and draft an experiment PR"), watch the orchestrator plan → search literature → generate candidates → score / safety / IP → create an Experiment PR. Open the PR, click Approve, watch simulations queue, results land, and candidate rankings refresh.

If you prefer separate terminals:

npm run dev          # Next.js on :3000
npm run worker       # background worker in another shell

Other scripts:

npm test             # vitest (unit + API): scoring, simulation, schemas, orchestrator
npm run test:e2e     # playwright smoke (start dev + worker first)
npm run typecheck    # tsc --noEmit
npm run build        # production build
npm run db:reset     # nuke + reseed the local DB

Architecture

matter-agent/
  app/                  Next.js 14 App Router pages + API routes
    api/                campaigns, candidates, experiments, results, evidence, prs,
                        runs (incl. SSE /runs/[id]/stream), settings
    campaigns/[id]      3-pane workspace (sidebar / center / agent panel)
    campaigns/[id]/prs  Experiment-PR review screen
    settings            provider / model / adapter / queue status

  agents/               Typed multi-agent system
    types.ts            zod schemas for every agent IO
    runner.ts           AgentRunner — persists tasks + events, retries, LLM accounting
    specialists.ts      10 agent definitions (system prompts + schemas)
    orchestrator.ts     long-horizon orchestration: plan → lit → gen → score → safety
                        → IP → design → Experiment PR (waiting_approval)
                        + post-approval finalization
    simulation.ts       experiment runner that calls the simulation adapter
    scoring.ts          deterministic overall-score function + sub-score derivation

  llm/                  swappable LLM provider layer
    provider.ts         interface + factory + LLMCall accounting
    openai.ts           OpenAI Chat Completions JSON-mode wrapper
    anthropic.ts        Anthropic Messages JSON wrapper
    deterministic.ts    seeded local stub that produces schema-valid output from
                        request hash, sampling from per-domain tables. Lets the
                        app run end-to-end with no API keys.

  adapters/             pluggable real-world interfaces
    types.ts            LiteratureSourceAdapter, PatentSearchAdapter, SimulationAdapter,
                        LabAdapter, InstrumentResultAdapter, NotificationAdapter
    literature/         local-corpus literature search
    patents/            local-corpus patent search
    simulation/         deterministic property-prediction simulator (seeded, reproducible)
    lab/                manual-lab placeholder (creates queued manual tasks)
    notifications/      console notifier
    registry.ts         the single place to swap real adapters in

  lib/                  db, queue (DB-backed durable), config, api helpers, serializers

  worker/               separate Node process (BullMQ-style API, DB-backed)
    index.ts            workerId, tick loop, claim → process → finish/fail
    processors/         run-orchestrator, run-experiment, campaign-loop

  prisma/
    schema.prisma       all domain tables (campaigns, candidates, experiment_prs,
                        experiments, results, evidence, agent_runs, agent_tasks,
                        agent_events, llm_calls, settings, job)
    seed.ts             4 sample campaigns (cathode, electrolyte, ammonia catalyst,
                        anti-corrosion coating)

  data/corpus/          local literature + patent JSON corpora

  components/           AppShell, CampaignSidebar, CampaignHeader, ArtifactTabs,
                        AgentPanel, CandidateTable, CandidateDetail, ExperimentQueue,
                        ResultTable, EvidenceList, ObjectiveView, ExperimentPRReview,
                        SettingsPanel, StatusBadge

  tests/
    unit/               scoring, simulation, schemas, orchestrator (state-machine)
    api/                campaigns route
    e2e/                Playwright smoke: create campaign → command → approve PR
                        → see results

Agent flow

User command
  → POST /api/runs (AgentRun row created, status=queued)
  → enqueue 'orchestrator' job
worker.tick()
  → orchestrate()
    → OrchestratorAgent.plan
    → LiteratureAgent (local corpus)
    → CandidateGenerationAgent → Candidate rows
    → ScoringAgent + SafetyAgent + IPNoveltyAgent → updates Candidate rows
    → ExperimentDesignAgent → ExperimentPR (approval_status=pending)
    → AgentRun.status = waiting_approval (BLOCKS for human approval)
User approves PR
  → orchestratePostApproval() creates Experiment rows + enqueues 'experiment' jobs
worker.tick()
  → runExperiment() per Experiment
    → SimulationAdapter.simulate() → Result row + Evidence row
    → updates Candidate.predictedProperties + sub-scores + overallScore
  → when all PR experiments complete
    → ResultsAnalysisAgent → candidate updates + next-action recommendation
    → ReportAgent → AgentRun.outputSummary
    → Campaign.status = active (loop continues)
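
The flow above behaves like a small state machine over AgentRun.status with a hard stop at the approval boundary. The following is a minimal sketch of that idea, not the repository's code: only `queued` and `waiting_approval` are named in this README, so the other statuses, the transition table, and the `advance` helper are assumptions, and PR rejection is left out.

```typescript
// Hypothetical status set: the README only names `queued` and `waiting_approval`.
type RunStatus = "queued" | "running" | "waiting_approval" | "completed" | "failed";

// Allowed moves between statuses (assumed, not taken from the repository).
const TRANSITIONS: Record<RunStatus, readonly RunStatus[]> = {
  queued: ["running"],
  running: ["waiting_approval", "completed", "failed"],
  waiting_approval: ["running"], // resumes for post-approval finalization
  completed: [],
  failed: [],
};

function advance(current: RunStatus, next: RunStatus, humanApproved = false): RunStatus {
  if (TRANSITIONS[current].indexOf(next) === -1) {
    throw new Error(`illegal transition: ${current} -> ${next}`);
  }
  // The approval gate: nothing leaves `waiting_approval` without a human decision.
  if (current === "waiting_approval" && !humanApproved) {
    throw new Error("approval gate: human approval required");
  }
  return next;
}

// Usage: a run reaches the gate and resumes only after approval.
let runStatus: RunStatus = "queued";
runStatus = advance(runStatus, "running");
runStatus = advance(runStatus, "waiting_approval");
runStatus = advance(runStatus, "running", true);
runStatus = advance(runStatus, "completed");
```

Keeping the gate inside the transition function means no caller can skip it by accident, which is one way to enforce "every expensive action is gated on a human".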

Every agent IO is validated with zod. Failed validation retries once, then the task is marked failed and surfaces in the agent panel as a red event. Every LLM call writes an LLMCall row with provider, model, prompt hash, input/output summaries, and latency. Raw model output never directly mutates the DB.
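
The validate-then-retry-once rule can be sketched as follows. This is an illustration under assumptions rather than the AgentRunner implementation: `Validator<T>` stands in for a zod schema's `parse` so the snippet has no dependencies, the call is synchronous for brevity, and `runValidated`, `TaskOutcome`, and `parseScore` are hypothetical names.

```typescript
// `Validator<T>` stands in for a zod schema's `parse`: it returns T or throws.
type Validator<T> = (raw: unknown) => T;

type TaskOutcome<T> =
  | { status: "succeeded"; attempts: number; output: T }
  | { status: "failed"; attempts: number; error: string };

// Kept synchronous for brevity; a real provider call would be awaited.
function runValidated<T>(
  callModel: () => unknown,
  validate: Validator<T>,
  maxAttempts = 2, // first attempt plus the single retry described above
): TaskOutcome<T> {
  let lastError = "unknown validation error";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = callModel();
    try {
      // Only the validated value is returned; raw output never reaches the caller.
      return { status: "succeeded", attempts: attempt, output: validate(raw) };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { status: "failed", attempts: maxAttempts, error: lastError };
}

// Usage with a hand-written validator (hypothetical output shape).
const parseScore: Validator<{ score: number }> = (raw) => {
  const candidate = raw as { score?: unknown } | null;
  if (!candidate || typeof candidate.score !== "number") {
    throw new Error("score must be a number");
  }
  return { score: candidate.score };
};

let calls = 0;
const outcome = runValidated(
  () => (++calls === 1 ? { score: "high" } : { score: 0.8 }),
  parseScore,
);
```

In this usage the first response fails validation and the retry passes, so `outcome` reports two attempts. A caller would persist only `outcome.output`, which is what keeps raw model text away from the database.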

Experiment PR review

Experiment PR review screen

A real GitHub-PR-style review surface: goal, proposed experiments per candidate, cost / runtime / risk, information-gain estimate, the evidence the agent used, and Approve / Edit / Reject / Ask Agent. Approval is the only thing that lets the worker spend real (simulated) money — every expensive action is gated on a human.

Results land back as candidate updates

Per-experiment results refreshing candidate scores

When experiments finish, raw measurements (specific capacity, cycle life, cost per kWh, thermal runaway, manufacturability, cobalt fraction) are persisted as Result rows with Evidence provenance, and propagated back into per-candidate sub-scores and the overall ranking.
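
A deterministic scoring step of this kind could look like the sketch below. The targets, weights, and function names are hypothetical (the README does not publish the formula in agents/scoring.ts, and the NMC811 cost baseline is not given), so treat this as one plausible shape: normalize each measurement against a target, then take a weighted sum.

```typescript
interface Measurements {
  specificCapacity: number; // mAh/g
  cycleLife: number;        // cycles
  costPerKwh: number;       // $/kWh
}

interface MeasuredCandidate {
  name: string;
  measured: Measurements;
}

// Hypothetical targets and weights, chosen only for illustration.
const TARGETS = { specificCapacity: 200, cycleLife: 1000, costPerKwh: 100 };
const WEIGHTS = { capacity: 0.4, cycleLife: 0.4, cost: 0.2 };

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function subScores(m: Measurements) {
  return {
    capacity: clamp01(m.specificCapacity / TARGETS.specificCapacity),
    cycleLife: clamp01(m.cycleLife / TARGETS.cycleLife),
    cost: clamp01(TARGETS.costPerKwh / m.costPerKwh), // lower cost scores higher
  };
}

function overallScore(m: Measurements): number {
  const s = subScores(m);
  return WEIGHTS.capacity * s.capacity + WEIGHTS.cycleLife * s.cycleLife + WEIGHTS.cost * s.cost;
}

// Returns a new array sorted from highest to lowest overall score.
function rankCandidates(candidates: MeasuredCandidate[]): MeasuredCandidate[] {
  return candidates.slice().sort((a, b) => overallScore(b.measured) - overallScore(a.measured));
}

// Usage with the three measurements reported in the live run later in this README.
const ranked = rankCandidates([
  { name: "LMFP-Mn-Doped", measured: { specificCapacity: 211.3, cycleLife: 834, costPerKwh: 94.6 } },
  { name: "LFP-High-Purity", measured: { specificCapacity: 180.4, cycleLife: 1116, costPerKwh: 86.3 } },
  { name: "LFMP-Carbon-Coated", measured: { specificCapacity: 173.6, cycleLife: 1006, costPerKwh: 81.2 } },
]);
```

Because the function is pure, the same Result rows always produce the same ranking, which is what makes a re-rank after each experiment batch reproducible.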

Long-horizon autonomy

A campaign-loop repeatable job ticks every 30s. For each active/running campaign with no pending PR and no in-flight experiments, it auto-enqueues an orchestrator run to propose the next experiment PR. The loop stops at every approval boundary — risky / expensive actions always require a human.
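
The eligibility rule for the loop can be written as a single predicate. This sketch assumes a hypothetical `CampaignSnapshot` shape and helper names; the conditions themselves come from the paragraph above and from the Known limitations section (at least one candidate must already exist).

```typescript
// Hypothetical snapshot shape; the real loop would read these counts from the database.
interface CampaignSnapshot {
  id: string;
  status: "active" | "running" | "paused"; // `paused` is an assumed extra state
  pendingPrCount: number;
  inFlightExperimentCount: number;
  candidateCount: number;
}

function shouldAutoEnqueue(c: CampaignSnapshot): boolean {
  const live = c.status === "active" || c.status === "running";
  const atApprovalBoundary = c.pendingPrCount > 0; // waiting on a human
  const busy = c.inFlightExperimentCount > 0;      // experiments still running
  const bootstrapped = c.candidateCount > 0;       // see Known limitations
  return live && !atApprovalBoundary && !busy && bootstrapped;
}

// One tick of the loop: enqueue an orchestrator run for each eligible campaign.
function campaignLoopTick(
  campaigns: CampaignSnapshot[],
  enqueue: (campaignId: string) => void,
): string[] {
  const enqueued: string[] = [];
  for (const c of campaigns) {
    if (shouldAutoEnqueue(c)) {
      enqueue(c.id);
      enqueued.push(c.id);
    }
  }
  return enqueued;
}
```

A scheduler would call `campaignLoopTick` every 30 s. Because a pending PR makes the predicate false, the loop cannot stack a second proposal on top of one that is still awaiting review.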


Configuration

All settings have safe defaults. Copy .env.example to .env and adjust:

Variable              Purpose                                   Default
DATABASE_URL          Prisma datasource                         file:./dev.db
DEFAULT_LLM_PROVIDER  openai / anthropic / deterministic        deterministic
OPENAI_API_KEY        Optional; switch provider in /settings    (empty)
OPENAI_MODEL          Model id string (no hard-coded versions)  gpt-4o-mini
ANTHROPIC_API_KEY     Optional                                  (empty)
ANTHROPIC_MODEL       Model id string                           claude-3-5-sonnet-latest
APP_URL               Used for SSR URLs / Playwright            http://localhost:3000
WORKER_TICK_MS        Worker polling interval (ms)              1500
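
As an illustration of how these variables and defaults might be resolved, here is a sketch. `loadConfig`, `AppConfig`, and `Env` are hypothetical names rather than the contents of lib/config, and the rule that a provider without its API key falls back to `deterministic` is an assumption based on the Local Deterministic Mode behavior this README describes.

```typescript
type Provider = "openai" | "anthropic" | "deterministic";

interface AppConfig {
  databaseUrl: string;
  provider: Provider;
  openaiModel: string;
  anthropicModel: string;
  appUrl: string;
  workerTickMs: number;
}

type Env = { [key: string]: string | undefined };

function loadConfig(env: Env): AppConfig {
  const requested = env["DEFAULT_LLM_PROVIDER"] || "deterministic";
  const hasKey =
    (requested === "openai" && !!env["OPENAI_API_KEY"]) ||
    (requested === "anthropic" && !!env["ANTHROPIC_API_KEY"]);
  // Assumption: a provider without its key falls back to deterministic mode.
  const provider: Provider = hasKey ? (requested as Provider) : "deterministic";

  // Number(undefined) is NaN, so a missing or invalid value falls back to the default.
  const tick = Number(env["WORKER_TICK_MS"]);
  return {
    databaseUrl: env["DATABASE_URL"] || "file:./dev.db",
    provider,
    openaiModel: env["OPENAI_MODEL"] || "gpt-4o-mini",
    anthropicModel: env["ANTHROPIC_MODEL"] || "claude-3-5-sonnet-latest",
    appUrl: env["APP_URL"] || "http://localhost:3000",
    workerTickMs: isFinite(tick) && tick > 0 ? tick : 1500,
  };
}

// Usage: with no variables set, every field takes its documented default.
const defaults = loadConfig({});
```

In a Node process the argument would typically be `process.env`. Passing the environment in explicitly keeps the function easy to test.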

When no API keys are present, the system runs in Local Deterministic Mode: the deterministic provider produces structured, plausible outputs by hashing the prompt and sampling from per-domain tables. The UI surfaces this clearly in the agent panel and on the Settings page. Without keys, the app is fully functional end to end.
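
The hash-then-sample idea behind the deterministic provider can be sketched as below. This is not the code in llm/deterministic.ts: the choice of FNV-1a and mulberry32, the table contents, and the names `DOMAIN_TABLES`, `CandidateStub`, and `generateCandidates` are all assumptions made for illustration.

```typescript
// 32-bit FNV-1a over UTF-16 code units: a stable hash of the request text.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), state | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical per-domain tables; the repository's actual tables are not shown here.
const DOMAIN_TABLES: { [domain: string]: string[] } = {
  cathode: ["LFP", "LMFP", "LNMO", "LMO-Spinel"],
  electrolyte: ["LiPF6 in EC/DMC", "LiFSI in DME", "LiTFSI in DOL/DME"],
};

interface CandidateStub {
  name: string;
  seed: number;
}

function generateCandidates(domain: string, prompt: string, count: number): CandidateStub[] {
  const table = DOMAIN_TABLES[domain];
  if (!table) throw new Error(`no table for domain: ${domain}`);
  const seed = fnv1a(`${domain}:${prompt}`); // same request, same seed
  const next = mulberry32(seed);
  const out: CandidateStub[] = [];
  for (let i = 0; i < count; i++) {
    out.push({ name: table[Math.floor(next() * table.length)], seed });
  }
  return out;
}

// Usage: two calls with the same request yield identical output.
const first = generateCandidates("cathode", "Generate the next 4 candidates", 4);
const second = generateCandidates("cathode", "Generate the next 4 candidates", 4);
```

Seeding from the request hash is what makes runs reproducible without API keys: tests can assert on exact output, and repeating a command does not change the result.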

Settings: provider / model / adapter / queue status


Deviations from the original spec

For practical local-first runnability the v1 swaps two infrastructure choices:

  1. SQLite via Prisma instead of Postgres. The schema is identical; switching to Postgres only requires changing datasource.provider and the DATABASE_URL.
  2. DB-backed durable queue (the Job table + lib/queue.ts) instead of BullMQ + Redis. The surface (enqueue / claimNextJob / finishJob / failJob) is intentionally BullMQ-shaped so swapping is mechanical. Worker is still a fully separate Node process; state survives restarts.

Both choices preserve every architectural property the spec required (real persistence, separate worker process, durable jobs) with zero external infra. Postgres + Redis + BullMQ is the recommended production swap.
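
To make the queue surface concrete, here is an in-memory sketch of the four operations named above (enqueue / claimNextJob / finishJob / failJob). An array stands in for the Job table, and the method signatures, the `Job` fields, and the retry policy are assumptions rather than the contents of lib/queue.ts.

```typescript
type JobStatus = "queued" | "active" | "completed" | "failed";

interface Job {
  id: number;
  name: string;
  payload: unknown;
  status: JobStatus;
  attempts: number;
  lockedBy?: string;
  error?: string;
}

// In-memory stand-in for the Job table; method signatures are assumptions.
class JobQueue {
  private jobs: Job[] = [];
  private nextId = 1;

  enqueue(name: string, payload: unknown): Job {
    const job: Job = { id: this.nextId++, name, payload, status: "queued", attempts: 0 };
    this.jobs.push(job);
    return job;
  }

  // Claims the oldest queued job. Against a real database this must be one
  // atomic step (a transaction or conditional update) so that two workers
  // cannot claim the same row.
  claimNextJob(workerId: string): Job | undefined {
    for (const job of this.jobs) {
      if (job.status === "queued") {
        job.status = "active";
        job.lockedBy = workerId;
        job.attempts += 1;
        return job;
      }
    }
    return undefined;
  }

  finishJob(id: number): void {
    const job = this.mustGet(id);
    job.status = "completed";
    delete job.lockedBy;
  }

  // Hypothetical retry policy: requeue until maxAttempts is reached.
  failJob(id: number, error: string, maxAttempts = 3): void {
    const job = this.mustGet(id);
    job.error = error;
    delete job.lockedBy;
    job.status = job.attempts < maxAttempts ? "queued" : "failed";
  }

  private mustGet(id: number): Job {
    for (const job of this.jobs) {
      if (job.id === id) return job;
    }
    throw new Error(`unknown job id: ${id}`);
  }
}

// Usage: one worker tick is claim → process → finish/fail.
const jobQueue = new JobQueue();
jobQueue.enqueue("orchestrator", { runId: "run-1" });
const claimed = jobQueue.claimNextJob("worker-1");
if (claimed) jobQueue.finishJob(claimed.id);
```

Because job state lives in rows rather than in worker memory, a restarted worker can simply resume claiming, which matches the durability property described above.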

Everything else — multi-agent orchestration, LLM provider abstraction, adapter interfaces, Experiment-PR approval gating, long-horizon loop, real persisted state, Cursor-style UX — is implemented as specified.


Known limitations

  • Single-user, no auth. Local-first per spec.
  • Deterministic simulation adapter produces plausible but not scientifically validated property predictions. The UI labels results as "simulated, local adapter".
  • Real lab / instrument / external literature / external patent integrations are stub interfaces only — never present their output as real wet-lab data.
  • Safety / IP / novelty outputs are advisory, not legal or hazard advice.
  • The autonomous loop only auto-advances campaigns that have at least one candidate already; bootstrap requires a human command.

Test results

npm test
> vitest run

  ✓ tests/unit/scoring.test.ts        (6 tests)
  ✓ tests/unit/simulation.test.ts     (4 tests)
  ✓ tests/unit/schemas.test.ts        (5 tests)
  ✓ tests/unit/orchestrator.test.ts   (2 tests, ~100ms total)
  ✓ tests/api/campaigns.test.ts       (2 tests)

  Test Files  5 passed (5)
  Tests       19 passed (19)

npm run test:e2e
  ✓ tests/e2e/smoke.spec.ts (create campaign → command → PR → approve → results)
                                                                       1 passed (~8s)

E2E smoke captures the full flow locally as screenshots under tests/e2e/screenshots/01..05-*.png (workspace with pending PR, PR review, results, final workspace, settings). These are gitignored; the curated set under docs/ is what ships with the repo.

Verified against the live Anthropic API

The full flow was also exercised end-to-end against the real claude-sonnet-4-5 (resolves to claude-sonnet-4-5-20250929):

campaign:  "Live cobalt-free cathode"  (15% lower cost than NMC811, >200 mAh/g, >1000 cycles)
command:   "Generate 4 diverse cobalt-free cathode candidates and draft an
            Experiment PR under $20,000 focused on cycle life."

orchestrator      anthropic/claude-sonnet-4-5    46.2s  7-step plan
literature        anthropic/claude-sonnet-4-5    44.4s  5 takeaways persisted as Evidence
candidate-gen     anthropic/claude-sonnet-4-5   145.0s  8 real candidates: LFP-High-Purity,
                                                       LMFP-Mn-Doped, LFMP-Carbon-Coated,
                                                       LNM-Layered, NCA-Cobalt-Free,
                                                       NM-Disordered-Rock-Salt, LMO-Spinel,
                                                       LNMO-High-Voltage  (real chemistries +
                                                       real rationale)
scoring           anthropic/claude-sonnet-4-5    39.0s
safety            anthropic/claude-sonnet-4-5    33.8s
ip-novelty        anthropic/claude-sonnet-4-5    60.0s
experiment-design anthropic/claude-sonnet-4-5    36.1s  ExperimentPR "High-Performance LFP/LMFP
                                                       Validation and Optimization Batch",
                                                       5 experiments, $60k, low risk
report            anthropic/claude-sonnet-4-5     6.2s  campaign summary

  ↓ human approves PR ↓

simulation x5     deterministic local adapter    <3s    real per-candidate measurements:
                                                       LFP-High-Purity 180.4 mAh/g, 1116 cycles, $86.3/kWh
                                                       LMFP-Mn-Doped   211.3 mAh/g,  834 cycles, $94.6/kWh
                                                       LFMP-Carbon-C.  173.6 mAh/g, 1006 cycles, $81.2/kWh
                                                       (all cobalt-free, candidate sub-scores
                                                       refreshed, overall scores re-ranked)

results-analysis  anthropic/claude-sonnet-4-5    26.2s  candidate updates + next-action rec
report            anthropic/claude-sonnet-4-5     5.2s  campaign summary:
                                                       "...successfully identified LFP-High-Purity
                                                       as the top-performing candidate,
                                                       demonstrating superior electrochemical
                                                       performance metrics. LFMP-Carbon-Coated
                                                       emerged as a strong secondary candidate,
                                                       warranting direct comparison studies..."
orchestrator      → "Initiate scale-up feasibility study for LFP-High-Purity (top performer)
                     and conduct head-to-head comparison..."

end-to-end wall clock:  ~7 min orchestration + <3s simulation + ~31s analysis = ~8 min

The four screenshots in this README (docs/hero-workspace.png, docs/pr-review.png, docs/results.png, docs/settings.png) are all captured from this exact live run.


Verified flow

  1. Open the app — campaign sidebar populated, agent panel ready.
  2. Create a campaign (name, objective, domain) from the sidebar.
  3. Type a command (e.g. "Generate the next 4 candidates and draft an experiment PR under $20,000") — an AgentRun appears in the right panel.
  4. Background agents run; the timeline updates: orchestrator plan → literature search → candidate generation → scoring → safety → IP novelty → experiment design.
  5. An Experiment PR appears; the workspace shows a banner; the agent run transitions to waiting_approval.
  6. Open the PR (Cursor-diff style review with goal, proposed experiments, cost / runtime / risk, evidence, Approve / Edit / Reject / Ask Agent).
  7. Approve. Experiments are created with status queued. The worker picks them up, the simulation adapter runs, Result rows are persisted, Candidate scores refresh.
  8. Results-analysis + Report agents summarize what changed and recommend the next action. Campaign goes back to active. The autonomous loop will propose the next PR.

All state survives restarts. Run npm run db:reset to start over.

About

AI Agent Harness for Material R&D
