feat(optimizer): closed-loop prompt optimizer + improvement notes #68

Open
DataGomes wants to merge 2 commits into main from feat/prompt-optimizer

@DataGomes
Contributor

Summary

Prompt-side analog of the criteria critic loop. Two layers:

  • A. Per-failure improvement notes in the jury. Each judge call now writes <criterion>_improvement strings on failures. The notes surface in Score.metadata.criteria_results and flow through get_evaluation_details unchanged. Backward compatible.
  • B. optimize_prompt MCP tool. Train/test split → iterate (score sample → optimizer LLM proposes new prompt from failures+notes → re-score) → winner by held-out test score. Crash-safe last-good fallback on any iteration error.
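
As a rough illustration of layer B, the loop shape (score, propose, re-score, fall back on error) might look like the sketch below. This is not the code in `optimize_prompt.py`: the `score` and `propose` callables are hypothetical stand-ins for the in-process judge scoring and the optimizer LLM call, and stopping at the first failed iteration is an assumption about what "last-good fallback" means here.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (prompt, train score) per iteration


def run_loop_sketch(
    initial_prompt: str,
    score: Callable[[str], float],           # hypothetical: train pass rate for a prompt
    propose: Callable[[str, History], str],  # hypothetical: optimizer LLM call
    max_iterations: int = 3,
) -> History:
    """Return every prompt scored so far; an iteration error never propagates."""
    history: History = [(initial_prompt, score(initial_prompt))]
    for _ in range(max_iterations):
        current_prompt = history[-1][0]
        try:
            candidate = propose(current_prompt, history)
            history.append((candidate, score(candidate)))
        except Exception:
            # Assumed fallback: keep the last good state and stop iterating.
            break
    return history
```

The caller would then choose the winner from the returned history using held-out test scores rather than the train scores recorded here.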

Frontend adds a Prompts Optimized tab alongside Results — rail of runs on the left, detail pane with a train-pass-rate chart (winner's test score as reference line), initial-vs-winner prompt diff, and an expandable iteration history with per-iter rationale.

Why

Users today can run an eval, see what failed, and tweak a prompt by hand. The criteria critic loop fixed the judge side; this fixes the prompt side. Skill-creator's run_loop.py is the precedent — same shape (read failures → LLM proposes → re-run → keep best) just applied to prompt templates instead of skill descriptions.

The optimizer LLM is explicitly told to preserve structure on long prompts: edit lines, tweak constraints, reorder, but don't compress a carefully tuned multi-section prompt into a short one without strong evidence that the structure is wrong. Anti-overfit machinery carried over from run_loop.py: stratified random split with a fixed seed, optimizer sees only train results, winner chosen by test score (ties go to the earlier iteration), full prompt in context (no truncation), and prior attempts fed back with "don't repeat" framing.
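
For readers unfamiliar with the two anti-overfit pieces named above, here is a minimal sketch of a seeded stratified split and an earlier-iteration-wins tie-break. The function names, the `label_of` stratification key, and the rounding rule are assumptions for illustration; the PR does not state what the split stratifies on.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def stratified_split(
    samples: Sequence[T],
    label_of: Callable[[T], Hashable],  # hypothetical stratification key
    holdout: float = 0.4,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Split each label group separately so train and test keep similar label mixes."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    groups = defaultdict(list)
    for sample in samples:
        groups[label_of(sample)].append(sample)
    train: List[T] = []
    test: List[T] = []
    for label in sorted(groups, key=repr):  # stable group order across runs
        group = list(groups[label])
        rng.shuffle(group)
        n_test = round(len(group) * holdout)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def pick_winner(test_scores: Sequence[float]) -> int:
    """Index of the best test score; strict '>' means ties go to the earlier iteration."""
    if not test_scores:
        raise ValueError("no iterations to choose from")
    best = 0
    for i, value in enumerate(test_scores):
        if value > test_scores[best]:
            best = i
    return best
```

With test scores `[0.5, 0.8, 0.8]`, `pick_winner` returns index 1, the earlier of the two tied iterations.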

Pragmatic v1 simplification

Iterations run in-process (one judge call per sample, default Bedrock model from the BedrockClient singleton) instead of spawning Inspect AI subprocesses per iteration. This is roughly 10× faster per iteration. Tradeoff: iterations don't show up in list_evaluations. Users who want a "real eval" entry for the winner can call `create_eval_config(prompts=[winner])` themselves afterward. Multi-provider scoring during the loop is deferred to v2.

What's new

Backend (MCP)

  • `eval_mcp/tools/optimize_prompt.py` — the loop
  • `eval_mcp/tools/list_optimizations.py` + `get_optimization_details.py` — read tools
  • `eval_mcp/core/user_storage.py` — `save_optimization_to_db` / get / list / delete
  • `eval_mcp/tools/create_config.py` — jury schema + system prompt now capture per-criterion improvement notes
  • `eval_mcp/server.py` — three new `@mcp.tool` registrations
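
To show how the per-criterion improvement notes could be consumed downstream, here is a small sketch that collects them from one sample's criteria results. The flat dictionary shape and the example criterion names are assumptions; only the `<criterion>_improvement` key suffix comes from this PR.

```python
from typing import Any, Dict, List


def collect_improvement_notes(criteria_results: Dict[str, Any]) -> List[str]:
    """Gather non-empty '<criterion>_improvement' strings as 'criterion: note' lines."""
    suffix = "_improvement"
    notes: List[str] = []
    for key, value in criteria_results.items():
        # Old logs without improvement keys simply produce an empty list.
        if key.endswith(suffix) and isinstance(value, str) and value.strip():
            criterion = key[: -len(suffix)]
            notes.append(f"{criterion}: {value.strip()}")
    return notes


# Hypothetical payload: 'accuracy' failed and carries a note, 'tone' passed.
example = {
    "accuracy": False,
    "accuracy_improvement": "Cite the figure from the source table.",
    "tone": True,
}
print(collect_improvement_notes(example))
```

Because keys without the suffix are skipped, logs written before this change yield no notes instead of raising.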

HTTP API

  • `backend/api/optimizations.py` — `GET /api/optimizations/{list,detail}`
  • `backend/api/main.py` — router registration

Frontend (Next.js App Router)

  • `frontend/components/Header.tsx` — Prompts Optimized nav entry
  • `frontend/app/optimizations/page.tsx` — page with rail + detail layout
  • `frontend/components/optimizations/OptimizationRail.tsx` — list of runs
  • `frontend/components/optimizations/OptimizationDetail.tsx` — chart, prompt diff, iteration history

Tests

  • `tests/test_optimize_prompt.py` — 21 unit tests (split, winner selection, failure filtering, loop branches with mocked LLM) + one Bedrock-gated integration test

Env knobs

  • `EVAL_MCP_OPTIMIZE_MAX_ITERATIONS` (default 3)
  • `EVAL_MCP_OPTIMIZE_SAMPLE_SIZE` (default 10)
  • `EVAL_MCP_OPTIMIZE_TEST_HOLDOUT` (default 0.4)
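
A minimal sketch of reading these knobs with their documented defaults follows. The helper name and the returned dictionary keys are hypothetical; the actual parsing in `optimize_prompt.py` may differ, for example in how it validates ranges.

```python
import os
from typing import Any, Dict, Mapping, Optional


def optimizer_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Read the three optimizer knobs, falling back to the defaults listed above."""
    source = os.environ if env is None else env
    return {
        "max_iterations": int(source.get("EVAL_MCP_OPTIMIZE_MAX_ITERATIONS", "3")),
        "sample_size": int(source.get("EVAL_MCP_OPTIMIZE_SAMPLE_SIZE", "10")),
        "test_holdout": float(source.get("EVAL_MCP_OPTIMIZE_TEST_HOLDOUT", "0.4")),
    }
```

Passing a plain mapping instead of relying on `os.environ` keeps the helper easy to exercise in unit tests.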

Test plan

  • Unit tests: `uv run --extra dev pytest tests/test_optimize_prompt.py -v -k 'not integration'` → 21/21 passing
  • Integration test against real Bedrock: `uv run --extra dev pytest tests/test_optimize_prompt.py::test_integration_optimizer_does_not_regress -v -s` → passes in 17s. (Plain `{question}` on factual goldens converges at iter 0 — the regression-prevention assertion holds; real signal needs a weaker starting prompt + harder dataset.)
  • Full Python suite: `uv run --extra dev pytest tests/ --ignore=tests/test_mcp_eval.py --ignore=tests/test_langchain_subprocess_e2e.py --ignore=tests/test_data_api.py -x` → 152 passed, 1 skipped.
  • Frontend TypeScript: `cd frontend && ./node_modules/.bin/tsc --noEmit` → clean.
  • Post-merge manual smoke: start MCP + frontend, run `optimize_prompt` from chat against a real dataset, click through the new tab and verify chart + diff render.

DataGomes added 2 commits May 19, 2026 02:53
Adds the prompt-side analog of the criteria critic loop. Two layers:

A. Improvement notes in the jury. Each judge call now writes a
   per-criterion <name>_improvement string on failures (one sentence on
   what the answer should change). Backward compatible — old logs work
   unchanged. Surfaces in Score.metadata.criteria_results, flows
   through the existing eval_results.py read path automatically.

B. optimize_prompt MCP tool. Train/test split, in-process scoring,
   optimizer LLM proposes new prompts based on failures + improvement
   notes, winner selected by held-out test score (ties broken by
   earlier iter). Crash-safe last-good fallback on any iteration error.
   Same shape as generate_judge.refine_criteria_loop.

Anti-overfit machinery carried from skill-creator's run_loop.py:
- 60/40 stratified random split with fixed seed.
- Optimizer LLM sees only train results; held-out test never leaks
  into the improvement context.
- Winner picked by test score, not train.
- History feed includes prior attempts with "don't repeat" framing.
- Full prompt is passed in (no truncation) so the optimizer can do
  targeted edits instead of compressing a long prompt into a short one.

The optimizer prompt explicitly says to preserve structure on long
prompts — edit lines, tweak constraints, reorder, but do not compress
a carefully-tuned multi-section prompt without strong evidence the
structure itself is wrong.

Read tools (list_optimizations, get_optimization_details) mirror
list_evaluations / get_evaluation_details. HTTP API exposes the same
data at /api/optimizations/{list,detail} mirroring /api/compare.
Frontend adds a Prompts Optimized nav entry with a rail + detail
layout: train pass-rate chart with winner's test score reference line,
initial-vs-winner prompt diff, expandable iteration history with
per-iter rationale.

Pragmatic v1 simplification: iterations run in-process (single judge
call per sample, default Bedrock model from BedrockClient singleton)
instead of spawning Inspect AI subprocesses per iteration. ~10× faster
per iteration; tradeoff is the iterations don't show in
list_evaluations. Users who want a "real eval" entry for the winner
call create_eval_config(prompts=[winner]) afterward.

Tests: 21 unit tests cover the split, winner selection, failure
filtering, and the loop's branches with mocked LLM. One Bedrock-gated
integration test asserts the winner doesn't regress below the initial
prompt's test score. Frontend type-checks clean.

Env knobs:
- EVAL_MCP_OPTIMIZE_MAX_ITERATIONS (default 3)
- EVAL_MCP_OPTIMIZE_SAMPLE_SIZE (default 10)
- EVAL_MCP_OPTIMIZE_TEST_HOLDOUT (default 0.4)

Two small follow-ups:

- The new page wasn't rendering <Header />, so clicking "Prompts
  Optimized" wiped the nav bar. Render it like /data and /chat do.
- Reorder NAV so "Prompts Optimized" lands after "Data" — it's the
  least-frequently-used surface and belongs at the tail.