feat(optimizer): closed-loop prompt optimizer + improvement notes by DataGomes · Pull Request #68 · awslabs/llm-evaluation-system

DataGomes · 2026-05-19T02:53:53Z

Summary

Prompt-side analog of the criteria critic loop. Two layers:

A. Per-failure improvement notes in the jury. Each judge call now writes <criterion>_improvement strings on failures. Surfaces in Score.metadata.criteria_results, flows through get_evaluation_details unchanged. Backward compatible.
B. optimize_prompt MCP tool. Train/test split → iterate (score sample → optimizer LLM proposes new prompt from failures+notes → re-score) → winner by held-out test score. Crash-safe last-good fallback on any iteration error.
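
A rough sketch of the layer B loop, for orientation only. The names `optimize`, `Iteration`, `score`, and `propose` are hypothetical and not taken from `eval_mcp/tools/optimize_prompt.py`. `score` is assumed to return a pass rate plus a list of improvement notes, and `propose` stands in for the optimizer LLM call. The fallback shown here (stop on error, keep what has been scored so far) is one plausible reading of "crash-safe last-good fallback".

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Iteration:
    index: int
    prompt: str
    train_score: float
    test_score: float


def optimize(
    initial_prompt: str,
    train: Sequence[dict],
    test: Sequence[dict],
    score: Callable[[str, Sequence[dict]], tuple[float, list[str]]],
    propose: Callable[[str, list[str], list[str]], str],
    max_iterations: int = 3,
) -> Iteration:
    """Hypothetical closed loop: score, propose from train failures, re-score."""
    train_score, notes = score(initial_prompt, train)
    test_score, _ = score(initial_prompt, test)
    history = [Iteration(0, initial_prompt, train_score, test_score)]
    prompt = initial_prompt

    for i in range(1, max_iterations + 1):
        try:
            # The optimizer sees only train-side notes and the prior attempts.
            candidate = propose(prompt, notes, [h.prompt for h in history])
            train_score, notes = score(candidate, train)
            test_score, _ = score(candidate, test)
        except Exception:
            # Assumed fallback: stop iterating and keep results gathered so far.
            break
        history.append(Iteration(i, candidate, train_score, test_score))
        prompt = candidate

    # max() returns the first maximal element, so ties go to the earlier iteration.
    return max(history, key=lambda h: h.test_score)
```

Because iteration 0 is always in `history`, the returned winner can never score below the initial prompt on the held-out set, which matches the regression-prevention assertion in the integration test.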

Frontend adds a Prompts Optimized tab alongside Results — rail of runs on the left, detail pane with a train-pass-rate chart (winner's test score as reference line), initial-vs-winner prompt diff, and an expandable iteration history with per-iter rationale.

Why

Users today can run an eval, see what failed, and tweak a prompt by hand. The criteria critic loop fixed the judge side; this fixes the prompt side. Skill-creator's run_loop.py is the precedent — same shape (read failures → LLM proposes → re-run → keep best) just applied to prompt templates instead of skill descriptions.

The optimizer LLM is explicitly told to preserve structure on long prompts — edit lines, tweak constraints, reorder, but don't compress a carefully-tuned multi-section prompt into a short one without strong evidence the structure is wrong. Anti-overfit machinery carried from run_loop.py: stratified random split with fixed seed, optimizer sees only train results, winner by test score (ties broken earlier-iter-wins), full prompt in context (no truncation), prior attempts fed back with "don't repeat" framing.
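
To make the "stratified random split with fixed seed" concrete, here is a minimal sketch. The function name `stratified_split`, the `stratum_of` callback, and the seed value 42 are assumptions for illustration; the PR does not say what the split stratifies on or which seed it uses.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence


def stratified_split(
    samples: Sequence[dict],
    stratum_of: Callable[[dict], Hashable],
    holdout: float = 0.4,
    seed: int = 42,
) -> tuple[list[dict], list[dict]]:
    """Split samples into (train, test), holding out `holdout` of each stratum."""
    rng = random.Random(seed)  # local RNG: same seed gives the same split
    groups = defaultdict(list)
    for sample in samples:
        groups[stratum_of(sample)].append(sample)

    train, test = [], []
    for key in sorted(groups, key=repr):  # stable stratum order across runs
        members = groups[key][:]
        rng.shuffle(members)
        n_test = round(len(members) * holdout)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

Using a private `random.Random(seed)` instead of the module-level RNG keeps the split reproducible even if other code draws random numbers in the same process.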

Pragmatic v1 simplification

Iterations run in-process (one judge call per sample, default Bedrock model from the BedrockClient singleton) instead of spawning Inspect AI subprocesses per iteration. ~10× faster per iteration. Tradeoff: iterations don't show up in list_evaluations. Users who want a "real eval" entry for the winner call `create_eval_config(prompts=[winner])` themselves afterward. Multi-provider scoring during the loop deferred to v2.

What's new

Backend (MCP)

`eval_mcp/tools/optimize_prompt.py` — the loop
`eval_mcp/tools/list_optimizations.py` + `get_optimization_details.py` — read tools
`eval_mcp/core/user_storage.py` — `save_optimization_to_db` / get / list / delete
`eval_mcp/tools/create_config.py` — jury schema + system prompt now capture per-criterion improvement notes
`eval_mcp/server.py` — three new `@mcp.tool` registrations
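
For readers consuming the new notes, a hedged sketch of pulling them out of a result. It assumes `criteria_results` is a flat dict that maps each criterion name to a pass/fail boolean and stores the note under `<criterion>_improvement`; the actual shape in `Score.metadata` may differ, and `collect_improvement_notes` is a hypothetical helper, not part of this PR.

```python
def collect_improvement_notes(criteria_results: dict) -> dict[str, str]:
    """Return {criterion: note} for failed criteria that carry a note."""
    suffix = "_improvement"
    notes = {}
    for name, passed in criteria_results.items():
        # Skip the note entries themselves and any criterion that passed.
        if name.endswith(suffix) or passed:
            continue
        note = criteria_results.get(name + suffix)
        if note:  # older logs have no note; skip rather than fail
            notes[name] = note
    return notes
```

Logs written before this change simply yield an empty dict, which is the backward-compatible behavior the summary describes.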

HTTP API

`backend/api/optimizations.py` — `GET /api/optimizations/{list,detail}`
`backend/api/main.py` — router registration

Frontend (Next.js App Router)

`frontend/components/Header.tsx` — Prompts Optimized nav entry
`frontend/app/optimizations/page.tsx` — page with rail + detail layout
`frontend/components/optimizations/OptimizationRail.tsx` — list of runs
`frontend/components/optimizations/OptimizationDetail.tsx` — chart, prompt diff, iteration history

Tests

`tests/test_optimize_prompt.py` — 21 unit tests (split, winner selection, failure filtering, loop branches with mocked LLM) + one Bedrock-gated integration test

Env knobs

`EVAL_MCP_OPTIMIZE_MAX_ITERATIONS` (default 3)
`EVAL_MCP_OPTIMIZE_SAMPLE_SIZE` (default 10)
`EVAL_MCP_OPTIMIZE_TEST_HOLDOUT` (default 0.4)
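
A small sketch of how these knobs might be read, using the defaults listed above. The helper name `optimizer_settings` and the returned dict keys are assumptions; only the variable names and default values come from this PR.

```python
import os
from typing import Mapping


def optimizer_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the three knobs, falling back to the documented defaults."""
    return {
        "max_iterations": int(env.get("EVAL_MCP_OPTIMIZE_MAX_ITERATIONS", "3")),
        "sample_size": int(env.get("EVAL_MCP_OPTIMIZE_SAMPLE_SIZE", "10")),
        "test_holdout": float(env.get("EVAL_MCP_OPTIMIZE_TEST_HOLDOUT", "0.4")),
    }
```

With the default holdout of 0.4, a sample of 10 would give the 60/40 train/test split mentioned in the commit message.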

Test plan

Unit tests: `uv run --extra dev pytest tests/test_optimize_prompt.py -v -k 'not integration'` → 21/21 passing
Integration test against real Bedrock: `uv run --extra dev pytest tests/test_optimize_prompt.py::test_integration_optimizer_does_not_regress -v -s` → passes in 17s. (Plain `{question}` on factual goldens converges at iter 0 — the regression-prevention assertion holds; real signal needs a weaker starting prompt + harder dataset.)
Full Python suite: `uv run --extra dev pytest tests/ --ignore=tests/test_mcp_eval.py --ignore=tests/test_langchain_subprocess_e2e.py --ignore=tests/test_data_api.py -x` → 152 passed, 1 skipped.
Frontend TypeScript: `cd frontend && ./node_modules/.bin/tsc --noEmit` → clean.
Post-merge manual smoke: start MCP + frontend, run `optimize_prompt` from chat against a real dataset, click through the new tab and verify chart + diff render.

Adds the prompt-side analog of the criteria critic loop. Two layers:

A. Improvement notes in the jury. Each judge call now writes a per-criterion <name>_improvement string on failures (one sentence on what the answer should change). Backward compatible: old logs work unchanged. Surfaces in Score.metadata.criteria_results, flows through the existing eval_results.py read path automatically.
B. optimize_prompt MCP tool. Train/test split, in-process scoring, optimizer LLM proposes new prompts based on failures + improvement notes, winner selected by held-out test score (ties broken by earlier iter). Crash-safe last-good fallback on any iteration error. Same shape as generate_judge.refine_criteria_loop.

Anti-overfit machinery carried from skill-creator's run_loop.py:

- 60/40 stratified random split with fixed seed.
- Optimizer LLM sees only train results; held-out test never leaks into the improvement context.
- Winner picked by test score, not train.
- History feed includes prior attempts with "don't repeat" framing.
- Full prompt is passed in (no truncation) so the optimizer can do targeted edits instead of compressing a long prompt into a short one.

The optimizer prompt explicitly says to preserve structure on long prompts: edit lines, tweak constraints, reorder, but do not compress a carefully-tuned multi-section prompt without strong evidence the structure itself is wrong.

Read tools (list_optimizations, get_optimization_details) mirror list_evaluations / get_evaluation_details. HTTP API exposes the same data at /api/optimizations/{list,detail} mirroring /api/compare.

Frontend adds a Prompts Optimized nav entry with a rail + detail layout: train pass-rate chart with winner's test score reference line, initial-vs-winner prompt diff, expandable iteration history with per-iter rationale.

Pragmatic v1 simplification: iterations run in-process (single judge call per sample, default Bedrock model from BedrockClient singleton) instead of spawning Inspect AI subprocesses per iteration. ~10× faster per iteration; tradeoff is the iterations don't show in list_evaluations. Users who want a "real eval" entry for the winner call create_eval_config(prompts=[winner]) afterward.

Tests: 21 unit tests cover the split, winner selection, failure filtering, and the loop's branches with mocked LLM. One Bedrock-gated integration test asserts the winner doesn't regress below the initial prompt's test score. Frontend type-checks clean.

Env knobs:

- EVAL_MCP_OPTIMIZE_MAX_ITERATIONS (default 3)
- EVAL_MCP_OPTIMIZE_SAMPLE_SIZE (default 10)
- EVAL_MCP_OPTIMIZE_TEST_HOLDOUT (default 0.4)

Two small follow-ups:

- The new page wasn't rendering <Header />, so clicking "Prompts Optimized" wiped the nav bar. Render it like /data and /chat do.
- Reorder NAV so "Prompts Optimized" lands after "Data"; it's the least-frequently-used surface and belongs at the tail.

DataGomes added 2 commits May 19, 2026 02:53

DataGomes wants to merge 2 commits into main from feat/prompt-optimizer
