feat(optimizer): closed-loop prompt optimizer + improvement notes #68
Open
DataGomes wants to merge 2 commits into
Adds the prompt-side analog of the criteria critic loop. Two layers:
A. Improvement notes in the jury. Each judge call now writes a
per-criterion <name>_improvement string on failures (one sentence on
what the answer should change). Backward compatible — old logs work
unchanged. Surfaces in Score.metadata.criteria_results, flows
through the existing eval_results.py read path automatically.
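A minimal sketch of how a reader of these logs might collect the notes while tolerating older logs that lack them. The flat dict shape (criterion name mapped to a boolean, with a sibling `<name>_improvement` key) and the helper name `collect_improvement_notes` are assumptions made for illustration; the actual layout of `criteria_results` is not shown in this PR text.

```python
from typing import Any, Dict

SUFFIX = "_improvement"


def collect_improvement_notes(judge_output: Dict[str, Any]) -> Dict[str, str]:
    """Return {criterion: note} for failed criteria that carry a note.

    Assumed (hypothetical) shape: {"<name>": bool, "<name>_improvement": str}.
    """
    notes: Dict[str, str] = {}
    for key, value in judge_output.items():
        if key.endswith(SUFFIX):
            continue  # skip the note fields themselves
        if value is False:
            # Older logs have no note key, so .get() keeps them readable.
            note = judge_output.get(key + SUFFIX)
            if isinstance(note, str) and note.strip():
                notes[key] = note.strip()
    return notes


new_log = {
    "cites_sources": False,
    "cites_sources_improvement": "Name the document each claim comes from.",
    "is_concise": True,
}
old_log = {"cites_sources": False, "is_concise": True}

print(collect_improvement_notes(new_log))
print(collect_improvement_notes(old_log))
```

Reading the note with `.get()` is what keeps old logs working in this sketch: a failed criterion without a note simply contributes nothing.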
B. optimize_prompt MCP tool. Train/test split, in-process scoring,
optimizer LLM proposes new prompts based on failures + improvement
notes, winner selected by held-out test score (ties broken by
earlier iter). Crash-safe last-good fallback on any iteration error.
Same shape as generate_judge.refine_criteria_loop.
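The loop described in B can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `run_loop`, `Iteration`, and the three callables are hypothetical names, and stopping at the first failed iteration is one possible reading of "last-good fallback".

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TrainHistory = List[Tuple[str, float]]  # (prompt, train score) pairs only


@dataclass(frozen=True)
class Iteration:
    index: int
    prompt: str
    train_score: float
    test_score: float


def run_loop(
    initial_prompt: str,
    score_train: Callable[[str], float],
    score_test: Callable[[str], float],
    propose: Callable[[str, TrainHistory], str],
    max_iterations: int = 3,
) -> Iteration:
    # Iteration 0 is the untouched initial prompt, so it can still win.
    iterations = [
        Iteration(0, initial_prompt, score_train(initial_prompt), score_test(initial_prompt))
    ]
    for index in range(1, max_iterations + 1):
        last_good = iterations[-1]
        try:
            # The optimizer only ever sees train-side results.
            train_history = [(it.prompt, it.train_score) for it in iterations]
            candidate = propose(last_good.prompt, train_history)
            iterations.append(
                Iteration(index, candidate, score_train(candidate), score_test(candidate))
            )
        except Exception:
            # Assumption: stop on error and fall back to what already scored.
            break
    # Highest test score wins; on ties the earlier iteration wins.
    return max(iterations, key=lambda it: (it.test_score, -it.index))
```

Keeping test scores out of `train_history` is the part that matters for the held-out guarantee: the test split is used only to pick the winner.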
Anti-overfit machinery carried from skill-creator's run_loop.py:
- 60/40 stratified random split with fixed seed.
- Optimizer LLM sees only train results; held-out test never leaks
into the improvement context.
- Winner picked by test score, not train.
- History feed includes prior attempts with "don't repeat" framing.
- Full prompt is passed in (no truncation) so the optimizer can do
targeted edits instead of compressing a long prompt into a short one.
The optimizer prompt explicitly says to preserve structure on long
prompts — edit lines, tweak constraints, reorder, but do not compress
a carefully-tuned multi-section prompt without strong evidence the
structure itself is wrong.
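The 60/40 stratified split with a fixed seed from the list above could look something like the sketch below. The stratification key is an assumption: the PR text does not say what samples are stratified by, so the caller supplies a hypothetical `stratum_of` function, and `stratified_split` is an illustrative name.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def stratified_split(
    samples: Sequence[T],
    stratum_of: Callable[[T], Hashable],
    test_fraction: float = 0.4,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Split samples into (train, test), holding out test_fraction per stratum."""
    rng = random.Random(seed)  # private RNG: same seed gives the same split
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for sample in samples:
        groups[stratum_of(sample)].append(sample)

    train: List[T] = []
    test: List[T] = []
    # Sort strata so the result does not depend on input grouping order.
    for key in sorted(groups, key=repr):
        members = list(groups[key])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


samples = [{"id": i, "label": "easy" if i < 5 else "hard"} for i in range(10)]
train, test = stratified_split(samples, lambda s: s["label"], test_fraction=0.4, seed=42)
print(len(train), len(test))
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the split reproducible even if other code draws random numbers in between.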
Read tools (list_optimizations, get_optimization_details) mirror
list_evaluations / get_evaluation_details. HTTP API exposes the same
data at /api/optimizations/{list,detail} mirroring /api/compare.
Frontend adds a Prompts Optimized nav entry with a rail + detail
layout: train pass-rate chart with winner's test score reference line,
initial-vs-winner prompt diff, expandable iteration history with
per-iter rationale.
Pragmatic v1 simplification: iterations run in-process (single judge
call per sample, default Bedrock model from BedrockClient singleton)
instead of spawning Inspect AI subprocesses per iteration. ~10× faster
per iteration; tradeoff is the iterations don't show in
list_evaluations. Users who want a "real eval" entry for the winner
call create_eval_config(prompts=[winner]) afterward.
Tests: 21 unit tests cover the split, winner selection, failure
filtering, and the loop's branches with mocked LLM. One Bedrock-gated
integration test asserts the winner doesn't regress below the initial
prompt's test score. Frontend type-checks clean.
Env knobs:
- EVAL_MCP_OPTIMIZE_MAX_ITERATIONS (default 3)
- EVAL_MCP_OPTIMIZE_SAMPLE_SIZE (default 10)
- EVAL_MCP_OPTIMIZE_TEST_HOLDOUT (default 0.4)
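As an illustration of how these knobs might be read, the sketch below parses the three variables with the defaults listed above. The `OptimizeSettings` class, the `load_settings` helper, and the range check on the holdout fraction are assumptions for the example, not code from this PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OptimizeSettings:
    max_iterations: int = 3
    sample_size: int = 10
    test_holdout: float = 0.4


def load_settings(env: Mapping[str, str] = os.environ) -> OptimizeSettings:
    """Read the optimizer knobs from env, falling back to the documented defaults."""
    settings = OptimizeSettings(
        max_iterations=int(env.get("EVAL_MCP_OPTIMIZE_MAX_ITERATIONS", "3")),
        sample_size=int(env.get("EVAL_MCP_OPTIMIZE_SAMPLE_SIZE", "10")),
        test_holdout=float(env.get("EVAL_MCP_OPTIMIZE_TEST_HOLDOUT", "0.4")),
    )
    # Assumed validation: a holdout of 0 or 1 would leave one split empty.
    if not 0.0 < settings.test_holdout < 1.0:
        raise ValueError("EVAL_MCP_OPTIMIZE_TEST_HOLDOUT must be between 0 and 1")
    return settings


print(load_settings({}))
print(load_settings({"EVAL_MCP_OPTIMIZE_MAX_ITERATIONS": "5"}))
```

Passing the environment in as a mapping makes the parsing easy to unit-test without touching the real process environment.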
Two small follow-ups:
- The new page wasn't rendering <Header />, so clicking "Prompts Optimized" wiped the nav bar. Render it like /data and /chat do.
- Reorder NAV so "Prompts Optimized" lands after "Data": it's the least-frequently-used surface and belongs at the tail.
Summary
Prompt-side analog of the criteria critic loop. Two layers:
1. Improvement notes in the jury. Judge calls write per-criterion `<criterion>_improvement` strings on failures. Surfaces in `Score.metadata.criteria_results`, flows through `get_evaluation_details` unchanged. Backward compatible.
2. `optimize_prompt` MCP tool. Train/test split → iterate (score sample → optimizer LLM proposes new prompt from failures + notes → re-score) → winner by held-out test score. Crash-safe last-good fallback on any iteration error.
Frontend adds a Prompts Optimized tab alongside Results: rail of runs on the left, detail pane with a train-pass-rate chart (winner's test score as reference line), initial-vs-winner prompt diff, and an expandable iteration history with per-iter rationale.
Why
Users today can run an eval, see what failed, and tweak a prompt by hand. The criteria critic loop fixed the judge side; this fixes the prompt side. Skill-creator's `run_loop.py` is the precedent: same shape (read failures → LLM proposes → re-run → keep best), just applied to prompt templates instead of skill descriptions.
The optimizer LLM is explicitly told to preserve structure on long prompts: edit lines, tweak constraints, reorder, but don't compress a carefully-tuned multi-section prompt into a short one without strong evidence the structure is wrong. Anti-overfit machinery carried from `run_loop.py`: stratified random split with fixed seed, optimizer sees only train results, winner by test score (ties broken earlier-iter-wins), full prompt in context (no truncation), prior attempts fed back with "don't repeat" framing.
Pragmatic v1 simplification
Iterations run in-process (one judge call per sample, default Bedrock model from the `BedrockClient` singleton) instead of spawning Inspect AI subprocesses per iteration. ~10× faster per iteration. Tradeoff: iterations don't show up in `list_evaluations`. Users who want a "real eval" entry for the winner call `create_eval_config(prompts=[winner])` themselves afterward. Multi-provider scoring during the loop deferred to v2.
What's new
Backend (MCP)
HTTP API
Frontend (Next.js App Router)
Tests
Env knobs
Test plan