[feat]: Add prompt optimisation script by LeoRoccoBreedt · Pull Request #28 · comet-ml/scout-repo-agent

LeoRoccoBreedt · 2026-06-05T17:23:35Z

Resolves #22

Summary

Implements a 4-phase prompt optimisation pipeline for Scout using Opik's opik_optimizer, with Phases 1 and 2 complete and validated.

What's included

optimisation/run_prompt_optimisation.py — main orchestration script

Loads the starting prompt from Opik (scout-triage-system-prompt-initial) and runs the full Scout agent loop against the scout-triage-optimisation-runs-v2 dataset
Uses MetaPromptOptimizer with enable_context=False + prompt_overrides, which gives the meta-LLM Scout-specific task context without advertising dataset fields as template variables (this prevents {data}/{issue_message} placeholders from appearing in candidate prompts)
invoke_agent returns JSON {"comment": str, "escalated": bool}, where escalated is read from the simulator's post-run label state: the true signal from apply_label(), not a match on comment text
Does not auto-save the best prompt — results are reviewed in Opik UI and promoted manually
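
The output contract described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the label name `needs-human` and the helper `build_agent_output` are hypothetical, and the labels are assumed to arrive as a list of strings (the shape returned by sim.get_issue_data(target)["labels"] is not stated here).

```python
import json

# Hypothetical label name; the PR does not state which label marks an escalation.
ESCALATION_LABEL = "needs-human"


def build_agent_output(comment: str, labels: list[str]) -> str:
    """Serialise one agent run as {"comment": str, "escalated": bool}.

    `labels` stands in for the simulator's post-run label state, assumed
    here to be a list of label names.
    """
    # The flag comes from label state only, never from the comment text.
    escalated = ESCALATION_LABEL in labels
    return json.dumps({"comment": comment, "escalated": escalated})
```

Deriving the flag from labels means a reply that never mentions the tag is still scored as an escalation when the label was applied.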

optimisation/metrics.py — all metric functions, separated to keep the main script clean

Phase 1 — flag_only_metric: binary escalation flag correctness only, no LLM judge call. Baseline 0.4 → 1.0 (+150%)
Phase 2 — triage_accuracy: flag accuracy (0.6 weight) + reply quality LLM judge (0.4 weight). Judge uses claude-haiku-4-5 for cost efficiency. Baseline 0.36 → 0.924 (+157%)
Legacy metrics (escalation_accuracy, answer_relevance, scout_quality) kept for future phases
Shared helpers: _parse_output for JSON output parsing, _expected_escalation for dataset field access
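
A minimal sketch of how the two metrics combine, under stated assumptions: the function names below are illustrative stand-ins rather than the ones in optimisation/metrics.py, the scores are returned as plain floats (the commit notes mention ScoreResult for the real metric), the judge score is passed in as a number instead of being fetched from an LLM, and the fallback on malformed output is an assumption.

```python
import json


def parse_output(raw: str) -> dict:
    """Parse the agent's JSON output into {"comment", "escalated"}.

    Assumption: malformed output is treated as an empty, non-escalated reply.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"comment": "", "escalated": False}
    if not isinstance(data, dict):
        return {"comment": "", "escalated": False}
    return {
        "comment": str(data.get("comment", "")),
        "escalated": bool(data.get("escalated", False)),
    }


def flag_only_score(raw_output: str, expected_escalation: bool) -> float:
    """Phase 1: 1.0 if the escalation flag matches the expected value, else 0.0."""
    return 1.0 if parse_output(raw_output)["escalated"] == expected_escalation else 0.0


def triage_score(raw_output: str, expected_escalation: bool, reply_quality: float) -> float:
    """Phase 2: 0.6 * flag accuracy + 0.4 * reply quality (clamped to [0, 1])."""
    flag = flag_only_score(raw_output, expected_escalation)
    quality = min(max(reply_quality, 0.0), 1.0)
    return 0.6 * flag + 0.4 * quality
```

With these weights, a correct flag paired with a judge score of 0.5 yields 0.6 + 0.4 × 0.5 = 0.8, while a wrong flag caps the score at 0.4 regardless of reply quality.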

Phase roadmap

| Phase | Metric | Optimiser | Status |
|---|---|---|---|
| 1 | `flag_only_metric` | MetaPromptOptimizer | ✅ Complete |
| 2 | `triage_accuracy` (flag + reply judge) | MetaPromptOptimizer | ✅ Complete |
| 3 | `triage_accuracy` | HierarchicalReflectiveOptimizer | Planned |
| 4 | `MultiMetricObjective` (accuracy + cost + latency) | EvolutionaryOptimizer | Planned |

Key design decisions

Escalation detection reads sim.get_issue_data(target)["labels"] post-run rather than scanning comment text — apply_label() writes to the simulator directly and may not mention the tag in the comment
enable_context=False + prompt_overrides preferred over task_context_columns — the latter is undocumented in the public API
Reply quality judge strips markdown code fences from Haiku's response before JSON parsing
Best prompts are not auto-promoted — reviewed in Opik dashboard first
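
The fence-stripping step for the judge response might look something like the sketch below. The helper names are hypothetical and the regular expression is one possible approach, not necessarily what the PR does; it handles a single fenced block wrapping the whole response.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches a response that is entirely one fenced block, with an optional
# language tag (e.g. "json") after the opening fence.
_FENCED_BLOCK = re.compile(
    r"^" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"$",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Return the body of a fenced block, or the trimmed text if unfenced."""
    stripped = text.strip()
    match = _FENCED_BLOCK.match(stripped)
    return match.group(1).strip() if match else stripped


def parse_judge_response(text: str) -> dict:
    """Parse the judge's JSON reply after removing any wrapping code fence."""
    return json.loads(strip_code_fences(text))
```

Unfenced JSON passes through unchanged, so the same parser works whether or not the judge model wraps its answer.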

Test plan

Run optimisation/run_prompt_optimisation.py and verify baseline + candidate scores are non-zero
Confirm no {data} or {issue_message} placeholders appear in generated candidate prompts
Check Opik dashboard for optimisation run traces and review best prompt before promoting
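
The placeholder check in the second step could be automated with a small helper along these lines. This is a sketch only: `find_leaked_placeholders` is a hypothetical name, and it assumes candidate prompts are available as plain strings.

```python
import re

# Dataset field names that must not surface as template variables.
FORBIDDEN_PLACEHOLDERS = ("data", "issue_message")


def find_leaked_placeholders(prompt: str, names=FORBIDDEN_PLACEHOLDERS) -> list[str]:
    """Return the forbidden placeholder names found in `prompt`, sorted."""
    alternatives = "|".join(re.escape(name) for name in names)
    # Tolerate optional whitespace inside the braces, e.g. "{ data }".
    pattern = re.compile(r"\{\s*(" + alternatives + r")\s*\}")
    return sorted(set(pattern.findall(prompt)))
```

An empty list for every candidate prompt means the check passes; other braces, such as `{other}`, are ignored.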

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…scalation detection
- Switch metric to flag_only_metric (ScoreResult) for phase 1
- invoke_agent returns JSON with comment + escalated bool
- Escalation detected via sim.get_issue_data labels post-run, not comment text
- Add _parse_output helper used by all metrics to parse invoke_agent output
- Set enable_context=False + prompt_overrides to prevent {data}/{issue_message} placeholders
- Remove auto-save of best prompt to Opik; user promotes manually

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… judge
- Extract all metrics into optimisation/metrics.py to keep main script clean
- Add triage_accuracy metric: flag accuracy (0.6) + reply quality judge (0.4)
- Add _reply_quality helper using claude-haiku for cost-efficient judging
- Strip markdown code fences from judge response before JSON parsing
- Update _scout_reasoning_override to include reply quality context
- Switch main() from flag_only_metric to triage_accuracy for phase 2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LeoRoccoBreedt and others added 10 commits June 10, 2026 15:47

feat: add opik-optimizer dependency for prompt optimisation

8c1f817

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: initial prompt optimisation script

fcbfa54

refactor: seed the message passed to the agent with the build_message…

b56b3b1

… method and use within the optimisation

feat(optimisation): fix prompt pipeline, add scout_quality metric, ro…

904795a

…ute all traces to repo project Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

chore: ignore csv, lock, and json files

b705fc2

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

chore: change the default system prompt to optimise

b59b4d8

fix: suppress E402 lint errors in optimisation script

09bd6c5

feat: add phase 1 of a flag only metric

d7da2a8

LeoRoccoBreedt force-pushed the add-optimisation-script branch from ca9cb79 to 44213c3 Compare June 10, 2026 13:48

LeoRoccoBreedt wants to merge 10 commits into main from add-optimisation-script
