[feat]: Add prompt optimisation script #28

Draft
LeoRoccoBreedt wants to merge 10 commits into main from add-optimisation-script

LeoRoccoBreedt commented Jun 5, 2026

Contributor

Resolves #22

Summary

Sets up a four-phase prompt optimisation pipeline for Scout using Opik's opik_optimizer. Phases 1 and 2 are complete and validated; Phases 3 and 4 are planned.

What's included

optimisation/run_prompt_optimisation.py — main orchestration script

  • Loads the starting prompt from Opik (scout-triage-system-prompt-initial) and runs the full Scout agent loop against the scout-triage-optimisation-runs-v2 dataset
  • Uses MetaPromptOptimizer with enable_context=False plus prompt_overrides, which gives the meta-LLM Scout-specific task context without advertising dataset fields as template variables. This prevents {data} and {issue_message} placeholders from appearing in candidate prompts
  • invoke_agent returns JSON {"comment": str, "escalated": bool}. The escalated flag is read from the simulator's label state after the run, which is the true signal from apply_label(), rather than inferred by matching comment text
  • Does not auto-save the best prompt — results are reviewed in Opik UI and promoted manually
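
As an illustration of the invoke_agent output contract described above, here is a minimal sketch. It assumes a simulator that exposes get_issue_data(target)["labels"] as mentioned in this PR; the FakeSimulator class, the build_agent_output helper, and the "needs-human" label name are hypothetical stand-ins, not the code in this branch.

```python
import json

# Hypothetical label name; the PR does not state which label marks an escalation.
ESCALATION_LABEL = "needs-human"


class FakeSimulator:
    """Hypothetical stand-in that only mimics apply_label() and get_issue_data()."""

    def __init__(self):
        self._issues = {}

    def apply_label(self, target, label):
        # Writes straight to simulator state; the comment text is not involved.
        self.get_issue_data(target)["labels"].append(label)

    def get_issue_data(self, target):
        return self._issues.setdefault(target, {"labels": []})


def build_agent_output(sim, target, comment):
    """Serialise the agent result, reading escalation from post-run label state."""
    labels = sim.get_issue_data(target)["labels"]
    return json.dumps({"comment": comment, "escalated": ESCALATION_LABEL in labels})


sim = FakeSimulator()
sim.apply_label(1, ESCALATION_LABEL)
print(build_agent_output(sim, 1, "Thanks for the report."))
```

The comment here never mentions the label, yet the flag is still set, which is the case that text matching would miss.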

optimisation/metrics.py — all metric functions, separated to keep the main script clean

  • Phase 1, flag_only_metric: binary escalation-flag correctness only, with no LLM judge call. Baseline 0.4 → 1.0 (+150%)
  • Phase 2, triage_accuracy: flag accuracy (weight 0.6) plus an LLM-judged reply quality score (weight 0.4). The judge uses claude-haiku-4-5 for cost efficiency. Baseline 0.36 → 0.924 (+157%)
  • Legacy metrics (escalation_accuracy, answer_relevance, scout_quality) kept for future phases
  • Shared helpers: _parse_output for JSON output parsing, _expected_escalation for dataset field access
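
The scoring arithmetic above can be sketched in plain Python. This is an assumption-laden illustration, not the contents of metrics.py: the function names are hypothetical, the real metrics may return Opik score objects rather than floats, and the fallback to a non-escalated empty reply on unparseable output is a choice made for this sketch only.

```python
import json


def parse_output(raw):
    """Parse the agent's JSON output (sketch; falls back to a non-escalated reply)."""
    fallback = {"comment": "", "escalated": False}
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    return {
        "comment": str(data.get("comment", "")),
        "escalated": bool(data.get("escalated", False)),
    }


def flag_score(raw_output, expected_escalated):
    """Phase 1 style score: 1.0 if the escalation flag matches, else 0.0."""
    return 1.0 if parse_output(raw_output)["escalated"] == expected_escalated else 0.0


def combined_score(flag, reply_quality, flag_weight=0.6, reply_weight=0.4):
    """Phase 2 style score: weighted sum of flag accuracy and reply quality."""
    return flag_weight * flag + reply_weight * reply_quality


def relative_gain(baseline, optimised):
    """Percentage improvement over the baseline score."""
    return (optimised - baseline) / baseline * 100


print(combined_score(flag_score('{"comment": "ok", "escalated": true}', True), 0.5))
print(round(relative_gain(0.36, 0.924)))
```

With these weights, a reply that gets the flag right but scores 0.5 on quality lands at roughly 0.8, so a wrong flag caps the score at 0.4 however good the reply is.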

Phase roadmap

| Phase | Metric | Optimiser | Status |
| --- | --- | --- | --- |
| 1 | flag_only_metric | MetaPromptOptimizer | ✅ Complete |
| 2 | triage_accuracy (flag + reply judge) | MetaPromptOptimizer | ✅ Complete |
| 3 | triage_accuracy | HierarchicalReflectiveOptimizer | Planned |
| 4 | MultiMetricObjective (accuracy + cost + latency) | EvolutionaryOptimizer | Planned |

Key design decisions

  • Escalation detection reads sim.get_issue_data(target)["labels"] post-run rather than scanning comment text — apply_label() writes to the simulator directly and may not mention the tag in the comment
  • enable_context=False + prompt_overrides preferred over task_context_columns — the latter is undocumented in the public API
  • Reply quality judge strips markdown code fences from Haiku's response before JSON parsing
  • Best prompts are not auto-promoted — reviewed in Opik dashboard first
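
For the code-fence stripping mentioned above, one possible approach is sketched below. The helper names and the "score" key in the example payload are hypothetical; the actual judge response schema is not shown in this PR.

```python
import json
import re

# Built programmatically so the example contains no literal fence of its own.
FENCE = "`" * 3

# Matches an optional language tag after the opening fence, then the body,
# then the closing fence. DOTALL lets the body span multiple lines.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fences(text):
    """Return the body of a fenced block, or the stripped text if unfenced."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_judge_response(text):
    """Parse judge output as JSON after removing any surrounding code fence."""
    return json.loads(strip_code_fences(text))


fenced = FENCE + 'json\n{"score": 0.8}\n' + FENCE
print(parse_judge_response(fenced))
```

Unfenced responses pass through unchanged, so the same parser works whether or not the judge wraps its JSON.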

Test plan

  • Run optimisation/run_prompt_optimisation.py and verify baseline + candidate scores are non-zero
  • Confirm no {data} or {issue_message} placeholders appear in generated candidate prompts
  • Check Opik dashboard for optimisation run traces and review best prompt before promoting
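
The placeholder check in the test plan could be automated along these lines. This is a sketch under stated assumptions: find_placeholders is a hypothetical helper, and candidate prompts are assumed to be available as plain strings.

```python
import re

# Dataset field names that must not leak into candidate prompts as template variables.
FORBIDDEN_PLACEHOLDERS = ("data", "issue_message")


def find_placeholders(prompt, names=FORBIDDEN_PLACEHOLDERS):
    """Return the sorted forbidden placeholder names found as {name} in the prompt."""
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"\{\s*(" + alternatives + r")\s*\}")
    return sorted(set(pattern.findall(prompt)))


print(find_placeholders("Triage {issue_message} using {data}."))
print(find_placeholders('Reply with JSON like {"comment": "..."}.'))
```

Only bare {name} tokens are flagged, so JSON examples inside a prompt do not trigger false positives.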

🤖 Generated with Claude Code

LeoRoccoBreedt and others added 10 commits June 10, 2026 15:47
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ute all traces to repo project

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…scalation detection

- Switch metric to flag_only_metric (ScoreResult) for phase 1
- invoke_agent returns JSON with comment + escalated bool
- Escalation detected via sim.get_issue_data labels post-run, not comment text
- Add _parse_output helper used by all metrics to parse invoke_agent output
- Set enable_context=False + prompt_overrides to prevent {data}/{issue_message} placeholders
- Remove auto-save of best prompt to Opik — user promotes manually

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… judge

- Extract all metrics into optimisation/metrics.py to keep main script clean
- Add triage_accuracy metric: flag accuracy (0.6) + reply quality judge (0.4)
- Add _reply_quality helper using claude-haiku for cost-efficient judging
- Strip markdown code fences from judge response before JSON parsing
- Update _scout_reasoning_override to include reply quality context
- Switch main() from flag_only_metric to triage_accuracy for phase 2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LeoRoccoBreedt force-pushed the add-optimisation-script branch from ca9cb79 to 44213c3 June 10, 2026 13:48

Development

Successfully merging this pull request may close these issues.

[FR]: Add prompt optimisation script