[feat]: Add prompt optimisation script #28
Draft
LeoRoccoBreedt wants to merge 10 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… method and use within the optimisation
…ute all traces to repo project
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…scalation detection
- Switch metric to flag_only_metric (ScoreResult) for phase 1
- invoke_agent returns JSON with comment + escalated bool
- Escalation detected via sim.get_issue_data labels post-run, not comment text
- Add _parse_output helper used by all metrics to parse invoke_agent output
- Set enable_context=False + prompt_overrides to prevent {data}/{issue_message} placeholders
- Remove auto-save of best prompt to Opik — user promotes manually
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
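The output contract described in this commit (JSON with a comment plus an escalated bool read from post-run labels) could look roughly like the sketch below. It is an illustration only, not the script's code: FakeSimulator, the "escalate" label name, the agent callable, and the bodies of invoke_agent, parse_agent_output and flag_only_score are all assumptions. Only the JSON shape and the get_issue_data(...)["labels"] lookup come from the commit message, and a plain float stands in for the ScoreResult the real metric returns.

```python
import json

ESCALATION_LABEL = "escalate"  # assumed label name, not taken from the PR


class FakeSimulator:
    """Hypothetical stand-in for the issue simulator used by the script."""

    def __init__(self):
        self._issues = {}

    def add_issue(self, target):
        self._issues[target] = {"labels": []}

    def apply_label(self, target, label):
        self._issues[target]["labels"].append(label)

    def get_issue_data(self, target):
        return self._issues[target]


def invoke_agent(sim, target, agent):
    """Run the agent, then read escalation from labels, not from the comment."""
    comment = agent(sim, target)
    labels = sim.get_issue_data(target)["labels"]
    return json.dumps({"comment": comment, "escalated": ESCALATION_LABEL in labels})


def parse_agent_output(raw):
    """Parse invoke_agent output; malformed output counts as not escalated."""
    fallback = {"comment": "", "escalated": False}
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    return {
        "comment": str(data.get("comment", "")),
        "escalated": bool(data.get("escalated", False)),
    }


def flag_only_score(raw_output, expected_escalated):
    """Binary score: 1.0 when the escalation flag matches the expectation."""
    parsed = parse_agent_output(raw_output)
    return 1.0 if parsed["escalated"] == expected_escalated else 0.0


def quiet_escalator(sim, target):
    """Example agent that escalates without mentioning it in the reply."""
    sim.apply_label(target, ESCALATION_LABEL)
    return "Thanks for the report, the team will take a look."
```

With an agent like quiet_escalator, matching on comment text would miss the escalation, while the label lookup still reports it.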
… judge
- Extract all metrics into optimisation/metrics.py to keep main script clean
- Add triage_accuracy metric: flag accuracy (0.6) + reply quality judge (0.4)
- Add _reply_quality helper using claude-haiku for cost-efficient judging
- Strip markdown code fences from judge response before JSON parsing
- Update _scout_reasoning_override to include reply quality context
- Switch main() from flag_only_metric to triage_accuracy for phase 2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
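Two of the steps in this commit, stripping code fences from the judge response and combining the two signals with 0.6/0.4 weights, can be sketched as below. This is a minimal sketch under assumptions: the function names, the regular expression, and the {"score": ...} judge response format are hypothetical, and only the weights come from the commit message.

```python
import json
import re

FLAG_WEIGHT = 0.6   # flag accuracy weight, from the commit message
REPLY_WEIGHT = 0.4  # reply quality weight, from the commit message

FENCE = "`" * 3  # built at runtime so this example nests cleanly in a fence
_FENCE_RE = re.compile(
    r"^" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"$", re.DOTALL
)


def strip_code_fences(text):
    """Return the body of a fenced block, or the trimmed text if unfenced."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped


def parse_judge_score(response_text):
    """Parse an assumed {"score": <number>} payload and clamp it to [0, 1]."""
    payload = json.loads(strip_code_fences(response_text))
    return min(1.0, max(0.0, float(payload["score"])))


def combined_score(flag_correct, reply_quality):
    """Weighted sum of binary flag correctness and judged reply quality."""
    flag_part = 1.0 if flag_correct else 0.0
    return FLAG_WEIGHT * flag_part + REPLY_WEIGHT * reply_quality
```

Under this weighting a wrong escalation flag caps the score at 0.4 however good the reply is, so flag accuracy dominates the result.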
ca9cb79 to 44213c3
Resolves #22
Summary
Implements a 4-phase prompt optimisation pipeline for Scout using Opik's opik_optimizer, with Phases 1 and 2 complete and validated.

What's included
- optimisation/run_prompt_optimisation.py: main orchestration script
  - Starts from the initial prompt (scout-triage-system-prompt-initial) and runs the full Scout agent loop against the scout-triage-optimisation-runs-v2 dataset
  - Uses MetaPromptOptimizer with enable_context=False + prompt_overrides to give the meta-LLM Scout-specific task context without advertising dataset fields as template variables (prevents {data}/{issue_message} placeholders appearing in candidate prompts)
  - invoke_agent returns JSON {"comment": str, "escalated": bool}, where escalation is read from the simulator's post-run label state: the true signal from apply_label(), not comment text matching
- optimisation/metrics.py: all metric functions, separated to keep the main script clean
  - flag_only_metric: binary escalation flag correctness only, no LLM judge call. Baseline 0.4 → 1.0 (+150%)
  - triage_accuracy: flag accuracy (0.6 weight) + reply quality LLM judge (0.4 weight). Judge uses claude-haiku-4-5 for cost efficiency. Baseline 0.36 → 0.924 (+157%)
  - Other metrics (escalation_accuracy, answer_relevance, scout_quality) kept for future phases
  - Helpers: _parse_output for JSON output parsing, _expected_escalation for dataset field access

Phase roadmap
- Phase 1: flag_only_metric
- Phase 2: triage_accuracy (flag + reply judge)
- Phase 3: triage_accuracy
- Phase 4: MultiMetricObjective (accuracy + cost + latency)

Key design decisions
- Escalation is detected via sim.get_issue_data(target)["labels"] post-run rather than by scanning comment text: apply_label() writes to the simulator directly and may not mention the tag in the comment
- enable_context=False + prompt_overrides is preferred over task_context_columns: the latter is undocumented in the public API

Test plan
- Run optimisation/run_prompt_optimisation.py and verify baseline + candidate scores are non-zero
- Confirm no {data} or {issue_message} placeholders appear in generated candidate prompts

🤖 Generated with Claude Code
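The placeholder check in the test plan could be automated with a small scan over each candidate prompt. The sketch below is an assumption about how that might be done, not part of the PR: find_leaked_placeholders is a hypothetical helper, and only the two field names come from the description.

```python
import re

# Dataset field names that must not appear as template variables (from the PR).
FORBIDDEN_FIELDS = ("data", "issue_message")


def find_leaked_placeholders(prompt_text, fields=FORBIDDEN_FIELDS):
    """Return the sorted forbidden field names that appear as {field}."""
    alternatives = "|".join(re.escape(name) for name in fields)
    pattern = re.compile(r"\{\s*(" + alternatives + r")\s*\}")
    return sorted(set(pattern.findall(prompt_text)))


def assert_no_placeholders(candidate_prompts):
    """Raise AssertionError naming the first candidate that leaks a field."""
    for index, prompt_text in enumerate(candidate_prompts):
        leaked = find_leaked_placeholders(prompt_text)
        assert not leaked, f"candidate {index} contains placeholders: {leaked}"
```

Calling assert_no_placeholders on the list of generated candidate prompts after a run turns the manual check into a pass/fail step.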