Context
Child of #90. Blocked on #104 (prepare.py), so the policy guidance can reference the right metrics.
program.md is the task guidance the autoresearch-rl LLM policy reads to decide what hyperparameters to try and what code diffs to propose. config.yaml is the autoresearch-rl experiment config.
Scope
1. program.md rewrite
Update autoresearch-rl/examples/security-judge/program.md to:
- State the new objective: maximise eval_score (composite of 7 components; weights in eval_protocol.json).
- List every component with its weight, so the LLM knows how to trade them off.
- Provide code-diff guidance per component, for example:
  - JSON compliance tanking → tighten response_format prompt, lower temperature, add format-penalty reward
  - is_threat accuracy lagging category accuracy → emphasise binary reward component
  - Reasoning length runaway → length-penalty past 400 chars
The existing program.md has a good skeleton; the substance needs rewriting for 6-field.
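To make the guidance above concrete, here is a minimal sketch of two of the ideas: a weighted composite score and a length penalty past 400 chars. Everything here is an assumption for illustration. The function names, the per_char and cap values, and the weighted-average form of the composite are hypothetical; the real component names and weights live in eval_protocol.json and prepare.py.

```python
from typing import Mapping


def composite_score(components: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of component scores (assumed form of eval_score).

    `weights` would be loaded from eval_protocol.json; none are hard-coded here.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Components missing from `components` count as 0.
    weighted = sum(w * components.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


def length_penalty(reasoning: str, limit: int = 400,
                   per_char: float = 0.001, cap: float = 0.5) -> float:
    """Non-positive reward term that grows with characters past `limit`.

    per_char and cap are hypothetical tuning knobs, not values from the repo.
    """
    excess = max(0, len(reasoning) - limit)
    if excess == 0:
        return 0.0
    return -min(cap, excess * per_char)
```

A reasoning string at or under 400 chars incurs no penalty; beyond that the penalty grows linearly until it reaches the cap, so a single runaway response cannot dominate the reward.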
2. config.yaml updates
policy.params:
  learning_rate: [5e-5, 1e-4, 3e-4]
  max_steps: [50, 100, 150]  # 6-field needs more steps; was [30, 50, 80]
  num_generations: [3, 4]
  temperature: [0.6, 0.8]  # lower default; helps schema compliance
  lora_rank: [8, 16, 32]  # higher baseline; was [4, 8, 16]
target.basilica.setup_cmd: ensure openai (or compatible HTTP client) is installed if the reasoning distillation script needs to be re-run on the runner.
objective.metric: still eval_score, direction max. Unchanged.
stop.no_improve_streak: raise to 20 (was 15) — 6-field is harder, needs more exploration.
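For a sense of scale, the sketch below enumerates the proposed grid and shows one plausible reading of the no-improvement streak for a max-direction metric. It is a standalone illustration, not autoresearch-rl code: the helper names are hypothetical, and the real policy may sample the space rather than walk it exhaustively, and may count the streak differently.

```python
from itertools import product

# Values copied from the proposed policy.params above.
PARAM_GRID = {
    "learning_rate": [5e-5, 1e-4, 3e-4],
    "max_steps": [50, 100, 150],
    "num_generations": [3, 4],
    "temperature": [0.6, 0.8],
    "lora_rank": [8, 16, 32],
}


def enumerate_grid(grid):
    """Return every combination of the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def no_improve_streak(scores):
    """Count trailing trials that did not beat the best earlier score (direction: max)."""
    best = float("-inf")
    streak = 0
    for score in scores:
        if score > best:
            best = score
            streak = 0
        else:
            streak += 1
    return streak


def should_stop(scores, limit=20):
    """True once the streak reaches the configured limit (assumed semantics)."""
    return no_improve_streak(scores) >= limit


if __name__ == "__main__":
    print(len(enumerate_grid(PARAM_GRID)))  # 3 * 3 * 2 * 2 * 3 = 108 combinations
```

With 108 combinations in the space, a streak limit of 20 still lets the run stop long before the grid is exhausted if eval_score plateaus.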
Acceptance
- program.md explicitly lists all 7 reward components with their weights.
- config.yaml param grid updated.
- autoresearch-rl run config.yaml --dry-run (or equivalent) reports the new search space correctly.