Context
Child of #90. Blocks on #104 (prepare.py rewrite).
autoresearch-rl/examples/security-judge/train.py is the mutable file — the LLM policy is permitted to modify it in llm_diff mode. This issue produces the initial version that supports the 6-field schema; subsequent iteration happens under autoresearch-rl's optimisation loop.
Scope
1. Reward function
Implement compute_reward(generation: str, target: dict, metrics_weights: dict) -> float with seven components:
| Component | Weight | Pass criterion |
| --- | --- | --- |
| valid_json | 0.20 | Generation parses + every required field present |
| is_threat_correct | 0.25 | gen.is_threat == target.is_threat |
| category_correct | 0.20 | gen.category == target.category |
| action_correct | 0.10 | gen.recommended_action == target.recommended_action |
| score_within_10 | 0.10 | abs(gen.security_score - target.security_score) ≤ 10 |
| confidence_calibrated | 0.10 | 1 - abs(gen.confidence - int(gen.is_threat == target.is_threat)) |
| reasoning_quality | 0.05 | Length ∈ [40, 400] + contains ≥ 1 signal word from the target.reasoning |
Weights live in eval_protocol.json (produced by #104), so changing them is a config edit — not a code edit.
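A minimal sketch of how the seven components could combine into one scalar. Several details are assumptions rather than anything this issue specifies: the six required field names are inferred from the pass criteria, a generation that fails valid_json scores 0.0 overall, confidence is taken to lie in [0, 1], and a "signal word" is read as any alphabetic token of five or more characters (the `_signal_words` helper is hypothetical).

```python
import json
import re

# Assumption: the six schema fields, inferred from the pass criteria above.
REQUIRED_FIELDS = ("is_threat", "category", "recommended_action",
                   "security_score", "confidence", "reasoning")


def _signal_words(text):
    # Assumption: a "signal word" is any alphabetic token of 5+ characters.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 5}


def compute_reward(generation: str, target: dict, metrics_weights: dict) -> float:
    # valid_json gates everything else: without a parsed object holding all
    # required fields, no other component can be evaluated (assumed behaviour).
    try:
        gen = json.loads(generation)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(gen, dict) or any(f not in gen for f in REQUIRED_FIELDS):
        return 0.0

    verdict_ok = gen["is_threat"] == target["is_threat"]
    reasoning = str(gen["reasoning"])
    try:
        score_ok = abs(float(gen["security_score"]) - float(target["security_score"])) <= 10
        # High confidence is rewarded only when the verdict is right.
        calibration = 1.0 - abs(float(gen["confidence"]) - float(verdict_ok))
    except (TypeError, ValueError):
        score_ok, calibration = False, 0.0

    components = {
        "valid_json": 1.0,
        "is_threat_correct": float(verdict_ok),
        "category_correct": float(gen["category"] == target["category"]),
        "action_correct": float(gen["recommended_action"] == target["recommended_action"]),
        "score_within_10": float(score_ok),
        "confidence_calibrated": min(1.0, max(0.0, calibration)),
        "reasoning_quality": float(
            40 <= len(reasoning) <= 400
            and bool(_signal_words(reasoning) & _signal_words(str(target["reasoning"])))
        ),
    }
    # Weights come from the caller (for example, loaded from eval_protocol.json).
    return sum(metrics_weights.get(name, 0.0) * value
               for name, value in components.items())
```

With the weights in the table, a generation that matches the target on every field and reports confidence 1.0 scores 1.0; the same generation with the wrong is_threat verdict loses both is_threat_correct and confidence_calibrated.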
2. GRPO configuration
Carry over the existing LoRA + GRPO scaffolding but tune for the larger output:
- lora_rank: the hyperparameter search now covers {8, 16, 32} (was {4, 8, 16}). The 6-field JSON output has materially more tokens to generate, so rank 4 likely underfits.
- num_generations: {3, 4} for the GRPO rollout width.
- max_new_tokens = 256 (target reasoning is ~200 chars plus structural overhead).
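Read as a grid, the search space above has six combinations. The sketch below enumerates them under the assumption that the sweep is exhaustive, which this issue does not state; `SEARCH_SPACE`, `FIXED`, and `iter_configs` are illustrative names, not part of autoresearch-rl.

```python
from itertools import product

# Values taken from the list above; the names are hypothetical.
SEARCH_SPACE = {"lora_rank": (8, 16, 32), "num_generations": (3, 4)}
FIXED = {"max_new_tokens": 256}


def iter_configs(space=SEARCH_SPACE, fixed=FIXED):
    """Yield one settings dict per point of the Cartesian product."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}
```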
3. Checkpointing
Save LoRA adapter every 25 steps. Keep the last 3 + the best-so-far (by eval_score). autoresearch-rl's iteration-resume machinery depends on this.
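The retention rule can be sketched independently of how the adapter is written to disk. This is an illustration only: `Checkpoint`, `should_save`, and `checkpoints_to_keep` are hypothetical names, and breaking eval_score ties in favour of the newest checkpoint is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    step: int
    eval_score: float


def should_save(step: int, every: int = 25) -> bool:
    """True on steps 25, 50, 75, ... (step 0 is skipped)."""
    return step > 0 and step % every == 0


def checkpoints_to_keep(checkpoints, keep_last: int = 3) -> set:
    """Return the steps to retain: the newest `keep_last` plus the best by eval_score."""
    if not checkpoints:
        return set()
    by_step = sorted(checkpoints, key=lambda c: c.step)
    keep = {c.step for c in by_step[-keep_last:]}
    # Assumption: ties on eval_score resolve to the most recent checkpoint.
    best = max(by_step, key=lambda c: (c.eval_score, c.step))
    keep.add(best.step)
    return keep
```

Any saved step outside the returned set can be deleted after each save, so at most four adapters are on disk at a time.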
4. Hybrid-mode guidance
Expose key knobs (reward weights, temperature, LoRA rank, num_generations) via argparse so the hybrid policy can sweep them in param mode before falling back to code diffs. Document each flag in the --help output.
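One possible shape for the command-line surface, using only the standard argparse module. The flag names, the defaults for --lora-rank, --num-generations, and --temperature, and the NAME=VALUE syntax for weight overrides are all assumptions made for illustration; the issue only names the knobs.

```python
import argparse


def _weight_override(text):
    """Parse 'NAME=VALUE' into (name, float) for --reward-weight."""
    name, sep, value = text.partition("=")
    if not sep or not name:
        raise argparse.ArgumentTypeError("expected NAME=VALUE")
    try:
        return name, float(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"weight for {name!r} is not a number") from None


def build_parser():
    p = argparse.ArgumentParser(description="Train the security-judge LoRA adapter with GRPO.")
    p.add_argument("--lora-rank", type=int, choices=[8, 16, 32], default=16,
                   help="LoRA rank; higher ranks give more capacity for the 6-field output.")
    p.add_argument("--num-generations", type=int, choices=[3, 4], default=4,
                   help="GRPO rollout width (completions sampled per prompt).")
    p.add_argument("--temperature", type=float, default=1.0,
                   help="Sampling temperature for rollouts.")
    p.add_argument("--max-new-tokens", type=int, default=256,
                   help="Generation budget per completion.")
    p.add_argument("--reward-weight", action="append", default=[], type=_weight_override,
                   metavar="NAME=VALUE",
                   help="Override one reward component weight; may be repeated.")
    return p
```

Because every flag carries a help string, `--help` documents the full set of knobs without extra work.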
5. Metrics output
Print one line at end of training:
eval_score=0.712 json_compliance=0.982 is_threat_acc=0.891 category_acc=0.780 action_acc=0.854 score_mae=8.2 reasoning_len_med=203 brier=0.112
autoresearch-rl parses this line to feed the optimisation loop.
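A sketch of emitting that line and reading it back. The per-metric precision is inferred from the sample line above, and `parse_metrics_line` only illustrates the key=value shape; how autoresearch-rl actually parses the line is not specified here.

```python
# Format specs inferred from the sample line; the names are hypothetical.
METRIC_FORMATS = {
    "eval_score": ".3f", "json_compliance": ".3f", "is_threat_acc": ".3f",
    "category_acc": ".3f", "action_acc": ".3f", "score_mae": ".1f",
    "reasoning_len_med": ".0f", "brier": ".3f",
}


def format_metrics_line(metrics: dict) -> str:
    """Render all metrics as space-separated key=value pairs on one line."""
    return " ".join(f"{name}={metrics[name]:{spec}}"
                    for name, spec in METRIC_FORMATS.items())


def parse_metrics_line(line: str) -> dict:
    """Inverse of format_metrics_line: map each key to a float."""
    return {name: float(value)
            for name, value in (token.split("=", 1) for token in line.split())}
```

Printing the result of `format_metrics_line` as the last line of stdout keeps the contract to a single, greppable line.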
Acceptance
- Tests exercise compute_reward with hand-crafted generations covering each component.
Notes
- This is the file the autoresearch LLM policy will modify during llm_diff mode. Keep it readable and comment each reward component with intent — the LLM proposes diffs better when the code self-documents its invariants.