feat(judge-ft): rewrite train.py with 7-component reward + 6-field GRPO #105

@epappas

Context

Child of #90. Blocked by #104 (prepare.py rewrite).

autoresearch-rl/examples/security-judge/train.py is the mutable file — the LLM policy is permitted to modify it in llm_diff mode. This issue produces the initial version that supports the 6-field schema; subsequent iteration happens under autoresearch-rl's optimisation loop.

Scope

1. Reward function

Implement compute_reward(generation: str, target: dict, metrics_weights: dict) -> float with seven components:

| Component | Weight | Criterion |
| --- | --- | --- |
| valid_json | 0.20 | Generation parses as JSON and every required field is present |
| is_threat_correct | 0.25 | gen.is_threat == target.is_threat |
| category_correct | 0.20 | gen.category == target.category |
| action_correct | 0.10 | gen.recommended_action == target.recommended_action |
| score_within_10 | 0.10 | abs(gen.security_score - target.security_score) ≤ 10 |
| confidence_calibrated | 0.10 | Continuous score rather than pass/fail: 1 - abs(gen.confidence - int(gen.is_threat == target.is_threat)) |
| reasoning_quality | 0.05 | Length ∈ [40, 400] and contains ≥ 1 signal word from target.reasoning |

Weights live in eval_protocol.json (produced by #104), so changing them is a config edit — not a code edit.
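
A minimal sketch of how the seven components could be computed, offered as a starting point rather than the final train.py. Several details are assumptions not stated in this issue: the six required field names are inferred from the table, confidence is taken to lie in [0, 1] and is clamped, length is measured in characters, a generation that fails valid_json scores 0 on every component, and "signal word" is approximated by the hypothetical helper `_signal_words` (alphabetic words of 5+ characters). `reward_components` is also a hypothetical helper, split out so that per-component values can be logged.

```python
import json
import math
import re

# Assumed 6-field schema, inferred from the reward table.
REQUIRED_FIELDS = ("is_threat", "category", "recommended_action",
                   "security_score", "confidence", "reasoning")
COMPONENTS = ("valid_json", "is_threat_correct", "category_correct",
              "action_correct", "score_within_10",
              "confidence_calibrated", "reasoning_quality")


def _signal_words(text):
    # Hypothetical heuristic: lowercase alphabetic words of 5+ characters.
    return set(re.findall(r"[a-z]{5,}", text.lower()))


def _as_number(value):
    # Return a finite float, or None when the value is not a usable number.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    try:
        number = float(value)
    except OverflowError:
        return None
    return number if math.isfinite(number) else None


def reward_components(generation: str, target: dict) -> dict:
    """Return every component's unweighted score in [0, 1]."""
    scores = dict.fromkeys(COMPONENTS, 0.0)
    try:
        gen = json.loads(generation)
    except (json.JSONDecodeError, TypeError):
        return scores
    if not isinstance(gen, dict) or any(f not in gen for f in REQUIRED_FIELDS):
        return scores
    # Intent: structural compliance gates every other component.
    scores["valid_json"] = 1.0
    threat_ok = gen["is_threat"] == target["is_threat"]
    scores["is_threat_correct"] = float(threat_ok)
    scores["category_correct"] = float(gen["category"] == target["category"])
    scores["action_correct"] = float(
        gen["recommended_action"] == target["recommended_action"])
    score = _as_number(gen["security_score"])
    target_score = _as_number(target["security_score"])
    if score is not None and target_score is not None:
        scores["score_within_10"] = float(abs(score - target_score) <= 10)
    conf = _as_number(gen["confidence"])
    if conf is not None:
        # Intent: high confidence pays off only when is_threat is right.
        conf = min(max(conf, 0.0), 1.0)
        scores["confidence_calibrated"] = 1.0 - abs(conf - float(threat_ok))
    reasoning = gen["reasoning"]
    if isinstance(reasoning, str) and 40 <= len(reasoning) <= 400:
        if _signal_words(reasoning) & _signal_words(target["reasoning"]):
            scores["reasoning_quality"] = 1.0
    return scores


def compute_reward(generation: str, target: dict, metrics_weights: dict) -> float:
    """Weighted sum of the component scores."""
    scores = reward_components(generation, target)
    return sum(metrics_weights[name] * value for name, value in scores.items())
```

With the weights from the table, a generation that passes every check with confidence 1.0 scores 1.0, and unparseable output scores 0.0.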

2. GRPO configuration

Carry over the existing LoRA + GRPO scaffolding but tune for the larger output:

  • lora_rank is now searched over {8, 16, 32} (was {4, 8, 16}). The 6-field JSON output has materially more tokens to generate, so rank 4 likely underfits.
  • num_generations is searched over {3, 4} (GRPO rollout width).
  • max_new_tokens = 256 (target reasoning is ~200 chars + structural overhead).
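
To make the size of the resulting sweep concrete, the values above can be enumerated as a plain grid. This is only an illustration: how autoresearch-rl actually declares its search space is not specified in this issue, and the names `SEARCH_SPACE`, `FIXED`, and `grid` are hypothetical.

```python
from itertools import product

# Values taken from the bullets above; the dict layout is an assumption.
SEARCH_SPACE = {
    "lora_rank": [8, 16, 32],
    "num_generations": [3, 4],
}
FIXED = {"max_new_tokens": 256}


def grid(search_space, fixed):
    """Yield one config dict per combination of searched values."""
    keys = list(search_space)
    for values in product(*(search_space[key] for key in keys)):
        yield {**fixed, **dict(zip(keys, values))}


configs = list(grid(SEARCH_SPACE, FIXED))
```

Three ranks times two rollout widths gives six candidate configurations, each with max_new_tokens fixed at 256.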

3. Checkpointing

Save the LoRA adapter every 25 steps. Keep the last 3 checkpoints plus the best so far (by eval_score). autoresearch-rl's iteration-resume machinery depends on this.
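
One way to express the retention rule is a pure function over the saved steps, sketched below. It assumes an eval_score is recorded for every saved checkpoint and that ties for best go to the most recent step; neither is stated in this issue. The function names are hypothetical, and the actual file deletion is left out.

```python
def checkpoints_to_keep(eval_scores, keep_last=3):
    """Return the steps to retain.

    eval_scores maps training step -> eval_score for each saved checkpoint.
    """
    if not eval_scores:
        return set()
    steps = sorted(eval_scores)
    keep = set(steps[-keep_last:])  # the most recent checkpoints
    # Best so far by eval_score; ties resolved toward the latest step.
    best = max(steps, key=lambda step: (eval_scores[step], step))
    keep.add(best)
    return keep


def checkpoints_to_prune(eval_scores, keep_last=3):
    """Return the steps whose adapters can be deleted, in ascending order."""
    keep = checkpoints_to_keep(eval_scores, keep_last)
    return sorted(step for step in eval_scores if step not in keep)
```

When the best checkpoint is already among the last three, only three adapters remain on disk; otherwise four do.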

4. Hybrid-mode guidance

Expose key knobs (reward weights, temperature, LoRA rank, num_generations) via argparse so the hybrid policy can sweep them in param mode before falling back to code diffs. Document each flag in the --help output.
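
A sketch of what the argparse surface might look like. The flag names, defaults, and the `name=value` override syntax for reward weights are all assumptions made for illustration; the issue only names which knobs to expose.

```python
import argparse


def parse_weights(text):
    """Parse 'name=value,name=value' into a dict of floats."""
    weights = {}
    for item in text.split(","):
        name, sep, value = item.partition("=")
        if not sep or not name.strip():
            raise argparse.ArgumentTypeError(f"expected name=value, got {item!r}")
        try:
            weights[name.strip()] = float(value)
        except ValueError:
            raise argparse.ArgumentTypeError(f"weight for {name!r} is not a number")
    return weights


def build_parser():
    parser = argparse.ArgumentParser(
        description="Train the security judge with LoRA + GRPO.")
    parser.add_argument("--lora-rank", type=int, choices=[8, 16, 32], default=16,
                        help="LoRA adapter rank (hypothetical default: 16).")
    parser.add_argument("--num-generations", type=int, choices=[3, 4], default=4,
                        help="GRPO rollouts sampled per prompt.")
    parser.add_argument("--temperature", type=float, default=1.0,
                        help="Sampling temperature for rollouts.")
    parser.add_argument("--reward-weights", type=parse_weights, default={},
                        help="Overrides for weights in eval_protocol.json, "
                             "e.g. valid_json=0.3,category_correct=0.1.")
    return parser
```

For example, `build_parser().parse_args(["--lora-rank", "32", "--reward-weights", "valid_json=0.3"])` yields `lora_rank == 32` and `reward_weights == {"valid_json": 0.3}`.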

5. Metrics output

Print one line at the end of training:

eval_score=0.712 json_compliance=0.982 is_threat_acc=0.891 category_acc=0.780 action_acc=0.854 score_mae=8.2 reasoning_len_med=203 brier=0.112

autoresearch-rl parses this line to feed the optimisation loop.
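
A small sketch of producing the line in a fixed key order, with a round-trip parser for checking it locally. The precision per key is inferred from the example line above, and the regex is only illustrative; it is not autoresearch-rl's objective parser, whose pattern is not shown in this issue.

```python
import re

# Key order copied from the example metrics line.
METRIC_KEYS = ("eval_score", "json_compliance", "is_threat_acc", "category_acc",
               "action_acc", "score_mae", "reasoning_len_med", "brier")
# Precision inferred from the example; every other key uses three decimals.
_FORMATS = {"score_mae": "{:.1f}", "reasoning_len_med": "{:.0f}"}


def format_metrics(metrics):
    """Render the metrics dict as a single space-separated key=value line."""
    return " ".join(
        f"{key}=" + _FORMATS.get(key, "{:.3f}").format(metrics[key])
        for key in METRIC_KEYS
    )


def parse_metrics(line):
    """Illustrative inverse of format_metrics, for local round-trip checks."""
    return {key: float(value)
            for key, value in re.findall(r"(\w+)=(-?\d+(?:\.\d+)?)", line)}


if __name__ == "__main__":
    example = {"eval_score": 0.712, "json_compliance": 0.982,
               "is_threat_acc": 0.891, "category_acc": 0.78,
               "action_acc": 0.854, "score_mae": 8.2,
               "reasoning_len_med": 203, "brier": 0.112}
    print(format_metrics(example), flush=True)
```

Fixing the key order and precision in one place keeps the output stable when the LLM policy edits the surrounding training code.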

Acceptance

  • First training run completes on a CPU-or-small-GPU dev machine within 5 minutes using a tiny subset (sanity check).
  • Reward components logged individually per batch (for debugging reward hacking).
  • Metrics line format matches the autoresearch-rl JSON regex in its objective parser.
  • Unit tests for compute_reward with hand-crafted generations covering each component.

Notes

  • This is the file the autoresearch LLM policy will modify during llm_diff mode. Keep it readable and comment each reward component with intent — the LLM proposes diffs better when the code self-documents its invariants.
