feat(judge-ft): rewrite train.py with 7-component reward + 6-field GRPO #105

## Context

Child of #90. Blocked by #104 (`prepare.py` rewrite).

`autoresearch-rl/examples/security-judge/train.py` is the mutable file — the LLM policy is permitted to modify it in llm_diff mode. This issue produces the initial version that supports the 6-field schema; subsequent iteration happens under autoresearch-rl's optimisation loop.

## Scope

### 1. Reward function

Implement `compute_reward(generation: str, target: dict, metrics_weights: dict) -> float` with seven components:

| Component | Weight | Pass criterion |
|---|---:|---|
| `valid_json` | 0.20 | Generation parses + every required field present |
| `is_threat_correct` | 0.25 | `gen.is_threat == target.is_threat` |
| `category_correct` | 0.20 | `gen.category == target.category` |
| `action_correct` | 0.10 | `gen.recommended_action == target.recommended_action` |
| `score_within_10` | 0.10 | `abs(gen.security_score - target.security_score) ≤ 10` |
| `confidence_calibrated` | 0.10 | Graded rather than pass/fail: `1 - abs(gen.confidence - int(gen.is_threat == target.is_threat))` |
| `reasoning_quality` | 0.05 | `gen.reasoning` length ∈ [40, 400] chars and contains ≥ 1 signal word from `target.reasoning` |

Weights live in `eval_protocol.json` (produced by #104), so changing them is a config edit — not a code edit.
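A minimal sketch of how the table could map onto code, to make the intent of each component concrete. Several things here are assumptions, not decisions made in this issue: the six field names in `REQUIRED_FIELDS` are inferred from the table, `reward_components` and `_signal_words` are hypothetical helpers, the "signal word" heuristic (any shared word of five or more letters) is a placeholder, and a generation that fails `valid_json` is scored zero on every component.

```python
import json
import re

# Assumed names for the 6-field schema, inferred from the reward table.
REQUIRED_FIELDS = ("is_threat", "category", "recommended_action",
                   "security_score", "confidence", "reasoning")
COMPONENTS = ("valid_json", "is_threat_correct", "category_correct", "action_correct",
              "score_within_10", "confidence_calibrated", "reasoning_quality")


def _signal_words(text: str) -> set[str]:
    # Placeholder heuristic: lower-cased words of 5+ letters count as signal words.
    return set(re.findall(r"[a-z]{5,}", text.lower()))


def _as_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def reward_components(generation: str, target: dict) -> dict[str, float]:
    """Return every component score in [0, 1]; all zeros if the JSON is invalid."""
    scores = dict.fromkeys(COMPONENTS, 0.0)
    try:
        gen = json.loads(generation)
    except (json.JSONDecodeError, TypeError):
        return scores
    if not isinstance(gen, dict) or any(f not in gen for f in REQUIRED_FIELDS):
        return scores
    scores["valid_json"] = 1.0

    threat_ok = gen["is_threat"] == target["is_threat"]
    scores["is_threat_correct"] = float(threat_ok)
    scores["category_correct"] = float(gen["category"] == target["category"])
    scores["action_correct"] = float(
        gen["recommended_action"] == target["recommended_action"])

    score = _as_float(gen["security_score"])
    if score is not None:
        scores["score_within_10"] = float(abs(score - target["security_score"]) <= 10)

    # Graded component: high confidence is rewarded only when is_threat is right.
    conf = _as_float(gen["confidence"])
    if conf is not None:
        scores["confidence_calibrated"] = max(0.0, 1.0 - abs(conf - int(threat_ok)))

    reasoning = gen["reasoning"]
    if isinstance(reasoning, str) and 40 <= len(reasoning) <= 400:
        if _signal_words(reasoning) & _signal_words(target["reasoning"]):
            scores["reasoning_quality"] = 1.0
    return scores


def compute_reward(generation: str, target: dict, metrics_weights: dict) -> float:
    """Weighted sum of the seven components; weights come from eval_protocol.json."""
    parts = reward_components(generation, target)
    return sum(metrics_weights[name] * value for name, value in parts.items())
```

Returning the per-component dict separately from the weighted sum also makes it straightforward to log components individually per batch, as the acceptance criteria require.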

### 2. GRPO configuration

Carry over the existing LoRA + GRPO scaffolding but tune for the larger output:

- The `lora_rank` hyperparameter is now searched over `{8, 16, 32}` (was `{4, 8, 16}`). The 6-field JSON has materially more tokens to generate, so rank 4 likely underfits.
- `num_generations` is searched over `{3, 4}` (GRPO rollout width).
- `max_new_tokens` = 256 (target reasoning is ~200 chars + structural overhead).
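For illustration, the search space above could be written down as plain data. The values come from the bullets; the dict layout, the `iter_configs` helper, and the idea of enumerating a full grid are assumptions made for this sketch, since the actual sweep is driven by autoresearch-rl.

```python
from itertools import product

# Values from the bullets above; the structure itself is hypothetical.
SEARCH_SPACE = {
    "lora_rank": [8, 16, 32],
    "num_generations": [3, 4],
}
FIXED = {"max_new_tokens": 256}


def iter_configs(space: dict, fixed: dict):
    """Yield one flat config dict per point in the Cartesian grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}


configs = list(iter_configs(SEARCH_SPACE, FIXED))
print(len(configs))  # 3 ranks x 2 rollout widths = 6 configurations
```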

### 3. Checkpointing

Save LoRA adapter every 25 steps. Keep the last 3 + the best-so-far (by `eval_score`). autoresearch-rl's iteration-resume machinery depends on this.
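The retention rule can be sketched as a pure function, which keeps it easy to unit test independently of the training loop. This assumes an `eval_score` is recorded for each saved checkpoint; `should_save` and `checkpoints_to_keep` are hypothetical names, and the tie-breaking rule (prefer the later step) is an arbitrary choice.

```python
def should_save(step: int, every: int = 25) -> bool:
    """True on steps 25, 50, 75, ... (step 0 is skipped)."""
    return step > 0 and step % every == 0


def checkpoints_to_keep(history: list[tuple[int, float]], keep_last: int = 3) -> set[int]:
    """history holds (step, eval_score) per saved checkpoint; returns steps to retain."""
    if not history:
        return set()
    steps = sorted(step for step, _ in history)
    keep = set(steps[-keep_last:])
    # Best-so-far by eval_score; ties go to the later step.
    best_step, _ = max(history, key=lambda item: (item[1], item[0]))
    keep.add(best_step)
    return keep


history = [(25, 0.60), (50, 0.71), (75, 0.65), (100, 0.66), (125, 0.64)]
print(sorted(checkpoints_to_keep(history)))  # [50, 75, 100, 125]
```

Any checkpoint directory whose step is not in the returned set can be deleted after each save.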

### 4. Hybrid-mode guidance

Expose key knobs (reward weights, temperature, LoRA rank, num_generations) via argparse so the hybrid policy can sweep them in param mode before falling back to code diffs. Document each flag in the `--help` output.
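One possible shape for the CLI, using only the standard library. The flag names, defaults, and the `name=value,name=value` syntax for weight overrides are assumptions for this sketch; the issue only fixes which knobs must be exposed.

```python
import argparse


def parse_weights(text: str) -> dict[str, float]:
    """Parse 'valid_json=0.2,category_correct=0.2' into a dict of floats."""
    weights = {}
    for pair in text.split(","):
        name, sep, value = pair.partition("=")
        try:
            if not sep:
                raise ValueError
            weights[name.strip()] = float(value)
        except ValueError:
            raise argparse.ArgumentTypeError(f"expected name=value, got {pair!r}")
    return weights


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Security-judge GRPO fine-tuning (illustrative flags).")
    parser.add_argument("--lora-rank", type=int, default=16, choices=[8, 16, 32],
                        help="LoRA adapter rank.")
    parser.add_argument("--num-generations", type=int, default=4, choices=[3, 4],
                        help="GRPO rollout width (completions sampled per prompt).")
    parser.add_argument("--temperature", type=float, default=1.0,
                        help="Sampling temperature for rollouts.")
    parser.add_argument("--reward-weights", type=parse_weights, default=None,
                        help="Comma-separated name=value overrides for the reward "
                             "weights loaded from eval_protocol.json.")
    return parser


args = build_parser().parse_args(
    ["--lora-rank", "32", "--reward-weights", "valid_json=0.3,reasoning_quality=0.0"])
print(args.lora_rank, args.reward_weights)
```

Keeping `--reward-weights` as overrides on top of `eval_protocol.json` preserves the config file as the default source of truth.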

### 5. Metrics output

Print one line at end of training:

```
eval_score=0.712 json_compliance=0.982 is_threat_acc=0.891 category_acc=0.780 action_acc=0.854 score_mae=8.2 reasoning_len_med=203 brier=0.112
```

autoresearch-rl parses this line to feed the optimisation loop.
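A sketch of emitting the line and round-tripping it through a parser. The key order and decimal precision mirror the example above; the regex is an assumption for local testing only and is not the pattern used by autoresearch-rl's objective parser, which should be checked separately.

```python
import re

METRIC_KEYS = ("eval_score", "json_compliance", "is_threat_acc", "category_acc",
               "action_acc", "score_mae", "reasoning_len_med", "brier")
# Per-key format specs chosen to reproduce the example line; default is 3 decimals.
FORMATS = {"score_mae": ".1f", "reasoning_len_med": ".0f"}
PAIR_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")  # hypothetical parser pattern


def format_metrics(metrics: dict) -> str:
    """Render the space-separated key=value line printed at end of training."""
    return " ".join(
        f"{key}={format(metrics[key], FORMATS.get(key, '.3f'))}" for key in METRIC_KEYS)


def parse_metrics(line: str) -> dict[str, float]:
    """Inverse of format_metrics, up to rounding."""
    return {key: float(value) for key, value in PAIR_RE.findall(line)}


metrics = {"eval_score": 0.712, "json_compliance": 0.982, "is_threat_acc": 0.891,
           "category_acc": 0.78, "action_acc": 0.854, "score_mae": 8.2,
           "reasoning_len_med": 203, "brier": 0.112}
line = format_metrics(metrics)
print(line)
```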

## Acceptance

- [ ] First training run completes on a CPU-or-small-GPU dev machine within 5 minutes using a tiny subset (sanity check).
- [ ] Reward components logged individually per batch (for debugging reward hacking).
- [ ] Metrics line format matches the autoresearch-rl JSON regex in its objective parser.
- [ ] Unit tests for `compute_reward` with hand-crafted generations covering each component.

## Notes

- This is the file the autoresearch LLM policy will modify during llm_diff mode. Keep it readable and comment each reward component with intent — the LLM proposes diffs better when the code self-documents its invariants.
