ops(judge-ft): execute training campaign + converge to acceptance gates

## Context

Child of #90. Blocked on #104, #105, #106.

This issue owns the actual training execution: spending GPU time, iterating on autoresearch-rl's hybrid policy, and getting the model to clear the acceptance gates. It is the most open-ended of the children because completion is defined by metrics rather than by a fixed task list: we're done when the numbers say so.

## Scope

### 1. Run setup

- Target: Basilica GPU cloud (A100 or H100; config already present in `examples/security-judge/config.yaml`).
- Runtime budget: at most 12 hours of GPU time per run before escalating to hyperparameter review.
- Experiment name: `security-judge-6field-v1` (increment per major retry).

### 2. Hybrid-mode progression

autoresearch-rl's hybrid policy does param search until `no_improve_streak` hits `stall_threshold`, then switches to code diffs. Expected progression:

- **Iter 0–5**: random param draws to seed the history.
- **Iter 5–20**: LLM-guided param tuning.
- **Iter 20+**: if param mode stalls, LLM proposes diffs to `train.py` reward weights and generation logic.
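
The switch described above can be sketched as follows. This is a minimal illustration, not autoresearch-rl's actual implementation: the `HybridState` class, its `record` and `next_mode` methods, the mode names, the warm-up length of 5, and the assumption that a higher score is better are all hypothetical. Only `no_improve_streak` and `stall_threshold` come from the description above.

```python
from dataclasses import dataclass


@dataclass
class HybridState:
    """Hypothetical tracker for the param-to-diff switch (not the real autoresearch-rl API)."""

    stall_threshold: int  # non-improving iterations tolerated in param mode
    warmup_iters: int = 5  # assumed number of random draws that seed the history
    best_score: float = float("-inf")
    no_improve_streak: int = 0
    iteration: int = 0

    def record(self, score: float) -> None:
        """Update state after one finished iteration (assumes higher is better)."""
        if score > self.best_score:
            self.best_score = score
            self.no_improve_streak = 0
        elif self.iteration >= self.warmup_iters:
            # Assumption: random warm-up draws do not count towards the stall streak.
            self.no_improve_streak += 1
        self.iteration += 1

    def next_mode(self) -> str:
        """Return the proposal mode for the upcoming iteration."""
        if self.iteration < self.warmup_iters:
            return "random_params"
        if self.no_improve_streak >= self.stall_threshold:
            return "code_diff"
        return "llm_params"
```

In this sketch an improvement found in diff mode resets the streak and returns the policy to param tuning. Whether the real policy switches back is not stated in this issue.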

### 3. Monitoring

Check-in cadence:

- Live: autoresearch-rl dashboard (iteration history + metrics per run).
- Daily: summary post in team channel — best score so far, current params, what the LLM is trying.
- On regression: flag any component that regresses by more than 10% between iterations in the issue thread.
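
The regression check could be automated along these lines. This is a sketch under stated assumptions: the function name `regressed_components` is hypothetical, the 10% is read as a relative drop (the issue does not say relative or absolute), and every component is assumed to be higher-is-better, so a lower-is-better metric such as `brier_score` would need its sign flipped first.

```python
from __future__ import annotations


def regressed_components(
    prev: dict[str, float],
    curr: dict[str, float],
    threshold: float = 0.10,
) -> dict[str, float]:
    """Return components whose score fell by more than `threshold` (relative).

    Keys are component names; values are the relative drop since `prev`.
    Assumes higher is better for every component passed in.
    """
    flagged: dict[str, float] = {}
    for name, before in prev.items():
        after = curr.get(name)
        if after is None or before <= 0:
            continue  # a relative drop is undefined here, so skip the component
        drop = (before - after) / before
        if drop > threshold:
            flagged[name] = drop
    return flagged
```

Any non-empty result would be the trigger for a comment in the issue thread.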

### 4. Stop conditions

Stop training (claim success) when **all** of the following hold:

- `eval_score ≥ 0.80` on held-out val set.
- `json_compliance ≥ 0.98`.
- `is_threat_acc ≥ 0.87` (within 5 points of DeBERTa-protectai-v2 on the same data).
- `brier_score ≤ 0.15` (confidence calibrated).
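
The four gates can be checked mechanically. The sketch below copies the thresholds from the list above; the `GATES` table and the `failed_gates` and `brier_score` helpers are hypothetical names, and the Brier score is assumed to be the standard binary form (mean squared error between the predicted threat probability and the 0/1 label), which this issue does not spell out.

```python
import operator

# Thresholds copied from the success conditions above.
GATES = {
    "eval_score": (operator.ge, 0.80),
    "json_compliance": (operator.ge, 0.98),
    "is_threat_acc": (operator.ge, 0.87),
    "brier_score": (operator.le, 0.15),  # lower is better
}


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def failed_gates(metrics):
    """Return names of gates that are missing or unmet; an empty list means success."""
    return [
        name
        for name, (compare, limit) in GATES.items()
        if name not in metrics or not compare(metrics[name], limit)
    ]
```

Returning the failing gate names, rather than a single boolean, makes it clear which component is holding the run back.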

Or stop and rescope when **any** of the following holds:

- Three consecutive 12-hour GPU runs show no improvement in best score.
- Total GPU cost exceeds $200 (approx; check with finance).
- A fundamental reward-hacking pathology is identified (the model emits valid JSON but is clearly gaming one component).
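
The rescope triggers can be expressed the same way. This is a rough sketch: `rescope_reasons` and its parameters are hypothetical, the reward-hacking flag is assumed to be set by a human reviewer, and "no improvement" is read as a run whose best score does not exceed the best score of all earlier runs.

```python
def rescope_reasons(
    run_best_scores,
    total_cost_usd,
    reward_hacking_found,
    max_stalled_runs=3,
    budget_usd=200.0,
):
    """Return the list of triggered rescope conditions (empty means keep going).

    run_best_scores: best eval score of each completed 12-hour run, in order.
    """
    # Count trailing runs that failed to beat the best score seen before them.
    best = float("-inf")
    stalled = 0
    for score in run_best_scores:
        if score > best:
            best = score
            stalled = 0
        else:
            stalled += 1

    reasons = []
    if stalled >= max_stalled_runs:
        reasons.append("stalled_runs")
    if total_cost_usd > budget_usd:
        reasons.append("budget_exceeded")
    if reward_hacking_found:
        reasons.append("reward_hacking")
    return reasons
```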

### 5. Failure branch

If a stop-and-rescope condition triggers, choose one of:

- (a) Change the base model to Qwen2.5-1.5B or 3B.
- (b) Change the teacher (regenerate reasoning with Claude-Haiku).
- (c) Add curriculum learning (train on `is_threat` only first, then add category, and so on).

Each rescope gets its own sub-issue.

## Acceptance

- [ ] Acceptance gates met on val set.
- [ ] Best checkpoint archived at `autoresearch-rl/artifacts/security-judge-6field-v1/versions/best/` with weights + training config + metrics.
- [ ] Run log summarised in a `TRAINING_LOG.md` committed under the same directory.
- [ ] All seven reward components broken down in the final report (no hidden single-component failures).

## Reference issue for rescope options

To be opened at the time of the first stop-and-rescope.
