ops(judge-ft): execute training campaign + converge to acceptance gates #107

@epappas

Context

Child of #90. Blocks on #104, #105, #106.

This issue owns the actual training execution — spending GPU time, iterating on autoresearch-rl's hybrid policy, getting the model to clear the acceptance gates. It's the most open-ended of the children because "we're done when the numbers say so".

Scope

1. Run setup

  • Target: Basilica GPU cloud (A100 or H100; config already present in examples/security-judge/config.yaml).
  • Runtime budget: at most 12 hours of GPU time per run before escalating to a hyperparameter review.
  • Experiment name: security-judge-6field-v1 (increment per major retry).
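The "increment per major retry" rule for the experiment name can be sketched as a small helper. This is an illustration only: `next_experiment_name` is a hypothetical function, not part of autoresearch-rl, and it assumes the name always ends in a `-vN` suffix.

```python
import re


def next_experiment_name(name: str) -> str:
    """Return the name with its trailing -vN suffix incremented by one.

    Hypothetical helper; assumes names look like 'security-judge-6field-v1'.
    """
    match = re.fullmatch(r"(.+-v)(\d+)", name)
    if match is None:
        raise ValueError(f"experiment name has no -vN suffix: {name!r}")
    prefix, version = match.groups()
    return f"{prefix}{int(version) + 1}"
```

For example, `next_experiment_name("security-judge-6field-v1")` gives the name to use for the first major retry.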

2. Hybrid-mode progression

autoresearch-rl's hybrid policy does param search until no_improve_streak hits stall_threshold, then switches to code diffs. Expected progression:

  • Iter 0–5: random param draws to seed the history.
  • Iter 5–20: LLM-guided param tuning.
  • Iter 20+: if param mode stalls, LLM proposes diffs to train.py reward weights and generation logic.
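The switching rule described above can be sketched roughly as follows. This is not autoresearch-rl's actual implementation: the `HybridState` class, the mode labels, and the `seed_iters` default are assumptions made for illustration. Only `no_improve_streak` and `stall_threshold` are names taken from this issue, and the sketch assumes a higher score is better.

```python
from dataclasses import dataclass


@dataclass
class HybridState:
    """Tracks the best score and how long it has gone without improving."""

    stall_threshold: int  # assumed to come from the run config
    best_score: float = float("-inf")
    no_improve_streak: int = 0

    def record(self, score: float) -> None:
        """Update the streak after an iteration finishes."""
        if score > self.best_score:
            self.best_score = score
            self.no_improve_streak = 0
        else:
            self.no_improve_streak += 1

    def mode(self, iteration: int, seed_iters: int = 5) -> str:
        """Pick the proposal mode for the next iteration."""
        if iteration < seed_iters:
            return "random_params"  # seed the history first
        if self.no_improve_streak >= self.stall_threshold:
            return "code_diff"  # param search has stalled
        return "llm_params"
```

Whether the real policy ever switches back from code diffs to param search is not stated here, so the sketch leaves that open: the mode is recomputed from the current streak on every call.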

3. Monitoring

Check-in cadence:

  • Live: autoresearch-rl dashboard (iteration history + metrics per run).
  • Daily: summary post in team channel — best score so far, current params, what the LLM is trying.
  • Per regression: if any reward component regresses by more than 10% between iterations, flag it in the issue thread.
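A check for the 10% regression rule might look like the sketch below. It assumes the threshold is a relative drop and that every component passed in is higher-is-better; the issue does not say whether 10% is relative or absolute, so adjust if the team means percentage points. `regressed_components` is a hypothetical helper name.

```python
from typing import Dict, Mapping


def regressed_components(
    previous: Mapping[str, float],
    current: Mapping[str, float],
    threshold: float = 0.10,
) -> Dict[str, float]:
    """Return components whose relative drop since the last iteration exceeds threshold.

    Assumes higher is better for every component; lower-is-better metrics
    such as brier_score should be excluded or inverted by the caller.
    """
    flagged = {}
    for name, prev in previous.items():
        if name not in current or prev <= 0:
            continue  # cannot compute a relative drop
        drop = (prev - current[name]) / prev
        if drop > threshold:
            flagged[name] = drop
    return flagged
```

Anything returned by this function is what would get posted to the issue thread, together with the iteration numbers being compared.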

4. Stop conditions

Stop training (claim success) when all of:

  • eval_score ≥ 0.80 on held-out val set.
  • json_compliance ≥ 0.98.
  • is_threat_acc ≥ 0.87 (within 5 points of DeBERTa-protectai-v2 on the same data).
  • brier_score ≤ 0.15 (confidence calibrated).
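The four success gates can be expressed as a single check, sketched below. The metric names and thresholds are the ones listed above; the `GATES` table and the function names are assumptions for illustration. The Brier score is computed with its standard binary definition (mean squared difference between predicted probability and the 0/1 label), which may differ in detail from how the eval harness computes it.

```python
from typing import List, Mapping, Sequence

# (comparison, threshold) for each gate listed in this issue.
GATES = {
    "eval_score": (">=", 0.80),
    "json_compliance": (">=", 0.98),
    "is_threat_acc": (">=", 0.87),
    "brier_score": ("<=", 0.15),
}


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def failed_gates(metrics: Mapping[str, float]) -> List[str]:
    """Return the names of gates that are not met; an empty list means success."""
    failed = []
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            failed.append(name)
    return failed
```

Returning the list of failed gates, rather than a single boolean, keeps the daily summary informative about which gate is still blocking.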

Or: stop and re-scope when any of:

  • Three consecutive 12-hour GPU runs show no improvement in best score.
  • Total GPU cost exceeds $200 (approx; check with finance).
  • A fundamental reward-hacking pathology is identified (the model emits valid JSON but is clearly gaming one component).
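The stop-and-rescope conditions can be sketched in the same style. The function names and reason labels here are hypothetical. The sketch assumes one best score is recorded per 12-hour run in chronological order, and that the reward-hacking flag is set by a human reviewer rather than detected automatically.

```python
from typing import List, Sequence


def stalled_runs(run_best_scores: Sequence[float]) -> int:
    """Count trailing runs that failed to beat the best score seen before them."""
    best = float("-inf")
    streak = 0
    for score in run_best_scores:
        if score > best:
            best = score
            streak = 0
        else:
            streak += 1
    return streak


def rescope_reasons(
    run_best_scores: Sequence[float],
    total_cost_usd: float,
    reward_hacking_found: bool,
    max_stalled_runs: int = 3,
    budget_usd: float = 200.0,
) -> List[str]:
    """Return every rescope condition that currently holds (empty means keep going)."""
    reasons = []
    if stalled_runs(run_best_scores) >= max_stalled_runs:
        reasons.append("stalled")
    if total_cost_usd > budget_usd:
        reasons.append("over_budget")
    if reward_hacking_found:
        reasons.append("reward_hacking")
    return reasons
```

Because the $200 figure is approximate, `budget_usd` is a parameter and should be set to whatever finance confirms.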

5. Failure branch

If the stop-and-rescope branch triggers, decide between (a) changing the base model to Qwen2.5-1.5B or 3B, (b) changing the teacher (regenerate reasoning with Claude-Haiku), and (c) adding curriculum learning (train on is_threat only first, then add category, etc.). Each rescope is its own sub-issue.

Acceptance

  • Acceptance gates met on val set.
  • Best checkpoint archived at autoresearch-rl/artifacts/security-judge-6field-v1/versions/best/ with weights + training config + metrics.
  • Run log summarised in a TRAINING_LOG.md committed under the same directory.
  • All seven reward components broken down in the final report (no hidden single-component failures).

Reference issue for rescope options

To be opened at the time of the first stop-and-rescope.
