Short answer: the run is technically working, but optimization quality is weak and expensive right now.
- Setup is stable (model loads, LoRA trainable params ~1.72%, steps progressing).
- Reward briefly improves early (-0.44 -> -0.04), so learning signal exists.
- Reward is still mostly negative, including eval (`eval_reward` ~ -0.29, then -0.27), and unstable.
- `completions/clipped_ratio` is very high (~0.8–0.9), which means generations hit max length constantly (wasted compute + noisy reward).
- `max_prompt_length` is ignored (`[setup] ignoring unsupported GRPOConfig args: max_prompt_length`), so long prompts are likely unbounded.
- `clip_ratio/*` is all zero in every log, suggesting policy updates may be weak/uninformative.
- Training is slow (~10.5s/step, eval ~14 min) because rollout settings are heavy.
- Fix reward heuristic bug: remove `"i can"` from refusal markers (it penalizes benign helpful outputs).
- Reduce rollout cost:
  - `num_generations`: 8 -> 2
  - `max_completion_length`: 384/192 -> 96 (or 128)
- Lower LR for stability: 5e-6 -> 1e-6 or 2e-6
- Enforce prompt truncation in preprocessing (since the config arg is ignored) by truncating the tokenized prompt before training.
- Run shorter debug runs first (`max_steps` 200-400) and track:
  - malicious refusal rate
  - benign helpfulness rate
  - eval reward trend
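
To make the refusal-marker fix and the two tracked rates concrete, here is a minimal sketch. The marker list, the function names (`is_refusal`, `debug_rates`), and the rule that a benign completion counts as helpful when it is not a refusal are all assumptions for illustration; your script's actual markers and reward logic are not shown in the logs, so adapt accordingly.

```python
# Hypothetical marker list: the real script's markers are not visible in the log.
# The point is that the bare substring "i can" is gone, since it also matches
# helpful replies such as "I can help with that".
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable to")


def is_refusal(completion: str) -> bool:
    """Return True if the completion contains any refusal marker."""
    # Lowercase and normalize curly apostrophes so "I can’t" is also caught.
    text = completion.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def debug_rates(samples):
    """Compute (malicious refusal rate, benign helpfulness rate).

    samples: iterable of (completion, is_malicious_prompt) pairs.
    Assumption: a benign completion is "helpful" if it is not a refusal.
    """
    samples = list(samples)
    mal = [is_refusal(c) for c, malicious in samples if malicious]
    ben = [not is_refusal(c) for c, malicious in samples if not malicious]
    mal_rate = sum(mal) / len(mal) if mal else float("nan")
    ben_rate = sum(ben) / len(ben) if ben else float("nan")
    return mal_rate, ben_rate
```

Logging both rates on a fixed eval subset every N steps during the short debug runs should show whether the policy is learning to refuse selectively or just refusing everything.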
Patch the training script with these changes first, then run a short controlled experiment to verify reward trend and latency before full training.
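
As a rough sketch of the preprocessing patch and the rollout savings: the helper below truncates a prompt at the token level, and a second helper gives an upper bound on generated tokens per prompt. The names `truncate_prompt`, `rollout_tokens_per_prompt`, the 512-token default, and the choice to keep the tail of the prompt are assumptions, not something taken from your script. `WhitespaceTokenizer` is only a toy stand-in so the example runs on its own; with a Hugging Face tokenizer the same `encode(..., add_special_tokens=False)` / `decode(...)` calls should apply, though decoding may not reproduce the original text byte for byte.

```python
class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer; only here to make the sketch runnable."""

    def encode(self, text, add_special_tokens=False):
        return text.split()

    def decode(self, ids):
        return " ".join(ids)


def truncate_prompt(tokenizer, prompt: str, max_prompt_tokens: int = 512) -> str:
    """Cap the prompt at max_prompt_tokens tokens, keeping the tail.

    Assumption: the end of the prompt (the latest instruction) matters most.
    Slice ids[:max_prompt_tokens] instead if the head should be kept.
    """
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(ids) <= max_prompt_tokens:
        return prompt  # already short enough, leave untouched
    return tokenizer.decode(ids[-max_prompt_tokens:])


def rollout_tokens_per_prompt(num_generations: int, max_completion_length: int) -> int:
    """Upper bound on tokens generated per prompt in one rollout."""
    return num_generations * max_completion_length


before = rollout_tokens_per_prompt(8, 384)  # 3072
after = rollout_tokens_per_prompt(2, 96)    # 192
print(f"generation budget per prompt: {before} -> {after} ({before // after}x smaller)")
```

Since the clipped ratio is around 0.8–0.9, most completions currently run to the cap, so this upper bound is probably close to the real generation cost. If the prompts are chat-style message lists rather than plain strings, the truncation would need to happen after the chat template is applied.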