Regression: forked GRPO run can produce malformed checkpoint outputs on newer ART ref

## Summary

We are seeing a regression when moving from the fork-fix branch to a newer ART ref. With the same training setup and the same forked checkpoint, some runs on the newer ART ref produce malformed/repetitive generations from saved checkpoint artifacts after a short amount of GRPO training. The older branch has been clean in the same probe so far.

This does not look like an obvious failure to load the initial fork: initial eval metrics are in the same range across the two refs. The divergence appears after training continues from the forked checkpoint.

## ART refs compared

- Newer/current ref: `codex/save-checkpoint-artifact` at `4f972aa4328f16ad2d2b64a135e45a808c549e7c`
- Older comparison ref: `fix/fork-on-pre-v5` at `6ecd16c8fcdfd2d2076eb1f0475e5b727ba4428c`
- Main at time of comparison: `48b2e5f6c384a62b44f34e1472e5fb1eeaa3474a`
- Package lock resolved both as `openpipe-art==0.5.17` from the corresponding git refs.

## Training setup

Sanitized setup details:

- Base model: `unsloth/Meta-Llama-3.1-8B-Instruct`
- Training starts by forking from an existing LoRA checkpoint at approximately step 686.
- Infrastructure: SkyPilot on Kubernetes, `H200:2`.
- GPU split: trainer on GPU 0, inference on GPU 1.
- Rollout workers: 12.
- GRPO-style training with `group_size=6`, `batch_size=4`.
- `learning_rate=1.2e-5`
- `max_tokens=512`
- `max_steps=716` for the comparison runs.
- `train_limit=768`
- `eval_every=10`, `eval_samples=25`
- KL penalty enabled with `kl_penalty_coef=1.0`, `kl_window_size=10`.
- Reward mode uses scorer-derived rewards plus a RULER judge.
- RULER judge model: `openrouter/google/gemini-2.5-flash`.
- `save_checkpoint_artifact=true`.
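
For reference, the settings above can be collected into one record, which also makes the length of the post-fork training window explicit. This is a minimal sketch, not ART's configuration API: the class name `ProbeRunConfig` and the `additional_steps` property are hypothetical, and the calculation assumes `max_steps` is an absolute step index rather than a count of new steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeRunConfig:  # hypothetical container, not an ART class
    base_model: str = "unsloth/Meta-Llama-3.1-8B-Instruct"
    fork_step: int = 686  # approximate, per the setup list
    group_size: int = 6
    batch_size: int = 4
    learning_rate: float = 1.2e-5
    max_tokens: int = 512
    max_steps: int = 716
    train_limit: int = 768
    eval_every: int = 10
    eval_samples: int = 25
    kl_penalty_coef: float = 1.0
    kl_window_size: int = 10
    save_checkpoint_artifact: bool = True

    @property
    def additional_steps(self) -> int:
        # Assumption: max_steps is an absolute step index, so the run
        # trains only for the gap between the fork step and max_steps.
        return self.max_steps - self.fork_step


cfg = ProbeRunConfig()
print(cfg.additional_steps)
```

Under that assumption the comparison runs train for roughly 30 steps past the fork point.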

## Observed behavior

After training from the same forked checkpoint and saving checkpoint artifacts, we probe the saved artifacts through W&B Inference using a small fixed set of generic transcript-cleanup prompts. The probe uses:

- 3 generic prompts, not domain-specific.
- `n=6` completions per prompt.
- `temperature=0.7`
- `max_tokens=512`
- System instruction asks the model to return only a corrected transcript wrapped in output tags.

We are intentionally not including the private prompts or outputs here. The failure mode is that the model starts producing malformed/repetitive/code-like text instead of a short cleaned transcript. In one run we saw repeated `HeaderCode`; in other failing outputs the literal token was not always present, but the responses were still clearly malformed or much longer than expected.
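
To make "malformed" reproducible for anyone re-running the probe, a heuristic along the following lines could be used. This is a sketch under stated assumptions and not necessarily the check behind the counts in this issue: the `<output>` tag name, the function names, and both thresholds (`max_chars`, `max_repeat`) are hypothetical and would need tuning against real outputs.

```python
import re

# Assumption: the system instruction asks for <output>...</output> tags.
OUTPUT_RE = re.compile(r"<output>(.*?)</output>", re.DOTALL)


def longest_token_run(text: str) -> int:
    """Length of the longest run of one whitespace-delimited token repeated back to back."""
    best = run = 0
    prev = None
    for tok in text.split():
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best


def is_malformed(response: str, max_chars: int = 1200, max_repeat: int = 5) -> bool:
    """Flag responses with missing tags, excessive length, or heavy repetition."""
    match = OUTPUT_RE.search(response)
    if match is None:
        return True  # no wrapped transcript at all
    if len(response) > max_chars:
        return True  # far longer than a short cleaned transcript
    return longest_token_run(match.group(1)) >= max_repeat
```

A response such as `"HeaderCode " * 50` is flagged because it has no output tags, while a short tagged transcript passes.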

Current aggregate from the comparison probes:

- Older ref (`fix/fork-on-pre-v5`): 6 runs completed/probed, 0 malformed responses out of 108 sampled completions.
- Newer ref (`codex/save-checkpoint-artifact`): 4 runs completed/probed so far, 35 malformed responses out of 72 sampled completions (about 49%).
- Some newer-ref runs are clean while others fail badly, so this looks run-dependent rather than deterministic.
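
The totals above follow from the probe shape (3 prompts with `n=6` each, so 18 completions per run). The sketch below checks that arithmetic and adds a rough significance estimate. It treats completions as independent draws, which is an assumption that overstates the evidence because failures cluster by run; the variable names are illustrative only.

```python
from math import comb

PROMPTS, N_PER_PROMPT = 3, 6
per_run = PROMPTS * N_PER_PROMPT  # completions sampled per probed run

old_runs, old_bad = 6, 0   # fix/fork-on-pre-v5
new_runs, new_bad = 4, 35  # codex/save-checkpoint-artifact

old_total = old_runs * per_run
new_total = new_runs * per_run
new_rate = new_bad / new_total

# One-sided Fisher exact test. Because the older ref has zero malformed
# responses, the observed table is already the most extreme one, so the
# tail is a single hypergeometric term: the chance that all malformed
# completions land in the newer-ref group if the refs behaved the same.
total = old_total + new_total
bad = old_bad + new_bad
p_value = comb(new_total, bad) / comb(total, bad)

print(old_total, new_total, round(new_rate, 3), p_value < 1e-6)
```

Even with the independence caveat, the split is far from what chance would produce; a per-run comparison (clean runs versus failing runs) would be the more conservative view once per-run counts are available.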

The initial eval score immediately after loading/forking is comparable between refs, which makes a simple fork-not-loaded explanation less likely. The issue appears after additional GRPO training and checkpoint save/reload.

## Why this seems ART-related

The training config, forked checkpoint, data shape, inference probe, and infrastructure are held constant between the two refs. The older `fix/fork-on-pre-v5` ref has been consistently clean in the probe, while the newer ref intermittently produces checkpoints that generate malformed output.

Potential areas worth checking:

- Changes around forked checkpoint loading/copying since `fix/fork-on-pre-v5`.
- Save/reload behavior for LoRA checkpoint artifacts.
- vLLM adapter lifecycle after checkpoint save/reload.
- KL reference checkpoint/window handling when training from a forked checkpoint.
- Any interaction between the dedicated trainer/inference split and adapter reloads.
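
One cheap way to narrow down the save/reload item is to compare the checkpoint directory written by the trainer against the downloaded artifact, file by file. The helper below is a hypothetical diagnostic, not part of ART: it assumes both copies are available as local directories and uses only the standard library to hash their contents.

```python
import hashlib
from pathlib import Path


def dir_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_checkpoints(local_dir: Path, artifact_dir: Path) -> list[str]:
    """Return files that are missing on one side or whose bytes differ."""
    local = dir_digests(local_dir)
    artifact = dir_digests(artifact_dir)
    names = local.keys() | artifact.keys()
    return sorted(n for n in names if local.get(n) != artifact.get(n))
```

An empty result would suggest the artifact upload itself is byte-identical and point attention toward adapter loading on the inference side; any listed file would point at the save path instead.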


