Skip to content

Regression: forked GRPO run can produce malformed checkpoint outputs on newer ART ref #672

@arcticfly

Description

@arcticfly

Summary

We are seeing a regression when moving from the fork-fix branch to a newer ART ref. With the same training setup and the same forked checkpoint, some runs on the newer ART ref produce malformed/repetitive generations from saved checkpoint artifacts after a short amount of GRPO training. The older branch has been clean in the same probe so far.

This does not look like an obvious failure to load the initial fork: initial eval metrics are in the same range across the two refs. The divergence appears after training continues from the forked checkpoint.

ART refs compared

  • Newer/current ref: codex/save-checkpoint-artifact at 4f972aa4328f16ad2d2b64a135e45a808c549e7c
  • Older comparison ref: fix/fork-on-pre-v5 at 6ecd16c8fcdfd2d2076eb1f0475e5b727ba4428c
  • Main at time of comparison: 48b2e5f6c384a62b44f34e1472e5fb1eeaa3474a
  • Package lock resolved both as openpipe-art==0.5.17 from the corresponding git refs.

Training setup

Sanitized setup details:

  • Base model: unsloth/Meta-Llama-3.1-8B-Instruct
  • Training starts by forking from an existing LoRA checkpoint at approximately step 686.
  • Infrastructure: SkyPilot on Kubernetes, H200:2.
  • GPU split: trainer on GPU 0, inference on GPU 1.
  • Rollout workers: 12.
  • GRPO-style training with group_size=6, batch_size=4.
  • learning_rate=1.2e-5
  • max_tokens=512
  • max_steps=716 for the comparison runs.
  • train_limit=768
  • eval_every=10, eval_samples=25
  • KL penalty enabled with kl_penalty_coef=1.0, kl_window_size=10.
  • Reward mode uses scorer-derived rewards plus a RULER judge.
  • RULER judge model: openrouter/google/gemini-2.5-flash.
  • save_checkpoint_artifact=true.

Observed behavior

After training from the same forked checkpoint and saving checkpoint artifacts, we probe the saved artifacts through W&B Inference using a small fixed set of generic transcript-cleanup prompts. The probe uses:

  • 3 generic prompts, not domain-specific.
  • n=6 completions per prompt.
  • temperature=0.7
  • max_tokens=512
  • System instruction asks the model to return only a corrected transcript wrapped in output tags.

We are intentionally not including the private prompts or outputs here. The failure mode is that the model starts producing malformed/repetitive/code-like text instead of a short cleaned transcript. In one run we saw repeated HeaderCode; in other failing outputs the literal token was not always present, but the responses were still clearly malformed or much longer than expected.

Current aggregate from the comparison probes:

  • Older ref (fix/fork-on-pre-v5): 6 runs completed/probed, 0 malformed responses out of 108 sampled completions.
  • Newer ref (codex/save-checkpoint-artifact): 4 runs completed/probed so far, 35 malformed responses out of 72 sampled completions.
  • Some newer-ref runs are clean while others fail badly, so this looks run-dependent rather than deterministic.

The initial eval score immediately after loading/forking is comparable between refs, which makes a simple fork-not-loaded explanation less likely. The issue appears after additional GRPO training and checkpoint save/reload.

Why this seems ART-related

The training config, forked checkpoint, data shape, inference probe, and infrastructure are held constant between the two refs. The older fix/fork-on-pre-v5 ref is consistently clean in the probe, while the newer ref intermittently produces corrupted checkpoint behavior.

Potential areas worth checking:

  • Changes around forked checkpoint loading/copying since fix/fork-on-pre-v5.
  • Save/reload behavior for LoRA checkpoint artifacts.
  • vLLM adapter lifecycle after checkpoint save/reload.
  • KL reference checkpoint/window handling when training from a forked checkpoint.
  • Any interaction between the dedicated trainer/inference split and adapter reloads.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions