
feature(zsh): migrate URSA-MATH stage3 training to LightRFT#53

Draft
HansBug wants to merge 6 commits into opendilab:main from HansBug:dev/math_prm_train

Conversation


@HansBug HansBug commented Mar 18, 2026

Summary

This PR migrates the URSA-MATH Stage 3 training path into LightRFT under the current frozen Docker baseline, and now also trims the example directory down to the URSA-MATH Stage 3 surface instead of keeping older unrelated example baggage.

Current high-level state:

  • the LightRFT Stage 3 training chain is functionally working end to end
  • local hf rollout for URSA is working and has standalone proofs / regression coverage
  • PS-GRPO reward semantics, answer extraction, and launcher alignment are implemented
  • Phase 7 observation has been restored to a healthy format/stability pass
  • the main remaining engineering issue is rollout performance, not basic chain correctness

Working notes that still exist during the migration:

  • plan/MATH_PRM.md
  • plan/URSA_ROLLOUT_ENGINE_FAILURE_ANALYSIS.md
  • plan/PHASE7_FORMAT_STABILITY_ANALYSIS.md
  • plan/PHASE7_HF_ROLLOUT_PERFORMANCE_ANALYSIS.md

Status Map

This section is the current project map.

Phase 1: Data Path / Schema / Scope

Status: done

Brief:

  • the raw URSA Stage 3 data path has been converted into the LightRFT-facing prompt / images / reference / label manifest path
  • the full converted MMathCoT-1M manifest is intentionally used first
  • the paper's later filtering pipeline is still deferred to later phases

Checklist:

  • freeze the current Docker/runtime baseline instead of solving integration by dependency drift
  • convert raw URSA schema into LightRFT schema
  • verify dataset/image loading on the converted manifest
  • confirm PromptDatasetVL can consume the manifest
  • confirm dataloader outputs can reach reward-model inputs correctly
  • keep the paper filtering pipeline out of the critical path for now
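The schema conversion in the checklist above can be sketched roughly as below. All field names here are illustrative assumptions; the real mapping lives in examples/math_prm/tools/prepare_ursa_stage3_manifest.py.

```python
import json

def convert_record(raw):
    """Hypothetical sketch of one URSA -> LightRFT manifest record.

    Source field names ("question", "image_paths", "answer") are assumptions;
    the target keys mirror the prompt / images / reference / label manifest
    path described in the Phase 1 brief.
    """
    return {
        "prompt": raw["question"],           # assumed source field
        "images": raw.get("image_paths", []),
        "reference": raw.get("answer", ""),
        "label": "math_prm",                 # reward label consumed downstream
    }

def convert_manifest(src_path, dst_path):
    # JSONL in, JSONL out (container format assumed)
    with open(src_path) as fin, open(dst_path, "w") as fout:
        for line in fin:
            fout.write(json.dumps(convert_record(json.loads(line))) + "\n")
```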

Phase 2: URSA / PRM Alignment

Status: done

Brief:

  • URSA actor / PRM loading, processor usage, multimodal reward-model inputs, step markers, and score aggregation semantics are aligned and smoke-verified

Checklist:

  • load URSA-8B with explicit UrsaProcessor.from_pretrained(...)
  • force URSA-RM-8B to stay on the direct HF path
  • pass real images into MathPRMReward
  • preserve URSA step-marker / image-padding / step-logit semantics
  • add URSA runtime compatibility fixes under the current Docker baseline
  • add unit tests and a real smoke alignment check
  • verify sample-level alignment against the reference implementation
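The step-logit-to-score mapping preserved in the checklist above can be illustrated with a minimal sketch. This is purely an assumption about the shape of the semantics (one positive-class logit per step marker, squashed to a [0, 1] score); the authoritative behavior is whatever the reference implementation does.

```python
import math

def step_scores_from_logits(marker_logits):
    """Hypothetical sketch: map one logit per step marker to a [0, 1]
    step score via a sigmoid. The real URSA-RM-8B aggregation may differ."""
    return [1.0 / (1.0 + math.exp(-z)) for z in marker_logits]
```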

Phase 3: Full-Data Baseline math_prm Training Chain

Status: done

Brief:

  • the full RL chain now reaches dataloader -> rollout -> reward -> PPO train -> checkpoint / trajectory save -> cleanup
  • the major stopping / long-tail corruption issue has been repaired through multiple smoke rounds
  • local hf rollout is now the stable engineering path for URSA under the frozen runtime
  • the remaining open issue is not basic Phase 3 wiring anymore

Checklist:

  • keep Phase 3 reward as baseline math_prm = min(step_scores)
  • exclude the unrelated global <think>-style format reward from the effective math_prm path
  • run time-boxed Phase 3 smoke jobs with explicit cleanup and GPU release checks
  • make the smoke reach dataloader -> rollout -> reward -> PPO train -> cleanup
  • repair the major stopping / long-tail corruption behavior across three smoke rounds
  • bring rollout response length back from the pathological ~943-946 regime to a reasonable smoke regime
  • add a standalone local-HF rollout proof script for URSA
  • close the Phase 3 engineering chain as working
  • do not yet treat model-quality / correctness as solved
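The Phase 3 baseline reward named in the checklist is simple enough to state directly; the empty-list fallback below is an assumption added for illustration.

```python
def math_prm_baseline_reward(step_scores):
    """Phase 3 baseline as described: math_prm = min(step_scores)."""
    if not step_scores:
        return 0.0  # assumed fallback for an empty score list
    return min(step_scores)
```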

Phase 4: PS-GRPO Reward Semantics

Status: done

Brief:

  • the reward path has been upgraded from the Phase 3 baseline min(step_scores) to the paper-aligned PS-GRPO-style reward path
  • math_prm is preserved as the baseline label and math_psgrpo is introduced as the Stage 3 reward path

Checklist:

  • introduce a distinct Stage 3 reward path math_psgrpo
  • keep math_prm reserved as the Phase 3 baseline reward
  • collect complete step_scores
  • implement relative-drop calculation and rho = 0.3 drop-moment detection
  • implement final-answer extraction and reference normalization
  • implement correctness judgement
  • implement the gamma = 0.5 reward mapping
  • verify that reward outcomes match the paper cases: 1.0 / 0.5 / 0.0
  • log step_scores, max_relative_drop, has_drop_moment, outcome_correct, and final_reward
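The checklist above pins down rho = 0.3, gamma = 0.5, the logged fields, and the 1.0 / 0.5 / 0.0 outcomes; a sketch tying those together might look like the following. The exact relative-drop formula is an assumption, not taken from the source.

```python
RHO = 0.3    # drop-moment detection threshold (from the checklist)
GAMMA = 0.5  # reward mapping constant (from the checklist)

def psgrpo_reward(step_scores, outcome_correct):
    """Hedged sketch of a PS-GRPO-style reward mapping.

    Only rho, gamma, the logged field names, and the 1.0 / 0.5 / 0.0
    outcomes come from the PR text; the drop formula is illustrative.
    """
    drops = [
        (prev - cur) / prev
        for prev, cur in zip(step_scores, step_scores[1:])
        if prev > 0
    ]
    max_relative_drop = max(drops, default=0.0)
    has_drop_moment = max_relative_drop >= RHO
    if outcome_correct:
        final_reward = GAMMA if has_drop_moment else 1.0
    else:
        final_reward = 0.0
    return {
        "max_relative_drop": max_relative_drop,
        "has_drop_moment": has_drop_moment,
        "outcome_correct": outcome_correct,
        "final_reward": final_reward,
    }
```

Under this sketch, a correct answer with no sharp step-score drop earns 1.0, a correct answer reached despite a drop moment earns gamma = 0.5, and an incorrect answer earns 0.0, matching the three paper cases.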

Phase 5: Answer Extraction / Correctness Alignment

Status: done

Brief:

  • answer extraction and correctness alignment are now handled explicitly instead of relying on loose heuristic extraction from intermediate reasoning text

Checklist:

  • define answer-judgement strategy by problem type: multiple-choice / numeric / formula / text / missing reference
  • reuse mathruler where appropriate
  • ensure intermediate reasoning steps are not mistaken for final answers
  • define fallback behavior when †Answer: is missing
  • define fallback behavior for empty / malformed / unsupported references
  • add regression checks for the controlled fallback behavior
  • keep the resulting metrics visible in the real reward aggregation path
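The marker-based extraction with controlled fallback described above can be sketched like this. Taking the last †Answer: occurrence is an assumption made to keep intermediate reasoning from being mistaken for the final answer; the real per-problem-type strategies live in the reward-model utilities.

```python
import re

_ANSWER_RE = re.compile(r"†Answer:\s*(.+)")

def extract_final_answer(text):
    """Hedged sketch of †Answer: extraction with a controlled fallback."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None  # controlled fallback when the marker is missing
    return matches[-1].strip()  # last marker wins (assumption)
```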

Phase 6: Training Script / Hyperparameter Alignment

Status: done

Brief:

  • the Stage 3 launcher defaults are now aligned to the current target path and the script includes explicit preflight checks for the expected model paths, manifest path, reward label, Docker baseline, and batch divisibility

Checklist:

  • switch the formal Stage 3 reward label to the real Phase 4 reward path
  • keep all model paths and dataset paths explicit and non-placeholder
  • remove script options that conflict with PRM direct-HF usage
  • document the frozen Docker baseline as a hard constraint
  • audit the script against the paper table and record the differences
  • move toward the paper targets for n_samples_per_prompt, temperature, init_kl_coef, actor_learning_rate, prompt_max_len, and generate_max_len
  • verify the current train_batch_size = 512 implementation strategy under 8 GPUs
  • document the effective gradient-accumulation layout in the launcher itself
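The batch-divisibility preflight and the train_batch_size = 512 under 8 GPUs layout mentioned above reduce to a small arithmetic check; the micro-batch size below is a hypothetical value for illustration, not the launcher's actual setting.

```python
def check_batch_layout(train_batch_size=512, num_gpus=8, micro_batch_size=8):
    """Hypothetical preflight sketch for the batch-divisibility check.

    micro_batch_size is illustrative; the real launcher defines its own
    values and documents the effective gradient-accumulation layout.
    """
    if train_batch_size % num_gpus != 0:
        raise ValueError("train_batch_size must divide evenly across GPUs")
    per_gpu = train_batch_size // num_gpus
    if per_gpu % micro_batch_size != 0:
        raise ValueError("per-GPU batch must divide into micro batches")
    return {
        "per_gpu": per_gpu,
        "grad_accum_steps": per_gpu // micro_batch_size,
    }
```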

Phase 7: Full-Data Training Observation / Stability Validation

Status: done with follow-up

Brief:

  • Phase 7 observation now produces a real bounded observation result again
  • the earlier format-stability failure has been repaired and the latest observation returns healthy_pass = true
  • the remaining follow-up is rollout performance, not format collapse or missing trajectories

Checklist:

  • measure reward distribution statistics
  • measure correctness ratio
  • measure drop-moment hit ratio
  • measure final-answer extraction failure ratio
  • measure image-read failure ratio
  • measure PRM inference failure ratio
  • inspect KL stability
  • inspect actor-loss / policy-gradient stability
  • sample generated texts for sustained Step N: / †Answer: compliance
  • produce a concrete next-fix list from the observation results
  • restore healthy_pass = true for the observation health gate
  • bring rollout speed closer to the standalone URSA baseline
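The ratio metrics in the checklist above can be computed from the per-sample fields that Phase 4 already logs. The sample-dict structure is an assumption; the real analyzer is examples/math_prm/tools/analyze_phase7_observation.py.

```python
def observation_summary(samples):
    """Sketch of a few of the per-run ratio metrics listed above.

    Each sample is assumed to carry the logged per-sample fields
    (outcome_correct, has_drop_moment, final_answer).
    """
    n = len(samples)
    if n == 0:
        return {}
    return {
        "correctness_ratio": sum(s["outcome_correct"] for s in samples) / n,
        "drop_moment_hit_ratio": sum(s["has_drop_moment"] for s in samples) / n,
        "extraction_failure_ratio": sum(s["final_answer"] is None for s in samples) / n,
    }
```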

Phase 8: Paper Data Filtering Pipeline

Status: planned

Brief:

  • the paper-style 20K candidate -> 8 samples -> remove all-correct/all-wrong -> 15.3K pipeline is still intentionally deferred until the chain, reward path, and observation loop are stable

Checklist:

  • add an offline data-preparation script instead of hiding filtering inside the training script
  • sample 20K candidates from full MMathCoT-1M
  • sample 8 outputs per prompt with URSA-8B
  • score correctness for each sample
  • remove prompts that are all-correct or all-wrong across the 8 samples
  • produce the filtered Stage 3 dataset at roughly 15.3K
  • keep the filtered dataset in the same LightRFT-facing manifest schema
  • verify the filtered dataset is still directly consumable by PromptDatasetVL
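The core filtering step in the pipeline above is a straightforward predicate over the 8 per-prompt correctness flags; the input structure here (prompt id -> list of booleans) is an assumption for illustration.

```python
def filter_all_correct_all_wrong(scored_prompts):
    """Sketch of the paper-style filtering step described above.

    scored_prompts maps prompt_id -> list of per-sample correctness flags
    (structure assumed); prompts that are all-correct or all-wrong across
    their 8 samples are dropped.
    """
    return {
        pid: flags
        for pid, flags in scored_prompts.items()
        if any(flags) and not all(flags)
    }
```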

Phase 9: Reproduction Close-Out

Status: planned

Brief:

  • final consolidation after filtered-data training is stable

Checklist:

  • summarize remaining gaps against the paper
  • separate engineering compromises from unfinished work
  • refresh the docs for reward / data / script / hyperparameter status
  • document the minimal reproduction flow
  • organize three run modes: smoke / full-data / filtered-data
  • recommend the final default script and dataset entry point

Detailed Updates Since The Earlier PR State

The earlier PR body was effectively frozen around "Phase 4 ready to start". That is no longer accurate.

What has been completed since then:

  • Phase 4 PS-GRPO reward semantics
  • Phase 5 answer extraction / correctness alignment
  • Phase 6 launcher / hyperparameter alignment and preflight checks
  • Phase 7 observation repair back to healthy_pass = true
  • rollout performance diagnosis with both direct URSA baselines and rollout-like probe scripts
  • example-directory cleanup so the math_prm folder is now centered on URSA-MATH Stage 3 rather than older unrelated example baggage

Rollout / Observation State

Local HF rollout

A standalone validation script now exists at:

  • examples/math_prm/tools/check_hf_rollout.py

This script proves that the local LightRFT hf rollout path for URSA is actually working by:

  • loading URSA-8B
  • calling setup_inference_engine(engine_type="hf")
  • running a real gather_and_generate()
  • comparing the rollout outputs against direct actor.generate() token by token
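The token-by-token comparison at the heart of that proof can be sketched as follows; returning the first mismatch index is an illustrative choice, and the real check lives in examples/math_prm/tools/check_hf_rollout.py.

```python
def first_token_mismatch(rollout_ids, direct_ids):
    """Sketch of a token-by-token comparison: returns the index of the
    first differing token, or None if the sequences match exactly."""
    for i, (a, b) in enumerate(zip(rollout_ids, direct_ids)):
        if a != b:
            return i
    if len(rollout_ids) != len(direct_ids):
        # one sequence is a strict prefix of the other
        return min(len(rollout_ids), len(direct_ids))
    return None
```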

Phase 7 health state

The repaired observation run now reports:

  • healthy_pass = true
  • format_success_ratio = 1.0

At the same time, it still shows that model quality is not solved yet, for example:

  • correctness_ratio = 0.25

So the current interpretation is:

  • functional chain health is back
  • answer quality is still limited
  • performance is still the main engineering follow-up

Rollout performance diagnosis

The current performance story is now much clearer.

Direct standalone URSA generation is not the source of the pathological slowdown. Probe results on rollout-like workloads show:

  • fsdp_train_gc = 683.406s
  • fsdp_train_no_gc = 68.869s
  • fsdp_eval_no_gc = 65.816s
  • raw_eval_no_gc = 44.139s

This makes the current diagnosis much more concrete:

  • the main slowdown is not "URSA is naturally slow"
  • the dominant issue is rollout decode under the training-style FSDP + gradient_checkpointing configuration
  • the key remaining performance task is to close that gap as much as possible under the current LightRFT constraints
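The attribution follows directly from the probe timings quoted above: gradient checkpointing accounts for roughly a 10x slowdown on rollout-like decode, while FSDP wrapping alone costs only about 1.5x over raw eval.

```python
# Overhead attribution computed from the probe timings quoted above.
timings = {
    "fsdp_train_gc": 683.406,
    "fsdp_train_no_gc": 68.869,
    "fsdp_eval_no_gc": 65.816,
    "raw_eval_no_gc": 44.139,
}

# Gradient checkpointing dominates on rollout-like decode.
gc_factor = timings["fsdp_train_gc"] / timings["fsdp_train_no_gc"]
# FSDP wrapping alone, versus raw eval.
fsdp_factor = timings["fsdp_eval_no_gc"] / timings["raw_eval_no_gc"]
```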

Example Directory Cleanup

This branch now also cleans up examples/math_prm/ itself.

Current layout intent:

  • top-level files are only the active training surface and self-contained URSA runtime pieces
  • support scripts, smoke tools, observation tools, and regression checks live under examples/math_prm/tools/
  • reward_models.py and reward_models_utils.py are now math-only and trimmed to the URSA-MATH Stage 3 path
  • URSA_MIGRATION.md and plan/* are explicitly treated as temporary working docs to delete after the migration is closed out

Key Files

Main work in this branch now spans:

  • examples/math_prm/train_colocate.py
  • examples/math_prm/run_grpo_math_prm_ursa_8b.sh
  • examples/math_prm/reward_models.py
  • examples/math_prm/reward_models_utils.py
  • examples/math_prm/ursa_actor.py
  • examples/math_prm/sitecustomize.py
  • examples/math_prm/ursa_model/
  • examples/math_prm/tools/prepare_ursa_stage3_manifest.py
  • examples/math_prm/tools/check_phase2_alignment.py
  • examples/math_prm/tools/check_hf_rollout.py
  • examples/math_prm/tools/check_phase6_script_alignment.py
  • examples/math_prm/tools/test_phase2_alignment.py
  • examples/math_prm/tools/run_phase3_smoke.sh
  • examples/math_prm/tools/run_phase7_observation.sh
  • examples/math_prm/tools/analyze_phase7_observation.py
  • examples/math_prm/tools/probe_rollout_speed_candidates.py
  • lightrft/strategy/strategy_base.py
  • lightrft/trainer/fast_exp_maker.py
  • lightrft/models/actor_language.py
  • lightrft/models/actor_vl.py
  • lightrft/utils/math_prm_output.py
  • plan/MATH_PRM.md
  • plan/URSA_ROLLOUT_ENGINE_FAILURE_ANALYSIS.md
  • plan/PHASE7_FORMAT_STABILITY_ANALYSIS.md
  • plan/PHASE7_HF_ROLLOUT_PERFORMANCE_ANALYSIS.md

Testing

Commands already run in this branch include:

python -m unittest -q examples.math_prm.tools.test_phase2_alignment
python examples/math_prm/tools/check_phase2_alignment.py --device cuda:0
python examples/math_prm/tools/check_phase6_script_alignment.py
python examples/math_prm/tools/check_hf_rollout.py --output-json /data/LightRFT/tmp/ursa_stage3/hf_rollout_check.json
bash examples/math_prm/tools/run_phase3_smoke.sh
bash examples/math_prm/tools/run_phase7_observation.sh
bash -n examples/math_prm/run_grpo_math_prm_ursa_8b.sh

Current testing conclusion:

  • Phase 2 regression / alignment tests passed
  • Phase 2 real smoke alignment matched the reference sample
  • Phase 3 smoke reaches PPO training and cleans up correctly
  • local hf rollout has a standalone minimal proof script and currently passes
  • Phase 4 / 5 reward-path regressions are covered in the current math-only regression suite
  • Phase 6 launcher alignment check passes
  • Phase 7 observation health has been restored to a passing state
  • rollout performance is not yet normalized to the standalone URSA expectation

Review Framing

The most accurate review framing at this point is:

  • Phase 1 done
  • Phase 2 done
  • Phase 3 done as a working engineering baseline
  • Phase 4 done
  • Phase 5 done
  • Phase 6 done
  • Phase 7 done with a remaining performance follow-up
  • Phase 8 and Phase 9 not started yet

So the dominant open question is no longer whether URSA-MATH Stage 3 can run in LightRFT at all.
The dominant open question is how much of the remaining performance gap can be closed while staying within the current LightRFT / frozen-runtime constraints.

@HansBug HansBug marked this pull request as draft March 18, 2026 07:34
@HansBug HansBug force-pushed the dev/math_prm_train branch from f567589 to 9a7fec2 on March 18, 2026 14:11
- keep the URSA-MATH stage3 training path and required runtime wiring
- retain the bilingual README files while limiting them to the minimal upstream surface
- leave validation, profiling, migration notes, and local planning artifacts on the full working branch
@HansBug HansBug force-pushed the dev/math_prm_train branch from 10c5f3f to fec2744 on March 20, 2026 11:22
@puyuan1996 puyuan1996 added the documentation and enhancement labels Mar 20, 2026
HansBug added 2 commits March 21, 2026 02:01
Selectively sync the effective Stage 3 rollout changes from dev/math_prm_train_working into the upstream PR branch.

- add the separate local HF rollout actor option to the PR-surface strategy path
- carry over the current launcher and train_colocate updates needed for the rollout path
- keep working-only docs, plans, tmp files, and auxiliary scripts out of dev/math_prm_train
@puyuan1996 puyuan1996 changed the title feature(math_prm): migrate URSA-MATH stage3 training to LightRFT feature(zsh): migrate URSA-MATH stage3 training to LightRFT Mar 24, 2026
HansBug added 3 commits March 26, 2026 14:57
sync the current stage3 runtime-eval path from dev/math_prm_train_working into the slim PR branch while keeping the documented PR surface consistent.

- add the example-local math_prm trainer wrapper required by train_colocate.py
- carry over runtime eval, separate HF rollout, and related strategy/cli updates
- trim README references so the slim branch no longer points at non-migrated helper docs and scripts
clean existing trailing whitespace in the slim math_prm branch so branch-level diff --check passes after the sync.

- strip trailing spaces from train_colocate and the URSA model files already carried by dev/math_prm_train
- keep the change whitespace-only with no behavior updates
Sync the separate local HF rollout actor refresh fix from dev/math_prm_train_working without bringing plan materials into the PR branch.

- explicitly reload the keep-on-gpu rollout actor after copying updated actor weights
- preserve the rollout sync timing fields for debugging
- source change corresponds to working branch commit 8c77921