
feature(zsh): migrate URSA-MATH stage3 training to LightRFT#53

Draft
HansBug wants to merge 6 commits into opendilab:main from HansBug:dev/math_prm_train

Conversation


@HansBug HansBug commented Mar 18, 2026

Summary

This PR migrates the URSA-MATH Stage 3 training path into LightRFT under the current frozen Docker baseline, and now also trims the example directory down to the URSA-MATH Stage 3 surface instead of keeping older unrelated example baggage.

Current high-level state:

  • the LightRFT Stage 3 training chain is functionally working end to end
  • local hf rollout for URSA is working and has standalone proofs / regression coverage
  • PS-GRPO reward semantics, answer extraction, and launcher alignment are implemented
  • Phase 7 observation has been restored to a healthy format/stability pass
  • the main remaining engineering issue is rollout performance, not basic chain correctness

Working notes that still exist during the migration:

  • plan/MATH_PRM.md
  • plan/URSA_ROLLOUT_ENGINE_FAILURE_ANALYSIS.md
  • plan/PHASE7_FORMAT_STABILITY_ANALYSIS.md
  • plan/PHASE7_HF_ROLLOUT_PERFORMANCE_ANALYSIS.md

Status Map

This section is the current project map.

Phase 1: Data Path / Schema / Scope

Status: done

Brief:

  • the raw URSA Stage 3 data path has been converted into the LightRFT-facing prompt / images / reference / label manifest path
  • the full converted MMathCoT-1M manifest is intentionally used first
  • the paper's later filtering pipeline is still deferred to later phases

Checklist:

  • freeze the current Docker/runtime baseline instead of solving integration by dependency drift
  • convert raw URSA schema into LightRFT schema
  • verify dataset/image loading on the converted manifest
  • confirm PromptDatasetVL can consume the manifest
  • confirm dataloader outputs can reach reward-model inputs correctly
  • keep the paper filtering pipeline out of the critical path for now
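The schema conversion in the checklist above can be sketched roughly as below. All field names here are illustrative assumptions; the real mapping lives in examples/math_prm/tools/prepare_ursa_stage3_manifest.py.

```python
import json

def convert_record(raw):
    """Hypothetical sketch of one URSA -> LightRFT manifest record.

    Source field names ("question", "image_paths", "answer") are assumptions;
    the target keys mirror the prompt / images / reference / label manifest
    path described in the Phase 1 brief.
    """
    return {
        "prompt": raw["question"],           # assumed source field
        "images": raw.get("image_paths", []),
        "reference": raw.get("answer", ""),
        "label": "math_prm",                 # reward label consumed downstream
    }

def convert_manifest(src_path, dst_path):
    # JSONL in, JSONL out (container format assumed)
    with open(src_path) as fin, open(dst_path, "w") as fout:
        for line in fin:
            fout.write(json.dumps(convert_record(json.loads(line))) + "\n")
```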

Phase 2: URSA / PRM Alignment

Status: done

Brief:

  • URSA actor / PRM loading, processor usage, multimodal reward-model inputs, step markers, and score aggregation semantics are aligned and smoke-verified

Checklist:

  • load URSA-8B with explicit UrsaProcessor.from_pretrained(...)
  • force URSA-RM-8B to stay on the direct HF path
  • pass real images into MathPRMReward
  • preserve URSA step-marker / image-padding / step-logit semantics
  • add URSA runtime compatibility fixes under the current Docker baseline
  • add unit tests and a real smoke alignment check
  • verify sample-level alignment against the reference implementation
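The step-logit-to-score mapping preserved in the checklist above can be illustrated with a minimal sketch. This is purely an assumption about the shape of the semantics (one positive-class logit per step marker, squashed to a [0, 1] score); the authoritative behavior is whatever the reference implementation does.

```python
import math

def step_scores_from_logits(marker_logits):
    """Hypothetical sketch: map one logit per step marker to a [0, 1]
    step score via a sigmoid. The real URSA-RM-8B aggregation may differ."""
    return [1.0 / (1.0 + math.exp(-z)) for z in marker_logits]
```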

Phase 3: Full-Data Baseline math_prm Training Chain

Status: done

Brief:

  • the full RL chain now reaches dataloader -> rollout -> reward -> PPO train -> checkpoint / trajectory save -> cleanup
  • the major stopping / long-tail corruption issue has been repaired through multiple smoke rounds
  • local hf rollout is now the stable engineering path for URSA under the frozen runtime
  • the remaining open issue is not basic Phase 3 wiring anymore

Checklist:

  • keep Phase 3 reward as baseline math_prm = min(step_scores)
  • exclude the unrelated global <think>-style format reward from the effective math_prm path
  • run time-boxed Phase 3 smoke jobs with explicit cleanup and GPU release checks
  • make the smoke reach dataloader -> rollout -> reward -> PPO train -> cleanup
  • repair the major stopping / long-tail corruption behavior across three smoke rounds
  • bring rollout response length back from the pathological ~943-946 regime to a reasonable smoke regime
  • add a standalone local-HF rollout proof script for URSA
  • close the Phase 3 engineering chain as working
  • do not yet treat model-quality / correctness as solved
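The Phase 3 baseline reward named in the checklist is simple enough to state directly; the empty-list fallback below is an assumption added for illustration.

```python
def math_prm_baseline_reward(step_scores):
    """Phase 3 baseline as described: math_prm = min(step_scores)."""
    if not step_scores:
        return 0.0  # assumed fallback for an empty score list
    return min(step_scores)
```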

Phase 4: PS-GRPO Reward Semantics

Status: done

Brief:

  • the reward path has been upgraded from the Phase 3 baseline min(step_scores) to the paper-aligned PS-GRPO-style reward path
  • math_prm is preserved as the baseline label and math_psgrpo is introduced as the Stage 3 reward path

Checklist:

  • introduce a distinct Stage 3 reward path math_psgrpo
  • keep math_prm reserved as the Phase 3 baseline reward
  • collect complete step_scores
  • implement relative-drop calculation and rho = 0.3 drop-moment detection
  • implement final-answer extraction and reference normalization
  • implement correctness judgement
  • implement the gamma = 0.5 reward mapping
  • verify that reward outcomes match the paper cases: 1.0 / 0.5 / 0.0
  • log step_scores, max_relative_drop, has_drop_moment, outcome_correct, and final_reward
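The checklist above pins down rho = 0.3, gamma = 0.5, the logged fields, and the 1.0 / 0.5 / 0.0 outcomes; a sketch tying those together might look like the following. The exact relative-drop formula is an assumption, not taken from the source.

```python
RHO = 0.3    # drop-moment detection threshold (from the checklist)
GAMMA = 0.5  # reward mapping constant (from the checklist)

def psgrpo_reward(step_scores, outcome_correct):
    """Hedged sketch of a PS-GRPO-style reward mapping.

    Only rho, gamma, the logged field names, and the 1.0 / 0.5 / 0.0
    outcomes come from the PR text; the drop formula is illustrative.
    """
    drops = [
        (prev - cur) / prev
        for prev, cur in zip(step_scores, step_scores[1:])
        if prev > 0
    ]
    max_relative_drop = max(drops, default=0.0)
    has_drop_moment = max_relative_drop >= RHO
    if outcome_correct:
        final_reward = GAMMA if has_drop_moment else 1.0
    else:
        final_reward = 0.0
    return {
        "max_relative_drop": max_relative_drop,
        "has_drop_moment": has_drop_moment,
        "outcome_correct": outcome_correct,
        "final_reward": final_reward,
    }
```

Under this sketch, a correct answer with no sharp step-score drop earns 1.0, a correct answer reached despite a drop moment earns gamma = 0.5, and an incorrect answer earns 0.0, matching the three paper cases.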

Phase 5: Answer Extraction / Correctness Alignment

Status: done

Brief:

  • answer extraction and correctness alignment are now handled explicitly instead of relying on loose heuristic extraction from intermediate reasoning text

Checklist:

  • define answer-judgement strategy by problem type: multiple-choice / numeric / formula / text / missing reference
  • reuse mathruler where appropriate
  • ensure intermediate reasoning steps are not mistaken for final answers
  • define fallback behavior when †Answer: is missing
  • define fallback behavior for empty / malformed / unsupported references
  • add regression checks for the controlled fallback behavior
  • keep the resulting metrics visible in the real reward aggregation path
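The marker-based extraction with controlled fallback described above can be sketched like this. Taking the last †Answer: occurrence is an assumption made to keep intermediate reasoning from being mistaken for the final answer; the real per-problem-type strategies live in the reward-model utilities.

```python
import re

_ANSWER_RE = re.compile(r"†Answer:\s*(.+)")

def extract_final_answer(text):
    """Hedged sketch of †Answer: extraction with a controlled fallback."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None  # controlled fallback when the marker is missing
    return matches[-1].strip()  # last marker wins (assumption)
```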

Phase 6: Training Script / Hyperparameter Alignment

Status: done

Brief:

  • the Stage 3 launcher defaults are now aligned to the current target path and the script includes explicit preflight checks for the expected model paths, manifest path, reward label, Docker baseline, and batch divisibility

Checklist:

  • switch the formal Stage 3 reward label to the real Phase 4 reward path
  • keep all model paths and dataset paths explicit and non-placeholder
  • remove script options that conflict with PRM direct-HF usage
  • document the frozen Docker baseline as a hard constraint
  • audit the script against the paper table and record the differences
  • move toward the paper targets for n_samples_per_prompt, temperature, init_kl_coef, actor_learning_rate, prompt_max_len, and generate_max_len
  • verify the current train_batch_size = 512 implementation strategy under 8 GPUs
  • document the effective gradient-accumulation layout in the launcher itself
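The batch-divisibility preflight and the train_batch_size = 512 under 8 GPUs layout mentioned above reduce to a small arithmetic check; the micro-batch size below is a hypothetical value for illustration, not the launcher's actual setting.

```python
def check_batch_layout(train_batch_size=512, num_gpus=8, micro_batch_size=8):
    """Hypothetical preflight sketch for the batch-divisibility check.

    micro_batch_size is illustrative; the real launcher defines its own
    values and documents the effective gradient-accumulation layout.
    """
    if train_batch_size % num_gpus != 0:
        raise ValueError("train_batch_size must divide evenly across GPUs")
    per_gpu = train_batch_size // num_gpus
    if per_gpu % micro_batch_size != 0:
        raise ValueError("per-GPU batch must divide into micro batches")
    return {
        "per_gpu": per_gpu,
        "grad_accum_steps": per_gpu // micro_batch_size,
    }
```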

Phase 7: Full-Data Training Observation / Stability Validation

Status: done with follow-up

Brief:

  • Phase 7 observation now produces a real bounded observation result again
  • the earlier format-stability failure has been repaired and the latest observation returns healthy_pass = true
  • the remaining follow-up is rollout performance, not format collapse or missing trajectories

Checklist:

  • measure reward distribution statistics
  • measure correctness ratio
  • measure drop-moment hit ratio
  • measure final-answer extraction failure ratio
  • measure image-read failure ratio
  • measure PRM inference failure ratio
  • inspect KL stability
  • inspect actor-loss / policy-gradient stability
  • sample generated texts for sustained Step N: / †Answer: compliance
  • produce a concrete next-fix list from the observation results
  • restore healthy_pass = true for the observation health gate
  • bring rollout speed closer to the standalone URSA baseline
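The ratio metrics in the checklist above can be computed from the per-sample fields that Phase 4 already logs. The sample-dict structure is an assumption; the real analyzer is examples/math_prm/tools/analyze_phase7_observation.py.

```python
def observation_summary(samples):
    """Sketch of a few of the per-run ratio metrics listed above.

    Each sample is assumed to carry the logged per-sample fields
    (outcome_correct, has_drop_moment, final_answer).
    """
    n = len(samples)
    if n == 0:
        return {}
    return {
        "correctness_ratio": sum(s["outcome_correct"] for s in samples) / n,
        "drop_moment_hit_ratio": sum(s["has_drop_moment"] for s in samples) / n,
        "extraction_failure_ratio": sum(s["final_answer"] is None for s in samples) / n,
    }
```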

Phase 8: Paper Data Filtering Pipeline

Status: planned

Brief:

  • the paper-style 20K candidate -> 8 samples -> remove all-correct/all-wrong -> 15.3K pipeline is still intentionally deferred until the chain, reward path, and observation loop are stable

Checklist:

  • add an offline data-preparation script instead of hiding filtering inside the training script
  • sample 20K candidates from full MMathCoT-1M
  • sample 8 outputs per prompt with URSA-8B
  • score correctness for each sample
  • remove prompts that are all-correct or all-wrong across the 8 samples
  • produce the filtered Stage 3 dataset at roughly 15.3K
  • keep the filtered dataset in the same LightRFT-facing manifest schema
  • verify the filtered dataset is still directly consumable by PromptDatasetVL
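The core filtering step in the pipeline above is a straightforward predicate over the 8 per-prompt correctness flags; the input structure here (prompt id -> list of booleans) is an assumption for illustration.

```python
def filter_all_correct_all_wrong(scored_prompts):
    """Sketch of the paper-style filtering step described above.

    scored_prompts maps prompt_id -> list of per-sample correctness flags
    (structure assumed); prompts that are all-correct or all-wrong across
    their 8 samples are dropped.
    """
    return {
        pid: flags
        for pid, flags in scored_prompts.items()
        if any(flags) and not all(flags)
    }
```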

Phase 9: Reproduction Close-Out

Status: planned

Brief:

  • final consolidation after filtered-data training is stable

Checklist:

  • summarize remaining gaps against the paper
  • separate engineering compromises from unfinished work
  • refresh the docs for reward / data / script / hyperparameter status
  • document the minimal reproduction flow
  • organize three run modes: smoke / full-data / filtered-data
  • recommend the final default script and dataset entry point

Detailed Updates Since The Earlier PR State

The earlier PR body was effectively frozen around "Phase 4 ready to start". That is no longer accurate.

What has been completed since then:

  • Phase 4 PS-GRPO reward semantics
  • Phase 5 answer extraction / correctness alignment
  • Phase 6 launcher / hyperparameter alignment and preflight checks
  • Phase 7 observation repair back to healthy_pass = true
  • rollout performance diagnosis with both direct URSA baselines and rollout-like probe scripts
  • example-directory cleanup so the math_prm folder is now centered on URSA-MATH Stage 3 rather than older unrelated example baggage

Rollout / Observation State

Local HF rollout

A standalone validation script now exists at:

  • examples/math_prm/tools/check_hf_rollout.py

This script proves that the local LightRFT hf rollout path for URSA is actually working by:

  • loading URSA-8B
  • calling setup_inference_engine(engine_type="hf")
  • running a real gather_and_generate()
  • comparing the rollout outputs against direct actor.generate() token by token
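The token-by-token comparison at the heart of that proof can be sketched as follows; returning the first mismatch index is an illustrative choice, and the real check lives in examples/math_prm/tools/check_hf_rollout.py.

```python
def first_token_mismatch(rollout_ids, direct_ids):
    """Sketch of a token-by-token comparison: returns the index of the
    first differing token, or None if the sequences match exactly."""
    for i, (a, b) in enumerate(zip(rollout_ids, direct_ids)):
        if a != b:
            return i
    if len(rollout_ids) != len(direct_ids):
        # one sequence is a strict prefix of the other
        return min(len(rollout_ids), len(direct_ids))
    return None
```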

Phase 7 health state

The repaired observation run now reports:

  • healthy_pass = true
  • format_success_ratio = 1.0

At the same time, it still shows that model quality is not solved yet, for example:

  • correctness_ratio = 0.25

So the current interpretation is:

  • functional chain health is back
  • answer quality is still limited
  • performance is still the main engineering follow-up

Rollout performance diagnosis

The current performance story is now much clearer.

Direct standalone URSA generation is not the source of the pathological slowdown. Probe results on rollout-like workloads show:

  • fsdp_train_gc = 683.406s
  • fsdp_train_no_gc = 68.869s
  • fsdp_eval_no_gc = 65.816s
  • raw_eval_no_gc = 44.139s

This makes the current diagnosis much more concrete:

  • the main slowdown is not "URSA is naturally slow"
  • the dominant issue is rollout decode under the training-style FSDP + gradient_checkpointing configuration
  • the key remaining performance task is to close that gap as much as possible under the current LightRFT constraints
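The attribution follows directly from the probe timings quoted above: gradient checkpointing accounts for roughly a 10x slowdown on rollout-like decode, while FSDP wrapping alone costs only about 1.5x over raw eval.

```python
# Overhead attribution computed from the probe timings quoted above.
timings = {
    "fsdp_train_gc": 683.406,
    "fsdp_train_no_gc": 68.869,
    "fsdp_eval_no_gc": 65.816,
    "raw_eval_no_gc": 44.139,
}

# Gradient checkpointing dominates on rollout-like decode.
gc_factor = timings["fsdp_train_gc"] / timings["fsdp_train_no_gc"]
# FSDP wrapping alone, versus raw eval.
fsdp_factor = timings["fsdp_eval_no_gc"] / timings["raw_eval_no_gc"]
```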

Example Directory Cleanup

This branch now also cleans up examples/math_prm/ itself.

Current layout intent:

  • top-level files are only the active training surface and self-contained URSA runtime pieces
  • support scripts, smoke tools, observation tools, and regression checks live under examples/math_prm/tools/
  • reward_models.py and reward_models_utils.py are now math-only and trimmed to the URSA-MATH Stage 3 path
  • URSA_MIGRATION.md and plan/* are explicitly treated as temporary working docs to delete after the migration is closed out

Key Files

Main work in this branch now spans:

  • examples/math_prm/train_colocate.py
  • examples/math_prm/run_grpo_math_prm_ursa_8b.sh
  • examples/math_prm/reward_models.py
  • examples/math_prm/reward_models_utils.py
  • examples/math_prm/ursa_actor.py
  • examples/math_prm/sitecustomize.py
  • examples/math_prm/ursa_model/
  • examples/math_prm/tools/prepare_ursa_stage3_manifest.py
  • examples/math_prm/tools/check_phase2_alignment.py
  • examples/math_prm/tools/check_hf_rollout.py
  • examples/math_prm/tools/check_phase6_script_alignment.py
  • examples/math_prm/tools/test_phase2_alignment.py
  • examples/math_prm/tools/run_phase3_smoke.sh
  • examples/math_prm/tools/run_phase7_observation.sh
  • examples/math_prm/tools/analyze_phase7_observation.py
  • examples/math_prm/tools/probe_rollout_speed_candidates.py
  • lightrft/strategy/strategy_base.py
  • lightrft/trainer/fast_exp_maker.py
  • lightrft/models/actor_language.py
  • lightrft/models/actor_vl.py
  • lightrft/utils/math_prm_output.py
  • plan/MATH_PRM.md
  • plan/URSA_ROLLOUT_ENGINE_FAILURE_ANALYSIS.md
  • plan/PHASE7_FORMAT_STABILITY_ANALYSIS.md
  • plan/PHASE7_HF_ROLLOUT_PERFORMANCE_ANALYSIS.md

Testing

Commands already run in this branch include:

python -m unittest -q examples.math_prm.tools.test_phase2_alignment
python examples/math_prm/tools/check_phase2_alignment.py --device cuda:0
python examples/math_prm/tools/check_phase6_script_alignment.py
python examples/math_prm/tools/check_hf_rollout.py --output-json /data/LightRFT/tmp/ursa_stage3/hf_rollout_check.json
bash examples/math_prm/tools/run_phase3_smoke.sh
bash examples/math_prm/tools/run_phase7_observation.sh
bash -n examples/math_prm/run_grpo_math_prm_ursa_8b.sh

Current testing conclusion:

  • Phase 2 regression / alignment tests passed
  • Phase 2 real smoke alignment matched the reference sample
  • Phase 3 smoke reaches PPO training and cleans up correctly
  • local hf rollout has a standalone minimal proof script and currently passes
  • Phase 4 / 5 reward-path regressions are covered in the current math-only regression suite
  • Phase 6 launcher alignment check passes
  • Phase 7 observation health has been restored to a passing state
  • rollout performance is not yet normalized to the standalone URSA expectation

Review Framing

The most accurate review framing at this point is:

  • Phase 1 done
  • Phase 2 done
  • Phase 3 done as a working engineering baseline
  • Phase 4 done
  • Phase 5 done
  • Phase 6 done
  • Phase 7 done with a remaining performance follow-up
  • Phase 8 and Phase 9 not started yet

So the dominant open question is no longer whether URSA-MATH Stage 3 can run in LightRFT at all.
The dominant open question is how much of the remaining performance gap can be closed while staying within the current LightRFT / frozen-runtime constraints.

@HansBug HansBug marked this pull request as draft March 18, 2026 07:34
@HansBug HansBug force-pushed the dev/math_prm_train branch from f567589 to 9a7fec2 on March 18, 2026 14:11
- keep the URSA-MATH stage3 training path and required runtime wiring
- retain the bilingual README files while limiting them to the minimal upstream surface
- leave validation, profiling, migration notes, and local planning artifacts on the full working branch
@HansBug HansBug force-pushed the dev/math_prm_train branch from 10c5f3f to fec2744 on March 20, 2026 11:22
@puyuan1996 puyuan1996 added the documentation and enhancement labels Mar 20, 2026
HansBug added 2 commits March 21, 2026 02:01
Selectively sync the effective Stage 3 rollout changes from dev/math_prm_train_working into the upstream PR branch.

- add the separate local HF rollout actor option to the PR-surface strategy path
- carry over the current launcher and train_colocate updates needed for the rollout path
- keep working-only docs, plans, tmp files, and auxiliary scripts out of dev/math_prm_train
@puyuan1996 puyuan1996 changed the title feature(math_prm): migrate URSA-MATH stage3 training to LightRFT feature(zsh): migrate URSA-MATH stage3 training to LightRFT Mar 24, 2026
HansBug added 3 commits March 26, 2026 14:57
sync the current stage3 runtime-eval path from dev/math_prm_train_working into the slim PR branch while keeping the documented PR surface consistent.

- add the example-local math_prm trainer wrapper required by train_colocate.py
- carry over runtime eval, separate HF rollout, and related strategy/cli updates
- trim README references so the slim branch no longer points at non-migrated helper docs and scripts
clean existing trailing whitespace in the slim math_prm branch so branch-level diff --check passes after the sync.

- strip trailing spaces from train_colocate and the URSA model files already carried by dev/math_prm_train
- keep the change whitespace-only with no behavior updates
Sync the separate local HF rollout actor refresh fix from dev/math_prm_train_working without bringing plan materials into the PR branch.

- explicitly reload the keep-on-gpu rollout actor after copying updated actor weights
- preserve the rollout sync timing fields for debugging
- source change corresponds to working branch commit 8c77921