feature(zsh): migrate URSA-MATH stage3 training to LightRFT #53

Draft · HansBug wants to merge 6 commits into opendilab:main
Force-pushed from f567589 to 9a7fec2
- keep the URSA-MATH stage3 training path and required runtime wiring
- retain the bilingual README files while limiting them to the minimal upstream surface
- leave validation, profiling, migration notes, and local planning artifacts on the full working branch
Force-pushed from 10c5f3f to fec2744
Selectively sync the effective Stage 3 rollout changes from dev/math_prm_train_working into the upstream PR branch.

- add the separate local HF rollout actor option to the PR-surface strategy path
- carry over the current launcher and train_colocate updates needed for the rollout path
- keep working-only docs, plans, tmp files, and auxiliary scripts out of dev/math_prm_train
(cherry picked from commit 7c5ef73)
Sync the current stage3 runtime-eval path from dev/math_prm_train_working into the slim PR branch while keeping the documented PR surface consistent.

- add the example-local math_prm trainer wrapper required by train_colocate.py
- carry over runtime eval, separate HF rollout, and related strategy/cli updates
- trim README references so the slim branch no longer points at non-migrated helper docs and scripts
Clean existing trailing whitespace in the slim math_prm branch so branch-level `diff --check` passes after the sync.

- strip trailing spaces from train_colocate and the URSA model files already carried by dev/math_prm_train
- keep the change whitespace-only with no behavior updates
Sync the separate local HF rollout actor refresh fix from dev/math_prm_train_working without bringing plan materials into the PR branch.

- explicitly reload the keep-on-gpu rollout actor after copying updated actor weights
- preserve the rollout sync timing fields for debugging
- source change corresponds to working branch commit 8c77921
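The actor-refresh behavior this commit describes can be sketched roughly as follows. This is an illustrative stand-in, not LightRFT's actual API: `RolloutActorCache`, `build_actor`, and `sync_weights` are hypothetical names invented for the sketch; only the behavior (reload the keep-on-GPU rollout actor after a weight copy, and keep a sync timestamp for debugging) comes from the commit message.

```python
# Hypothetical sketch of the keep-on-GPU rollout-actor refresh: after copying
# updated actor weights into the rollout copy, explicitly rebuild/reload it
# instead of trusting a stale cached handle.
import time


class RolloutActorCache:
    """Keeps a rollout actor resident on GPU and refreshes it after weight sync."""

    def __init__(self, build_actor):
        self._build_actor = build_actor   # factory that (re)loads actor weights
        self._actor = None
        self.last_sync_ts = None          # rollout sync timing field kept for debugging

    def sync_weights(self, state_dict):
        # Copy updated training weights, then force a reload of the
        # keep-on-GPU rollout actor so generation sees the fresh weights.
        self._actor = self._build_actor(state_dict)
        self.last_sync_ts = time.time()

    def actor(self):
        if self._actor is None:
            raise RuntimeError("rollout actor not initialized; call sync_weights first")
        return self._actor


cache = RolloutActorCache(build_actor=lambda sd: dict(sd))  # stand-in factory
cache.sync_weights({"layer.weight": [1.0, 2.0]})
print(cache.actor()["layer.weight"])  # fresh weights visible after refresh
```

The point of the explicit reload is that a rollout actor kept on GPU across sync boundaries can otherwise keep generating from pre-sync weights.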
Summary
This PR migrates the URSA-MATH Stage 3 training path into LightRFT under the current frozen Docker baseline, and now also trims the example directory down to the URSA-MATH Stage 3 surface instead of keeping older unrelated example baggage.
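As a reading aid for the reward semantics summarized in the status map below, here is a hedged, illustrative sketch. The constants `rho = 0.3`, `gamma = 0.5`, the `1.0 / 0.5 / 0.0` mapping, the `min(step_scores)` baseline, and the field names (`max_relative_drop`, `has_drop_moment`, `outcome_correct`) come from the PR text; the exact drop-moment rule and the function signatures are assumptions, not LightRFT's implementation.

```python
# Illustrative sketch of the Stage 3 reward semantics (not the actual code).
def math_prm_reward(step_scores):
    """Phase 3 baseline label: reward is the weakest step score."""
    return min(step_scores)


def math_psgrpo_reward(step_scores, outcome_correct, rho=0.3, gamma=0.5):
    """Assumed PS-GRPO-style mapping: penalize a sharp relative drop in step scores."""
    max_relative_drop = 0.0
    for prev, cur in zip(step_scores, step_scores[1:]):
        if prev > 0:
            max_relative_drop = max(max_relative_drop, (prev - cur) / prev)
    has_drop_moment = max_relative_drop > rho  # drop-moment detection via rho
    # gamma = 0.5 reward mapping onto 1.0 / 0.5 / 0.0
    if outcome_correct and not has_drop_moment:
        return 1.0
    if outcome_correct and has_drop_moment:
        return gamma  # right answer, but unstable step scores
    return 0.0


print(math_prm_reward([0.9, 0.4, 0.8]))                            # 0.4
print(math_psgrpo_reward([0.9, 0.4, 0.8], outcome_correct=True))   # 0.5
print(math_psgrpo_reward([0.9, 0.8, 0.85], outcome_correct=True))  # 1.0
```

Under this reading, `math_prm` stays available as the Phase 3 baseline while `math_psgrpo` adds trajectory-shape information on top of outcome correctness.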
Current high-level state:
- `hf` rollout for URSA is working and has standalone proofs / regression coverage

Working notes that still exist during the migration:

- plan/MATH_PRM.md
- plan/URSA_ROLLOUT_ENGINE_FAILURE_ANALYSIS.md
- plan/PHASE7_FORMAT_STABILITY_ANALYSIS.md
- plan/PHASE7_HF_ROLLOUT_PERFORMANCE_ANALYSIS.md

Status Map
This section is the current project map.
Phase 1: Data Path / Schema / Scope

Status: done

Brief:

- `prompt` / `images` / `reference` / `label`
- manifest path
- `MMathCoT-1M` manifest is intentionally used first

Checklist:

- `PromptDatasetVL` can consume the manifest

Phase 2: URSA / PRM Alignment
Status: done

Brief:

Checklist:

- `URSA-8B` with explicit `UrsaProcessor.from_pretrained(...)`
- `URSA-RM-8B` to stay on the direct HF path
- `MathPRMReward`

Phase 3: Full-Data Baseline
`math_prm` Training Chain

Status: done

Brief:

- `dataloader -> rollout -> reward -> PPO train -> checkpoint / trajectory save -> cleanup`
- `hf` rollout is now the stable engineering path for URSA under the frozen runtime

Checklist:
- `math_prm = min(step_scores)`
- `<think>`-style format reward from the effective `math_prm` path
- `dataloader -> rollout -> reward -> PPO train -> cleanup`
- `~943-946` regime to a reasonable smoke regime

Phase 4: PS-GRPO Reward Semantics
Status: done

Brief:

- from `min(step_scores)` to the paper-aligned PS-GRPO-style reward path
- `math_prm` is preserved as the baseline label and `math_psgrpo` is introduced as the Stage 3 reward path

Checklist:
- `math_psgrpo`
- `math_prm` reserved as the Phase 3 baseline reward
- `step_scores`
- `rho = 0.3` drop-moment detection
- `gamma = 0.5` reward mapping `1.0 / 0.5 / 0.0`
- `step_scores`, `max_relative_drop`, `has_drop_moment`, `outcome_correct`, and `final_reward`

Phase 5: Answer Extraction / Correctness Alignment
Status: done

Brief:

Checklist:

- `mathruler` where appropriate
- `†Answer:` is missing

Phase 6: Training Script / Hyperparameter Alignment
Status: done

Brief:

Checklist:

- `n_samples_per_prompt`, `temperature`, `init_kl_coef`, `actor_learning_rate`, `prompt_max_len`, and `generate_max_len`
- `train_batch_size = 512` implementation strategy under 8 GPUs

Phase 7: Full-Data Training Observation / Stability Validation
Status: done with follow-up

Brief:

- `healthy_pass = true`

Checklist:

- `Step N:` / `†Answer:` compliance
- `healthy_pass = true` for the observation health gate

Phase 8: Paper Data Filtering Pipeline
Step N:/†Answer:compliancehealthy_pass = truefor the observation health gatePhase 8: Paper Data Filtering Pipeline
Status: planned

Brief:

- the `20K candidate -> 8 samples -> remove all-correct/all-wrong -> 15.3K` pipeline is still intentionally deferred until the chain, reward path, and observation loop are stable

Checklist:

- `20K` candidates from full `MMathCoT-1M`
- `8` outputs per prompt with `URSA-8B`
- `15.3K`
- `PromptDatasetVL`

Phase 9: Reproduction Close-Out
Status: planned

Brief:

Checklist:
Detailed Updates Since The Earlier PR State
The earlier PR body was effectively frozen around "Phase 4 ready to start". That is no longer accurate.
What has been completed since then:
- `healthy_pass = true`
- the `math_prm` folder is now centered on URSA-MATH Stage 3 rather than older unrelated example baggage

Rollout / Observation State
Local HF rollout
A standalone validation script now exists at:
`examples/math_prm/tools/check_hf_rollout.py`

This script proves that the local LightRFT `hf` rollout path for URSA is actually working by:

- `URSA-8B`
- `setup_inference_engine(engine_type="hf")`
- `gather_and_generate()`
- `actor.generate()` token by token

Phase 7 health state
The repaired observation run now reports:
- `healthy_pass = true`
- `format_success_ratio = 1.0`

At the same time, it still shows that model quality is not solved yet, for example:

- `correctness_ratio = 0.25`

So the current interpretation is:
Rollout performance diagnosis
The current performance story is now much clearer.
Direct standalone URSA generation is not the source of the pathological slowdown. Probe results on rollout-like workloads show:
- `fsdp_train_gc = 683.406s`
- `fsdp_train_no_gc = 68.869s`
- `fsdp_eval_no_gc = 65.816s`
- `raw_eval_no_gc = 44.139s`

This makes the current diagnosis much more concrete:

- `FSDP + gradient_checkpointing` configuration

Example Directory Cleanup
This branch now also cleans up `examples/math_prm/` itself.

Current layout intent:

- `examples/math_prm/tools/`
- `reward_models.py` and `reward_models_utils.py` are now math-only and trimmed to the URSA-MATH Stage 3 path
- `URSA_MIGRATION.md` and `plan/*` are explicitly treated as temporary working docs to delete after the migration is closed out

Key Files
Main work in this branch now spans:
- `examples/math_prm/train_colocate.py`
- `examples/math_prm/run_grpo_math_prm_ursa_8b.sh`
- `examples/math_prm/reward_models.py`
- `examples/math_prm/reward_models_utils.py`
- `examples/math_prm/ursa_actor.py`
- `examples/math_prm/sitecustomize.py`
- `examples/math_prm/ursa_model/`
- `examples/math_prm/tools/prepare_ursa_stage3_manifest.py`
- `examples/math_prm/tools/check_phase2_alignment.py`
- `examples/math_prm/tools/check_hf_rollout.py`
- `examples/math_prm/tools/check_phase6_script_alignment.py`
- `examples/math_prm/tools/test_phase2_alignment.py`
- `examples/math_prm/tools/run_phase3_smoke.sh`
- `examples/math_prm/tools/run_phase7_observation.sh`
- `examples/math_prm/tools/analyze_phase7_observation.py`
- `examples/math_prm/tools/probe_rollout_speed_candidates.py`
- `lightrft/strategy/strategy_base.py`
- `lightrft/trainer/fast_exp_maker.py`
- `lightrft/models/actor_language.py`
- `lightrft/models/actor_vl.py`
- `lightrft/utils/math_prm_output.py`
- `plan/MATH_PRM.md`
- `plan/URSA_ROLLOUT_ENGINE_FAILURE_ANALYSIS.md`
- `plan/PHASE7_FORMAT_STABILITY_ANALYSIS.md`
- `plan/PHASE7_HF_ROLLOUT_PERFORMANCE_ANALYSIS.md`

Testing
Commands already run in this branch include:
Current testing conclusion:
- `hf` rollout has a standalone minimal proof script and currently passes

Review Framing
The most accurate review framing at this point is:
So the dominant open question is no longer whether URSA-MATH Stage 3 can run in LightRFT at all.
The dominant open question is how much of the remaining performance gap can be closed while staying within the current LightRFT / frozen-runtime constraints.
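For illustration of the probe methodology behind the `fsdp_train_gc` vs `fsdp_train_no_gc` numbers above, here is a minimal timing-probe sketch. The workload is a synthetic stand-in, not URSA: `extra_recompute` crudely models the activation recomputation that gradient checkpointing adds, and none of these names correspond to the real `probe_rollout_speed_candidates.py`.

```python
# Minimal sketch of a configuration-comparison timing probe: run the same
# generation-like workload under several configurations and compare wall time.
import time


def probe(label, workload, repeats=3):
    """Time `repeats` runs of `workload` and report the total wall time."""
    start = time.perf_counter()
    for _ in range(repeats):
        workload()
    elapsed = time.perf_counter() - start
    print(f"{label} = {elapsed:.3f}s")
    return elapsed


def fake_rollout(extra_recompute=0):
    # Stand-in for a rollout-like compute loop; extra_recompute models the
    # redundant forward passes gradient checkpointing introduces.
    total = 0
    for _ in range(1 + extra_recompute):
        total += sum(i * i for i in range(50_000))
    return total


t_gc = probe("fsdp_train_gc", lambda: fake_rollout(extra_recompute=9))
t_no_gc = probe("fsdp_train_no_gc", lambda: fake_rollout())
```

A probe shaped like this is what lets the PR attribute the slowdown to the `FSDP + gradient_checkpointing` configuration rather than to URSA generation itself: the same workload is timed while only one configuration axis changes.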