Commit fd824ca

docs(math_prm): remove legacy example files
1 parent b3c0971 commit fd824ca

6 files changed, 214 additions and 634 deletions

examples/math_prm/README.md

Lines changed: 107 additions & 9 deletions
@@ -32,17 +32,115 @@ The runtime baseline is frozen by `/data/LightRFT/Dockerfile`.
 
 ```text
 examples/math_prm/
-├── prepare_ursa_stage3_manifest.py # Convert URSA raw jsonl into LightRFT prompt manifest
-├── train_colocate.py # Main GRPO training entry
-├── run_grpo_math_prm_ursa_8b.sh # URSA-8B + URSA-RM-8B training launcher
-├── reward_models.py # Reward implementations, including MathPRMReward
-├── reward_models_utils.py # Reward model loading and routing
-├── prm_infer_score.py # Step-level PRM scoring helpers
-├── test_reward_models.py # Reward-side tests
-├── URSA_MIGRATION.md # Migration notes from URSA-MATH
-└── ursa_model/ # Self-contained URSA model code
+├── README.md # This file, focused on the current Stage 3 path
+├── README_zh.md # Chinese version of this directory guide
+├── URSA_MIGRATION.md # Migration notes from the original URSA-MATH repo
+├── train_colocate.py # Main LightRFT training entry used by all current launchers
+├── run_grpo_math_prm_ursa_8b.sh # Main Stage 3 reproduction launcher
+├── ursa_actor.py # URSA-specific actor wrapper for policy loading
+├── reward_models.py # Reward implementations; active path is MathPRMReward / PS-GRPO logic
+├── reward_models_utils.py # Reward model loading, reward_fn routing, and label-to-recipe wiring
+├── prepare_ursa_stage3_manifest.py # Convert raw MMathCoT-1M jsonl into LightRFT manifest
+├── prm_infer_score.py # Step-level PRM scoring helper mirrored from URSA-MATH logic
+├── prepare_ursa_engine_checkpoint.py # Optional wrapper builder for vLLM/SGLang-style engine experiments
+├── sitecustomize.py # Local import/runtime compatibility hook for the URSA example stack
+├── check_phase2_alignment.py # Phase 2 scorer parity check against the URSA reference behavior
+├── check_hf_rollout.py # Minimal local-HF rollout validation for URSA
+├── check_phase6_script_alignment.py # Lightweight checker for Stage 3 launcher defaults
+├── test_phase2_alignment.py # Current regression tests for Phase 2/4/5/6 logic
+├── run_phase3_smoke.sh # Time-boxed Phase 3 smoke launcher
+├── run_phase7_observation.sh # Bounded full-data observation launcher
+├── analyze_phase7_observation.py # Offline analyzer for saved Phase 7 trajectories/logs
+├── probe_rollout_speed_candidates.py # Performance probe for rollout-like generate modes without changing library code
+└── ursa_model/ # Self-contained URSA model code used by actor and PRM loading
 ```
 
+## File Roles
+
+### 1. Primary training path
+
+- `run_grpo_math_prm_ursa_8b.sh`
+  - Main launcher for the current URSA-MATH Stage 3 reproduction path.
+  - Wires actor path, reward path, dataset path, FSDP settings, W&B, and rollout options.
+- `train_colocate.py`
+  - Real training entry used by `torchrun`.
+  - Loads actor / reference / reward model / dataset / trainer and starts the LightRFT PPO-GRPO loop.
+- `ursa_actor.py`
+  - URSA-specific actor wrapper.
+  - Makes LightRFT load `UrsaForConditionalGeneration` instead of a generic VLM auto-class.
+
+### 2. Reward and scoring path
+
+- `reward_models.py`
+  - Contains all reward model classes used in this example directory.
+  - The active Stage 3 path is `MathPRMReward` and the PS-GRPO reward mapping built on top of URSA-RM-8B.
+  - Historical Qwen2VL multi-reward classes are still present in this file, but they are not part of the current URSA-MATH Stage 3 training path.
+- `reward_models_utils.py`
+  - Handles reward model loading and reward function dispatch.
+  - Maps labels such as `math_prm` and `math_psgrpo` onto the current URSA reward path.
+- `prm_infer_score.py`
+  - Standalone helper for step-level PRM inference.
+  - Useful when comparing LightRFT reward behavior against URSA-MATH reference scoring.
+
+### 3. Data preparation and compatibility
+
+- `prepare_ursa_stage3_manifest.py`
+  - Converts raw `MMathCoT-1M` Stage 3 data into the `prompt / images / reference / label` schema expected by LightRFT.
+  - Also performs a lightweight dataset/collate smoke check.
+- `prepare_ursa_engine_checkpoint.py`
+  - Optional helper for engine experiments.
+  - Builds an engine-friendly wrapper checkpoint with the local URSA model code and `auto_map` metadata so vLLM/SGLang can at least attempt to load URSA.
+- `sitecustomize.py`
+  - Local runtime/import hook used to keep this example stack compatible under the frozen Docker baseline.
+
+### 4. Validation, smoke, and observation tools
+
+- `check_phase2_alignment.py`
+  - Verifies that LightRFT `MathPRMReward` remains aligned with the URSA reference scorer on a concrete sample.
+- `check_hf_rollout.py`
+  - Minimal end-to-end validation for LightRFT local `hf` rollout.
+  - Compares `gather_and_generate()` output against direct `actor.generate()`.
+- `check_phase6_script_alignment.py`
+  - Static checker that confirms the Stage 3 launcher still matches the intended defaults.
+- `test_phase2_alignment.py`
+  - Current regression test file for the URSA Stage 3 path.
+  - Covers alignment, reward mapping, answer extraction, rollout helper behavior, and related utilities.
+- `run_phase3_smoke.sh`
+  - Time-boxed Phase 3 smoke launcher for “can it run, does it trend normally, and do we clean up GPUs afterward”.
+- `run_phase7_observation.sh`
+  - Bounded full-data observation launcher for later-stage analysis.
+- `analyze_phase7_observation.py`
+  - Offline analyzer for saved trajectories and training logs.
+  - Computes the Phase 7 health checklist and PRM image-ablation summary.
+- `probe_rollout_speed_candidates.py`
+  - Minimal benchmark for rollout-like decode speed.
+  - Used to compare `fsdp_train_gc`, `fsdp_train_no_gc`, `fsdp_eval_no_gc`, and `raw_eval_no_gc` without modifying `lightrft/` itself.
+
+### 5. Self-contained URSA runtime
+
+- `ursa_model/`
+  - Local copy of the URSA model stack needed by both the actor and the PRM.
+  - Includes config, processor, image processor, projector, vision backbones, and model definitions.
+  - This directory is what allows the current Stage 3 path to run without depending on importing code directly from the external URSA-MATH repo.
+
+## Current Entry Points
+
+If you only care about the active Stage 3 reproduction path, the files you usually need are:
+
+- `run_grpo_math_prm_ursa_8b.sh`
+- `train_colocate.py`
+- `reward_models.py`
+- `reward_models_utils.py`
+- `prepare_ursa_stage3_manifest.py`
+- `check_hf_rollout.py`
+- `test_phase2_alignment.py`
+
+Everything else in this directory is either:
+
+- a one-off compatibility helper,
+- a smoke/observation tool,
+- or part of the self-contained URSA runtime.
+
 ## Local Resources
 
 The current machine layout is:
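
Reviewer note: the README diff above says `prepare_ursa_stage3_manifest.py` converts raw `MMathCoT-1M` records into a `prompt / images / reference / label` schema. The script itself is not part of this diff, so the following is only a minimal sketch of what such a conversion could look like. The raw field names (`question`, `image`, `answer`), the default label value, and both function names are assumptions made for illustration, not the repository's actual code.

```python
import json
from typing import Any, Dict, Iterable, List


def to_manifest_record(raw: Dict[str, Any], label: str = "math_psgrpo") -> Dict[str, Any]:
    """Map one raw record onto the prompt / images / reference / label schema.

    The raw keys "question", "image", and "answer" are hypothetical.
    """
    image = raw.get("image")
    return {
        "prompt": raw["question"].strip(),
        # Text-only samples get an empty image list.
        "images": [image] if image else [],
        "reference": str(raw["answer"]).strip(),
        "label": label,
    }


def convert_jsonl_lines(lines: Iterable[str], label: str = "math_psgrpo") -> List[Dict[str, Any]]:
    """Convert an iterable of jsonl lines, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        records.append(to_manifest_record(json.loads(line), label))
    return records
```

Passing an open file handle to `convert_jsonl_lines` works because a text file iterates line by line; the result can then be written back out with `json.dumps` per record.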

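Reviewer note: `check_phase6_script_alignment.py` is described above as a static checker that confirms the Stage 3 launcher still matches the intended defaults. One way such a check could work is sketched below, assuming the launcher declares its defaults with the common shell idiom `VAR="${VAR:-default}"`. That assumption, the variable names used in the usage note, and both function names are hypothetical; the real checker may work differently.

```python
import re
from typing import Dict, Optional, Tuple

# Matches lines such as:  N_SAMPLES="${N_SAMPLES:-8}"  or  export LR=${LR:-1e-6}
_DEFAULT_RE = re.compile(r'^\s*(?:export\s+)?(\w+)="?\$\{(\w+):-([^}]*)\}"?\s*$')


def parse_shell_defaults(script_text: str) -> Dict[str, str]:
    """Collect VAR -> default for every VAR=${VAR:-default} line in a script."""
    defaults = {}
    for line in script_text.splitlines():
        match = _DEFAULT_RE.match(line)
        # Only accept the self-referencing form, where both names agree.
        if match and match.group(1) == match.group(2):
            defaults[match.group(1)] = match.group(3)
    return defaults


def find_mismatches(
    script_text: str, expected: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, actual)} for every default that differs or is missing."""
    actual = parse_shell_defaults(script_text)
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }
```

For example, `find_mismatches(open("launcher.sh").read(), {"N_SAMPLES": "8"})` would return an empty dict when the (hypothetical) `N_SAMPLES` default is still `8`, and a non-empty dict otherwise, which a CI step could turn into a failing exit code.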
examples/math_prm/README_zh.md

Lines changed: 107 additions & 8 deletions
@@ -28,17 +28,116 @@ Implementation directory for migrating URSA-MATH Stage 3 PS-GRPO training to LightRFT.
 
 ```text
 examples/math_prm/
-├── prepare_ursa_stage3_manifest.py # Convert URSA raw jsonl into a LightRFT prompt manifest
-├── train_colocate.py # Main training entry
-├── run_grpo_math_prm_ursa_8b.sh # URSA-8B + URSA-RM-8B training script
-├── reward_models.py # Reward implementations, including MathPRMReward
-├── reward_models_utils.py # Reward model loading and routing
-├── prm_infer_score.py # Step-level PRM scoring logic
-├── test_reward_models.py # Reward-side tests
-├── URSA_MIGRATION.md # Notes on the migration from URSA-MATH
+├── README.md # English guide to the current Stage 3 path
+├── README_zh.md # Chinese guide to this directory's structure
+├── URSA_MIGRATION.md # Notes on migrating from the original URSA-MATH repo to LightRFT
+├── train_colocate.py # Current main training entry
+├── run_grpo_math_prm_ursa_8b.sh # Main script for the current Stage 3 reproduction experiments
+├── ursa_actor.py # Custom actor wrapper for the URSA policy model
+├── reward_models.py # Reward implementations; the current main path is MathPRMReward / PS-GRPO
+├── reward_models_utils.py # Reward model loading, label routing, and reward_fn assembly
+├── prepare_ursa_stage3_manifest.py # Convert raw MMathCoT-1M Stage 3 data into a LightRFT manifest
+├── prm_infer_score.py # Step-level PRM scoring helper script
+├── prepare_ursa_engine_checkpoint.py # Helper that prepares a wrapper checkpoint for vLLM/SGLang experiments
+├── sitecustomize.py # Local compatibility patch entry for the current example stack
+├── check_phase2_alignment.py # Phase 2 scoring alignment check script
+├── check_hf_rollout.py # Minimal end-to-end check of local HF rollout
+├── check_phase6_script_alignment.py # Checker for the Stage 3 launcher's default configuration
+├── test_phase2_alignment.py # Regression tests for the current URSA Stage 3 path
+├── run_phase3_smoke.sh # Time-boxed Phase 3 smoke-run script
+├── run_phase7_observation.sh # Launcher for full-data bounded observation
+├── analyze_phase7_observation.py # Phase 7 offline analysis script
+├── probe_rollout_speed_candidates.py # Rollout speed comparison script that leaves library code unchanged
 └── ursa_model/ # Self-contained URSA model code
 ```
 
+## File Roles
+
+### 1. Main training path
+
+- `run_grpo_math_prm_ursa_8b.sh`
+  - The main launcher for the current URSA-MATH Stage 3 reproduction experiments.
+  - Assembles the actor, reward, data, FSDP, W&B, and rollout-related arguments.
+- `train_colocate.py`
+  - The real training entry invoked directly by `torchrun`.
+  - Loads actor / reference / reward model / dataset / trainer and starts the LightRFT training loop.
+- `ursa_actor.py`
+  - URSA-specific actor wrapper.
+  - Makes LightRFT load the policy model with `UrsaForConditionalGeneration`.
+
+### 2. Reward and scoring path
+
+- `reward_models.py`
+  - All reward model implementations in this directory live here.
+  - What is actually active now is `MathPRMReward` and the PS-GRPO reward mapping based on URSA-RM-8B.
+  - The file still keeps some historical Qwen2VL multi-reward classes, but they are no longer part of the current URSA-MATH Stage 3 main path.
+- `reward_models_utils.py`
+  - Handles reward model loading, label-to-recipe mapping, and reward_fn assembly.
+  - The current routing logic for `math_prm` / `math_psgrpo` lives here.
+- `prm_infer_score.py`
+  - Standalone step-level PRM scoring helper script.
+  - Suited to single-point comparisons between LightRFT behavior and the URSA-MATH reference implementation.
+
+### 3. Data preparation and compatibility
+
+- `prepare_ursa_stage3_manifest.py`
+  - Converts raw `MMathCoT-1M` Stage 3 data into the `prompt / images / reference / label` schema that LightRFT needs.
+  - Also runs a lightweight dataset/collate smoke check.
+- `prepare_ursa_engine_checkpoint.py`
+  - Not required by the current `hf` main line, but still used for engine experiments.
+  - Builds a wrapper checkpoint carrying the local `ursa_model` code and `auto_map` metadata so that vLLM/SGLang can attempt to load URSA.
+- `sitecustomize.py`
+  - Provides local runtime compatibility patches for the example directory under the current frozen Docker baseline.
+
+### 4. Validation, smoke, and observation tools
+
+- `check_phase2_alignment.py`
+  - Checks whether LightRFT's `MathPRMReward` stays consistent with the URSA reference scorer.
+- `check_hf_rollout.py`
+  - Minimal end-to-end check of local `hf` rollout.
+  - Compares the output of `gather_and_generate()` against a direct `actor.generate()` call.
+- `check_phase6_script_alignment.py`
+  - Statically checks whether the current Stage 3 launcher defaults are still aligned.
+- `test_phase2_alignment.py`
+  - Regression test suite for the current URSA Stage 3 main path.
+  - Covers scorer alignment, reward mapping, answer extraction, rollout helper logic, and more.
+- `run_phase3_smoke.sh`
+  - Time-boxed Phase 3 smoke-run script.
+  - Used to verify “does training start normally, are the metrics reasonable, and are the GPUs cleaned up afterward”.
+- `run_phase7_observation.sh`
+  - Launcher for bounded full-data observation.
+- `analyze_phase7_observation.py`
+  - Runs offline analysis on the trajectories and training logs saved in Phase 7.
+  - Used to compute the health checklist and the PRM image-ablation results.
+- `probe_rollout_speed_candidates.py`
+  - Minimal timing script that compares the speed of several rollout-like decode run modes without modifying the `lightrft/` main code path.
+  - Currently used mainly to confirm that `gradient_checkpointing` is the main cause of the rollout speed problem.
+
+### 5. Self-contained URSA runtime
+
+- `ursa_model/`
+  - Local copy of the URSA model stack.
+  - Includes the config, processor, image processor, projector, vision tower, and model definitions.
+  - This is what lets the current Stage 3 path run directly without the external URSA-MATH repo.
+
+## Entry Points That Currently Matter
+
+If you only care about the current URSA-MATH Stage 3 reproduction main line, you usually only need to focus on these files:
+
+- `run_grpo_math_prm_ursa_8b.sh`
+- `train_colocate.py`
+- `reward_models.py`
+- `reward_models_utils.py`
+- `prepare_ursa_stage3_manifest.py`
+- `check_hf_rollout.py`
+- `test_phase2_alignment.py`
+
+Most other files in the directory are:
+
+- compatibility helper scripts,
+- smoke / observation / profiling tools,
+- or self-contained URSA runtime code.
+
 ## Local Resource Paths
 
 The key resource paths on the current machine are:
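
Reviewer note: both READMEs describe `probe_rollout_speed_candidates.py` as a minimal benchmark that compares several rollout-like decode modes (`fsdp_train_gc`, `fsdp_train_no_gc`, `fsdp_eval_no_gc`, `raw_eval_no_gc`), and the Chinese README adds that it is mainly used to confirm that `gradient_checkpointing` drives the rollout slowdown. The script is not shown in this diff, so the harness below is only a sketch of the general measurement pattern: warm up, time each candidate several times, and compare medians. The function names are hypothetical, and the candidates are plain callables rather than real model wrappers.

```python
import time
from statistics import median
from typing import Callable, Dict


def benchmark(
    candidates: Dict[str, Callable[[], None]], warmup: int = 1, repeats: int = 3
) -> Dict[str, float]:
    """Return the median wall-clock seconds per call for each named candidate."""
    results = {}
    for name, fn in candidates.items():
        # Warm-up calls are excluded from timing (caches, lazy init, compilation).
        for _ in range(warmup):
            fn()
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        results[name] = median(samples)
    return results


def slowdown_vs(results: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Express every candidate's time as a multiple of the baseline's time."""
    base = results[baseline]
    return {name: secs / base for name, secs in results.items()}
```

In a real probe each callable would wrap one generate call in the corresponding mode, and GPU work would need a `torch.cuda.synchronize()` before each clock read so that asynchronous kernels are included in the measurement. Calling `slowdown_vs(results, "raw_eval_no_gc")` would then show how much each wrapped mode costs relative to the plain eval-mode baseline.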
