diff --git a/examples/math_prm/README.md b/examples/math_prm/README.md
Lines changed: 107 additions & 9 deletions
diff --git a/examples/math_prm/README_zh.md b/examples/math_prm/README_zh.md
Lines changed: 107 additions & 8 deletions
@@ -32,17 +32,115 @@ The runtime baseline is frozen by `/data/LightRFT/Dockerfile`.
 
 ```text
 examples/math_prm/
-├── prepare_ursa_stage3_manifest.py   # Convert URSA raw jsonl into LightRFT prompt manifest
-├── train_colocate.py                 # Main GRPO training entry
-├── run_grpo_math_prm_ursa_8b.sh      # URSA-8B + URSA-RM-8B training launcher
-├── reward_models.py                  # Reward implementations, including MathPRMReward
-├── reward_models_utils.py            # Reward model loading and routing
-├── prm_infer_score.py                # Step-level PRM scoring helpers
-├── test_reward_models.py             # Reward-side tests
-├── URSA_MIGRATION.md                 # Migration notes from URSA-MATH
-└── ursa_model/                       # Self-contained URSA model code
+├── README.md                         # This file, focused on the current Stage 3 path
+├── README_zh.md                      # Chinese version of this directory guide
+├── URSA_MIGRATION.md                 # Migration notes from the original URSA-MATH repo
+├── train_colocate.py                 # Main LightRFT training entry used by all current launchers
+├── run_grpo_math_prm_ursa_8b.sh      # Main Stage 3 reproduction launcher
+├── ursa_actor.py                     # URSA-specific actor wrapper for policy loading
+├── reward_models.py                  # Reward implementations; active path is MathPRMReward / PS-GRPO logic
+├── reward_models_utils.py            # Reward model loading, reward_fn routing, and label-to-recipe wiring
+├── prepare_ursa_stage3_manifest.py   # Convert raw MMathCoT-1M jsonl into LightRFT manifest
+├── prm_infer_score.py                # Step-level PRM scoring helper mirrored from URSA-MATH logic
+├── prepare_ursa_engine_checkpoint.py # Optional wrapper builder for vLLM/SGLang-style engine experiments
+├── sitecustomize.py                  # Local import/runtime compatibility hook for the URSA example stack
+├── check_phase2_alignment.py         # Phase 2 scorer parity check against the URSA reference behavior
+├── check_hf_rollout.py               # Minimal local-HF rollout validation for URSA
+├── check_phase6_script_alignment.py  # Lightweight checker for Stage 3 launcher defaults
+├── test_phase2_alignment.py          # Current regression tests for Phase 2/4/5/6 logic
+├── run_phase3_smoke.sh               # Time-boxed Phase 3 smoke launcher
+├── run_phase7_observation.sh         # Bounded full-data observation launcher
+├── analyze_phase7_observation.py     # Offline analyzer for saved Phase 7 trajectories/logs
+├── probe_rollout_speed_candidates.py # Performance probe for rollout-like generate modes without changing library code
+└── ursa_model/                       # Self-contained URSA model code used by actor and PRM loading
 ```
 
+## File Roles
+
+### 1. Primary training path
+
+- `run_grpo_math_prm_ursa_8b.sh`
+  - Main launcher for the current URSA-MATH Stage 3 reproduction path.
+  - Wires actor path, reward path, dataset path, FSDP settings, W&B, and rollout options.
+- `train_colocate.py`
+  - Real training entry used by `torchrun`.
+  - Loads the actor, reference model, reward model, dataset, and trainer, then starts the LightRFT PPO-GRPO loop.
+- `ursa_actor.py`
+  - URSA-specific actor wrapper.
+  - Makes LightRFT load `UrsaForConditionalGeneration` instead of a generic VLM auto-class.
+
+### 2. Reward and scoring path
+
+- `reward_models.py`
+  - Contains all reward model classes used in this example directory.
+  - The active Stage 3 path is `MathPRMReward` and the PS-GRPO reward mapping built on top of URSA-RM-8B.
+  - Historical Qwen2VL multi-reward classes are still present in this file, but they are not part of the current URSA-MATH Stage 3 training path.
+- `reward_models_utils.py`
+  - Handles reward model loading and reward function dispatch.
+  - Maps labels such as `math_prm` and `math_psgrpo` onto the current URSA reward path.
+- `prm_infer_score.py`
+  - Standalone helper for step-level PRM inference.
+  - Useful when comparing LightRFT reward behavior against URSA-MATH reference scoring.
+
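The label routing described for `reward_models_utils.py` can be pictured as a small registry lookup. The sketch below is illustrative only: `REWARD_REGISTRY`, `route_reward`, and the two scoring functions are hypothetical names, and the mean/min aggregations are placeholder stand-ins, not the actual `MathPRMReward` or PS-GRPO logic. Only the labels `math_prm` and `math_psgrpo` come from the README.

```python
from typing import Callable, Dict, Sequence

# A reward function maps per-step PRM scores to one scalar reward.
RewardFn = Callable[[Sequence[float]], float]


def _mean_step_score(step_scores: Sequence[float]) -> float:
    """Placeholder aggregation: average of the step scores."""
    return sum(step_scores) / len(step_scores) if step_scores else 0.0


def _min_step_score(step_scores: Sequence[float]) -> float:
    """Placeholder aggregation: weakest step decides the reward."""
    return min(step_scores) if step_scores else 0.0


# Hypothetical registry; the real mapping lives in reward_models_utils.py.
REWARD_REGISTRY: Dict[str, RewardFn] = {
    "math_prm": _mean_step_score,
    "math_psgrpo": _min_step_score,
}


def route_reward(label: str, step_scores: Sequence[float]) -> float:
    """Dispatch a sample to the reward function registered for its label."""
    try:
        reward_fn = REWARD_REGISTRY[label]
    except KeyError:
        raise ValueError(f"unknown reward label: {label!r}") from None
    return reward_fn(step_scores)
```

Failing loudly on an unknown label, rather than falling back to a default reward, keeps a mislabeled manifest row from silently training against the wrong signal.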
+### 3. Data preparation and compatibility
+
+- `prepare_ursa_stage3_manifest.py`
+  - Converts raw `MMathCoT-1M` Stage 3 data into the `prompt / images / reference / label` schema expected by LightRFT.
+  - Also performs a lightweight dataset/collate smoke check.
+- `prepare_ursa_engine_checkpoint.py`
+  - Optional helper for engine experiments.
+  - Builds an engine-friendly wrapper checkpoint with the local URSA model code and `auto_map` metadata so vLLM/SGLang can at least attempt to load URSA.
+- `sitecustomize.py`
+  - Local runtime/import hook used to keep this example stack compatible under the frozen Docker baseline.
+
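As a rough picture of the `prompt / images / reference / label` schema mentioned for `prepare_ursa_stage3_manifest.py`, the sketch below converts raw jsonl lines into manifest records. The raw-side field names (`question`, `image`, `answer`) and the helper names are assumptions made for illustration; the real script may read different keys and do more validation.

```python
import json
from typing import Iterable, List


def to_manifest_record(raw: dict, label: str = "math_psgrpo") -> dict:
    """Map one raw sample onto the four-field manifest schema.

    The raw keys "question", "image", and "answer" are hypothetical.
    """
    image = raw.get("image")
    return {
        "prompt": raw["question"],
        "images": [image] if image else [],  # always a list, possibly empty
        "reference": raw["answer"],
        "label": label,
    }


def convert_lines(lines: Iterable[str], label: str = "math_psgrpo") -> List[dict]:
    """Parse jsonl lines, skipping blank ones, and convert each record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        records.append(to_manifest_record(json.loads(line), label))
    return records
```

A caller would typically pass an open file handle as `lines` and write each returned record back out as one JSON object per line.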
+### 4. Validation, smoke, and observation tools
+
+- `check_phase2_alignment.py`
+  - Verifies that LightRFT `MathPRMReward` remains aligned with the URSA reference scorer on a concrete sample.
+- `check_hf_rollout.py`
+  - Minimal end-to-end validation for LightRFT local `hf` rollout.
+  - Compares `gather_and_generate()` output against direct `actor.generate()`.
+- `check_phase6_script_alignment.py`
+  - Static checker that confirms the Stage 3 launcher still matches the intended defaults.
+- `test_phase2_alignment.py`
+  - Current regression test file for the URSA Stage 3 path.
+  - Covers alignment, reward mapping, answer extraction, rollout helper behavior, and related utilities.
+- `run_phase3_smoke.sh`
+  - Time-boxed Phase 3 smoke launcher that checks three things: training starts, metrics trend normally, and GPUs are released afterward.
+- `run_phase7_observation.sh`
+  - Bounded full-data observation launcher for later-stage analysis.
+- `analyze_phase7_observation.py`
+  - Offline analyzer for saved trajectories and training logs.
+  - Computes the Phase 7 health checklist and PRM image-ablation summary.
+- `probe_rollout_speed_candidates.py`
+  - Minimal benchmark for rollout-like decode speed.
+  - Used to compare `fsdp_train_gc`, `fsdp_train_no_gc`, `fsdp_eval_no_gc`, and `raw_eval_no_gc` without modifying `lightrft/` itself.
+
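The comparison that `check_hf_rollout.py` is described as doing (rollout output versus a direct `actor.generate()` call) comes down to checking two token sequences for equality and reporting where they diverge. The helper below is a minimal sketch of that idea on plain lists of token IDs; `first_mismatch` is a hypothetical name and does not come from the script itself.

```python
from typing import Optional, Sequence


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, the mismatch index is the shorter length.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

An exact match is only a meaningful expectation under deterministic decoding (for example greedy decoding with identical inputs); with sampling enabled, the two paths can legitimately differ.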
+### 5. Self-contained URSA runtime
+
+- `ursa_model/`
+  - Local copy of the URSA model stack needed by both the actor and the PRM.
+  - Includes config, processor, image processor, projector, vision backbones, and model definitions.
+  - This directory is what allows the current Stage 3 path to run without importing code from the external URSA-MATH repo.
+
+## Current Entry Points
+
+If you only care about the active Stage 3 reproduction path, the files you usually need are:
+
+- `run_grpo_math_prm_ursa_8b.sh`
+- `train_colocate.py`
+- `reward_models.py`
+- `reward_models_utils.py`
+- `prepare_ursa_stage3_manifest.py`
+- `check_hf_rollout.py`
+- `test_phase2_alignment.py`
+
+Everything else in this directory is either:
+
+- a one-off compatibility helper,
+- a smoke/observation tool,
+- or part of the self-contained URSA runtime.
+
 ## Local Resources
 
 The current machine layout is:
 
@@ -28,17 +28,116 @@ Implementation directory for migrating URSA-MATH Stage 3 PS-GRPO training to LightRFT.
 
 ```text
 examples/math_prm/
-├── prepare_ursa_stage3_manifest.py   # Convert URSA raw jsonl into a LightRFT prompt manifest
-├── train_colocate.py                 # Main training entry
-├── run_grpo_math_prm_ursa_8b.sh      # URSA-8B + URSA-RM-8B training script
-├── reward_models.py                  # Reward implementations, including MathPRMReward
-├── reward_models_utils.py            # Reward model loading and routing
-├── prm_infer_score.py                # Step-level PRM scoring logic
-├── test_reward_models.py             # Reward-side tests
-├── URSA_MIGRATION.md                 # Notes on the migration from URSA-MATH
+├── README.md                         # English guide to the current Stage 3 path
+├── README_zh.md                      # Chinese guide to this directory's structure
+├── URSA_MIGRATION.md                 # Notes on migrating from the original URSA-MATH repo to LightRFT
+├── train_colocate.py                 # Current main training entry
+├── run_grpo_math_prm_ursa_8b.sh      # Main script for the current Stage 3 reproduction experiment
+├── ursa_actor.py                     # Custom actor wrapper for the URSA policy model
+├── reward_models.py                  # Reward implementations; the current main path is MathPRMReward / PS-GRPO
+├── reward_models_utils.py            # Reward model loading, label routing, and reward_fn assembly
+├── prepare_ursa_stage3_manifest.py   # Convert raw MMathCoT-1M Stage 3 data into a LightRFT manifest
+├── prm_infer_score.py                # Step-level PRM scoring helper script
+├── prepare_ursa_engine_checkpoint.py # Helper that prepares a wrapper checkpoint for vLLM/SGLang experiments
+├── sitecustomize.py                  # Local compatibility patch entry for the current example stack
+├── check_phase2_alignment.py         # Phase 2 scoring alignment check script
+├── check_hf_rollout.py               # Minimal end-to-end check of local HF rollout
+├── check_phase6_script_alignment.py  # Checker for the Stage 3 launcher's default configuration
+├── test_phase2_alignment.py          # Regression tests for the current URSA Stage 3 path
+├── run_phase3_smoke.sh               # Time-boxed Phase 3 smoke run script
+├── run_phase7_observation.sh         # Launcher for bounded full-data observation
+├── analyze_phase7_observation.py     # Phase 7 offline analysis script
+├── probe_rollout_speed_candidates.py # Rollout speed comparison script that leaves library code unchanged
 └── ursa_model/                       # Self-contained URSA model code
 ```
 
+## File Roles
+
+### 1. Main training path
+
+- `run_grpo_math_prm_ursa_8b.sh`
+  - Main launch script for the current URSA-MATH Stage 3 reproduction experiment.
+  - Assembles the actor, reward, data, FSDP, W&B, and rollout arguments.
+- `train_colocate.py`
+  - The actual training entry invoked directly by `torchrun`.
+  - Loads the actor / reference / reward model / dataset / trainer and starts the LightRFT training loop.
+- `ursa_actor.py`
+  - URSA-specific actor wrapper.
+  - Makes LightRFT load the policy model with `UrsaForConditionalGeneration`.
+
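+### 2. Reward and scoring path
+
+- `reward_models.py`
+  - Holds every reward model implementation in this directory.
+  - The active ones today are `MathPRMReward` and the PS-GRPO reward mapping built on URSA-RM-8B.
+  - Some historical Qwen2VL multi-reward classes remain in the file, but they are no longer part of the current URSA-MATH Stage 3 main path.
+- `reward_models_utils.py`
+  - Handles reward model loading, the label-to-recipe mapping, and reward_fn assembly.
+  - The routing logic for `math_prm` / `math_psgrpo` lives here.
+- `prm_infer_score.py`
+  - Standalone helper script for step-level PRM scoring.
+  - Suited to single-point comparisons between LightRFT behavior and the URSA-MATH reference implementation.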
+
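+### 3. Data preparation and compatibility
+
+- `prepare_ursa_stage3_manifest.py`
+  - Converts raw `MMathCoT-1M` Stage 3 data into the `prompt / images / reference / label` schema that LightRFT requires.
+  - Also runs a lightweight dataset/collate smoke check.
+- `prepare_ursa_engine_checkpoint.py`
+  - Not required by the current `hf` main line, but still used for engine experiments.
+  - Builds a wrapper checkpoint carrying the local `ursa_model` code and `auto_map` metadata so that vLLM/SGLang can attempt to load URSA.
+- `sitecustomize.py`
+  - Provides local runtime compatibility patches for the example directory under the current frozen Docker baseline.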
+
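+### 4. Validation, smoke, and observation tools
+
+- `check_phase2_alignment.py`
+  - Checks whether LightRFT's `MathPRMReward` stays consistent with the URSA reference scorer.
+- `check_hf_rollout.py`
+  - Minimal end-to-end check of local `hf` rollout.
+  - Compares the output of `gather_and_generate()` against a direct `actor.generate()` call.
+- `check_phase6_script_alignment.py`
+  - Statically checks whether the current Stage 3 launcher defaults are still aligned.
+- `test_phase2_alignment.py`
+  - Regression test suite for the current URSA Stage 3 main path.
+  - Covers scorer alignment, reward mapping, answer extraction, rollout helper logic, and more.
+- `run_phase3_smoke.sh`
+  - Time-boxed Phase 3 smoke run script.
+  - Verifies that training starts normally, that metrics look reasonable, and that GPUs are fully released afterward.
+- `run_phase7_observation.sh`
+  - Launcher for bounded full-data observation.
+- `analyze_phase7_observation.py`
+  - Runs offline analysis on the trajectories and training logs saved during Phase 7.
+  - Computes the health checklist and the PRM image-ablation results.
+- `probe_rollout_speed_candidates.py`
+  - Minimal timing script that compares the speed of several rollout-like decode modes without modifying the `lightrft/` main path.
+  - Currently used mainly to confirm that `gradient_checkpointing` is the main cause of the rollout speed problem.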
+
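+### 5. Self-contained URSA runtime
+
+- `ursa_model/`
+  - Local copy of the URSA model stack.
+  - Includes the config, processor, image processor, projector, vision tower, and model definitions.
+  - This is what lets the current Stage 3 path run directly, without the external URSA-MATH repo.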
+
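+## Entry Points That Currently Matter
+
+If you only care about the current URSA-MATH Stage 3 reproduction main line, these are usually the only files to focus on: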
+
+- `run_grpo_math_prm_ursa_8b.sh`
+- `train_colocate.py`
+- `reward_models.py`
+- `reward_models_utils.py`
+- `prepare_ursa_stage3_manifest.py`
+- `check_hf_rollout.py`
+- `test_phase2_alignment.py`
+
+Most other files in this directory are:
+
+- compatibility helper scripts,
+- smoke / observation / profiling tools,
+- or self-contained URSA runtime code.
+
 ## Local Resource Paths
 
 The key resource paths on the current machine are: