@@ -113,27 +113,14 @@ Example (edit model & TP according to your GPUs):
 
 ```bash
 trl vllm-serve \
-  --model Qwen/Qwen2.5-Math-1.5B \
+  --model cbyzju/LaPHA-Math-7B-Instruct \
   --host 0.0.0.0 \
   --port 8000 \
   --tensor-parallel-size 1 \
   --max-model-len 4096
 ```
 
 ### 2) Run eval
-
-**Single-turn (no search):**
-
-```bash
-ENGINE=vllm BASE_URL=http://localhost:8000 \
-TOKENIZER_PATH=Qwen/Qwen2.5-Math-1.5B \
-MODE=single \
-bash eval.sh math
-```
-
-**Value-guided MCTS (recommended):**
-You need a value head checkpoint (`.pt`) and a base LM path for the value function.
-
 ```bash
 ENGINE=vllm BASE_URL=http://localhost:8000 \
 TOKENIZER_PATH=/path/to/policy_model \
@@ -149,48 +136,3 @@ Outputs:
 
 * rollouts: `eval/rollouts/*.pred.jsonl`
 * scores: `eval/results/*.csv` (and logs under `eval/logs/`)
-
----
-
-## Training (LaPha RL)
-
-The training entry point is `run_dapo.py` (configured by `lapha.yaml`).
-
-### 1) Prepare training data
-
-The current dataloader (`helpers/math_dapo.py::dataloader`) expects a **Parquet** in a DAPO-like format
-(e.g., columns `prompt` and `reward_model` containing ground-truth).
-
-Update the dataset path inside `run_dapo.py`:
-
-```python
-train_dataset = dataloader_dapo('YOUR_DATA_PATH/train.parquet').shuffle()
-```
-
-### 2) Start vLLM + tool server
-
-* Start vLLM server (same as eval)
-* Start `rpc_python_server.py` (same as eval)
-
-### 3) Launch training
-
-```bash
-bash run_dapo.sh
-# or:
-accelerate launch --config_file deepspeed_zero3.yaml run_dapo.py --config lapha.yaml
-```
-
-Checkpoints are saved under `output_dir` in `lapha.yaml`.
-
-### 4) (Optional) Split value head for serving
-
-If your checkpoint directory contains a wrapper model that includes both policy + value head, use:
-
-```bash
-python helpers/split_valuehead.py \
-  --src /path/to/checkpoint-XX \
-  --out-policy /path/to/policy_model_ckptXX \
-  --out-vhead /path/to/value_head_ckptXX.pt \
-  --copy-tokenizer \
-  --trust-remote-code
-```