Commit 4378b8a

docs: grpo-chtc-debugging-log — 11 issues hit during week 4 task 5, in order
1 parent f91e94b commit 4378b8a

1 file changed

Lines changed: 327 additions & 0 deletions

docs/grpo-chtc-debugging-log.md

@@ -0,0 +1,327 @@
# GRPO + verl + CHTC — debugging log

> Iterative log of every issue we hit getting `verl.trainer.main_ppo` to
> actually run a smoke GRPO step on CHTC, in order of occurrence. Read top
> to bottom to see the failure cascade; each fix unlocks the next layer.
> Spec: Qwen-2.5-Coder-1.5B + LoRA-merged SFT warm-start, single A100,
> verlai/verl:vllm012.latest container.

## Setup that was already working before this list

By Week 4 Task 4 we already had:
- HTCondor submit + execute scripts (`chtc/train_grpo.{sub,sh}`)
- `USER` / `LOGNAME` exported (no `/etc/passwd` entry in container)
- UUID-form `CUDA_VISIBLE_DEVICES` remap to integer index
- Tarball-of-repo flow from `/staging/`
- Resilient tar pattern (always emit `checkpoint.tar.gz`, even on training failure, so HTCondor's output transfer never holds the job)
- Merge LoRA adapter → fresh `merged_<name>` dir as starting policy
- Pre-warm of evalplus datasets

These came from Weeks 2 and 3 and are documented in their summary files.

---

## 1 — `ModuleNotFoundError: No module named 'verl'`

**Symptom**: container started, code transferred, `pip install -e .[dev,gpu]`
ran fine, then `python -m verl.trainer.main_ppo` failed at import.

**Root cause**: `verlai/verl:vllm012.latest` is a *base image*: it has
verl's dependencies (torch, vllm, ray) but not verl itself. The image
name is misleading.

**Fix**: clone + pip install verl inside the job, pinned to the same
version `llm-starter` uses.

```bash
git clone https://github.com/volcengine/verl.git -b v0.7.0 --depth 1
pip install -e verl --quiet
```

**Lesson**: never assume an image with a project's name has the project
*installed*. `docker run --rm <image> python -c "import <pkg>"` settles
it in 30 seconds.
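
The same check can also run at the top of the job, before any expensive setup. A minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not something verl or the image provides:

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on sys.path.

    find_spec locates the package without importing it, so the check stays
    cheap even for heavy packages such as torch or vllm.
    """
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Package names taken from the root-cause note above.
    for name in ("torch", "vllm", "ray", "verl"):
        print(f"{name}: {'ok' if is_installed(name) else 'MISSING'}")
```

Failing fast on a `MISSING` line costs a few milliseconds, versus discovering the gap after the code tarball has already been unpacked and installed.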

---

## 2 — Hydra: "Primary config directory not found"

**Symptom**: running with `--config-path=configs --config-name=grpo_qwen1_5b`
failed with `Primary config directory not found. Check that the config
directory '.../verl/verl/trainer/configs' exists`.

**Root cause**: Hydra resolves a relative `--config-path` **against the
entry point's source dir**, not the cwd. The entry point was
`verl.trainer.main_ppo`, so Hydra looked for `verl/trainer/configs/`,
that is, inside verl's source tree.

**Fix**: pass an absolute path computed at runtime.

```bash
CONFIG_PATH_ABS="$(pwd)/configs"
python -m verl.trainer.main_ppo --config-path="${CONFIG_PATH_ABS}" ...
```

**Lesson**: Hydra's `--config-path` semantics differ from most CLI tools.
For invocations of third-party Hydra apps, always pass absolute paths.
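
The resolution rule can be modeled in a few lines. This is a simplified sketch of the behavior described above, not Hydra's actual implementation; `resolve_config_path` and the `/job/...` paths are hypothetical:

```python
from pathlib import PurePosixPath


def resolve_config_path(config_path: str, entry_point_file: str) -> PurePosixPath:
    """Model how a --config-path value gets anchored (simplified).

    Relative paths are joined to the directory of the entry point's source
    file; absolute paths are used unchanged.
    """
    path = PurePosixPath(config_path)
    if path.is_absolute():
        return path
    return PurePosixPath(entry_point_file).parent / path


# Hypothetical location of verl's entry point inside the job directory.
entry = "/job/verl/verl/trainer/main_ppo.py"
print(resolve_config_path("configs", entry))       # /job/verl/verl/trainer/configs
print(resolve_config_path("/job/configs", entry))  # /job/configs
```

The first call reproduces the directory named in the error message; the second shows why an absolute path sidesteps the problem regardless of where the entry point lives.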

---

## 3, 4, 5 — `omegaconf.errors.ConfigAttributeError: Key '...' is not in struct`

**Symptoms** (three rounds, each blocking the next):
- `Key 'ray_kwargs' is not in struct`
- `Key 'transfer_queue' is not in struct`
- `Key 'global_profiler' is not in struct`

**Root cause**: verl 0.7's config is in struct mode (strict schema). It
expects ~50 top-level keys. Our minimal YAML only had a handful.

**Fix #1 (additive, then abandoned)**: add each missing key by hand. After
three rounds it was clear that many more would follow.

**Fix #2 (one-shot)**: Hydra inheritance from verl's full default config.

```yaml
# In configs/grpo_qwen1_5b.yaml:
defaults:
  - ppo_trainer
  - _self_
```

```bash
# In train_grpo.sh:
VERL_CONFIG_DIR="$(pwd)/verl/verl/trainer/config"
python -m verl.trainer.main_ppo \
  --config-path="${CONFIG_PATH_ABS}" \
  --config-name="${CONFIG_NAME}" \
  "hydra.searchpath=[${VERL_CONFIG_DIR}]" \
  ...
```

**Lesson**: when a third-party tool's strict-mode config has many fields,
inherit its full default config rather than cherry-picking what looks
relevant. Two errors of the same shape are the signal to switch to inheritance.
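
What inheritance buys can be illustrated with plain dictionaries. The sketch below is a toy model of "defaults first, then `_self_` overrides", not Hydra's or OmegaConf's API; `deep_merge` is a hypothetical helper and the empty sub-dicts stand in for verl's real default values:

```python
from copy import deepcopy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for the upstream default config (contents are placeholders).
upstream_defaults = {
    "ray_kwargs": {},
    "transfer_queue": {},
    "global_profiler": {},
    "trainer": {"max_actor_ckpt_to_keep": None},
}

# Our YAML only states what we deliberately customize.
our_overrides = {"trainer": {"max_actor_ckpt_to_keep": 2}}

config = deep_merge(upstream_defaults, our_overrides)
inherited = sorted(k for k in config if k not in our_overrides)
print(inherited)  # ['global_profiler', 'ray_kwargs', 'transfer_queue']
```

Every key the trainer later reads is present because it came from the defaults, so the three "not in struct" errors cannot recur one at a time.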

---

## 6 — `TypeError: 'NoneType' object is not subscriptable` on val data

**Symptom**: verl tried to `copy_to_local(src=...)` on the val parquet
where `src` was `None`. Our config had `data.val_files: null`.

**Root cause**: verl always *loads* a val dataset, even when validation
itself is disabled (`test_freq: -1`, `val_before_train: false`).

**Fix**: point `data.val_files` at the train parquet. verl loads it but
never iterates because validation is disabled.

```yaml
data:
  train_files: results/grpo_dataset/v1/train.parquet
  val_files: results/grpo_dataset/v1/train.parquet  # placeholder; never iterated
```

**Lesson**: with strict-mode tools, "disabled" features can still require
non-null inputs. Read the code path, not just the docs.
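
A small preflight can catch this class of problem before the job is submitted. A hedged sketch: `check_data_files` is a hypothetical helper operating on an already-loaded `data` section as a plain dict; it is not part of verl:

```python
from pathlib import Path


def check_data_files(data_cfg: dict, keys=("train_files", "val_files")) -> list:
    """Return a list of problems with dataset paths in the `data` section."""
    problems = []
    for key in keys:
        value = data_cfg.get(key)
        if value is None:
            # A null path is what produced the TypeError in this section.
            problems.append(f"data.{key} is null")
        elif not Path(value).is_file():
            problems.append(f"data.{key} does not exist: {value}")
    return problems


print(check_data_files({"train_files": None, "val_files": None}))
# ['data.train_files is null', 'data.val_files is null']
```

An empty list means both paths are set and point at real files.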

---

## 7 — `ValueError: Load format 'dummy_dtensor' is not supported`

**Symptom**: vLLM rollout engine refused to start.

**Root cause**: `load_format: dummy_dtensor` is a verl-specific name that
worked with older vLLM versions. The newer vLLM in our image doesn't
recognize it.

**Fix**: drop the line entirely and let verl's default (inherited from
`ppo_trainer.yaml` via Hydra) apply.

```yaml
rollout:
  # load_format inherited; modern vLLM rejects "dummy_dtensor"
  ...
```

**Lesson**: when in doubt, omit explicit values and let inheritance from
the upstream's defaults win. Override only what we deliberately customize.
147+
148+
---
149+
150+
## 8 — `ImportError: attempted relative import with no known parent package`
151+
152+
**Symptom**: verl imported our reward function and crashed at
153+
`from ..agents.verifier import SubprocessVerifier`.
154+
155+
**Root cause**: verl's `load_module(path)` loads our `grpo_reward.py` *by
156+
file path*, outside any package context. Relative imports
157+
(`from ..agents...`) need a parent package.
158+
159+
**Fix**: switch to absolute imports — they always resolve via `sys.path`,
160+
regardless of how the module was loaded.
161+
162+
```python
163+
# Before
164+
from ..agents.verifier import SubprocessVerifier
165+
166+
# After
167+
from verifiable_rl_coder.agents.verifier import SubprocessVerifier
168+
```
169+
170+
**Lesson**: any module a third-party tool loads via `importlib.util.spec_from_file_location`
171+
or similar must use absolute imports. Relative imports are an internal
172+
package convention, not portable.
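
The failure is easy to reproduce outside verl. The sketch below loads a throwaway file by path with the standard library; `load_by_path` and `try_load` are hypothetical helpers, and this only approximates what verl's loader does:

```python
import importlib.util
import tempfile
from pathlib import Path


def load_by_path(path: str, name: str = "reward_fn"):
    """Load a .py file by path, outside any package context."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def try_load(source: str) -> str:
    """Write `source` to a temp file, load it by path, report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "reward_fn.py"
        path.write_text(source)
        try:
            load_by_path(str(path))
            return "ok"
        except ImportError as exc:
            return f"ImportError: {exc}"


print(try_load("from ..agents import verifier\n"))  # relative import: fails
print(try_load("import json\n"))                    # absolute import: loads
```

Because the module is created with a top-level name and no parent package, the relative import has nothing to resolve against, while the absolute import goes through the normal `sys.path` lookup.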
173+
174+
---
175+
176+
## 9 — GPU OOM (worker SIGKILLed mid-rollout)
177+
178+
**Symptom**: `Worker unexpectedly exits with a connection error code 2.
179+
End of file. ... process killed by the OOM killer`.
180+
181+
**Root cause**: 1.5B model + Adam optimizer state (12 GB) + reference
182+
policy (3 GB) + vLLM rollout cache (60% of 80 GB = 48 GB) + activations
183+
≈ 84 GB. Single A100 has 80 GB. Tight.
184+
185+
**Fix**: enable CPU offloading for non-rollout-critical state.
186+
187+
```yaml
188+
actor_rollout_ref:
189+
actor:
190+
fsdp_config:
191+
param_offload: true # was false
192+
optimizer_offload: true # was false
193+
rollout:
194+
gpu_memory_utilization: 0.4 # was 0.6 — gives back ~16 GB to training
195+
ref:
196+
fsdp_config:
197+
param_offload: true # was false
198+
```
199+
200+
This frees ~30 GB on the GPU (Adam state + ref policy go to CPU; vLLM
201+
gives back rollout-cache memory).
202+
203+
**Lesson**: GPU OOM at the policy-training boundary is normal — every
204+
RL framework has this dance. Offloading is the standard remedy.
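
The arithmetic behind those figures can be checked in a few lines. A sketch under one stated assumption: Adam keeps two fp32 moment tensors (8 bytes per parameter), which matches the 12 GB quoted for 1.5B parameters. The function names are hypothetical:

```python
GPU_GB = 80.0          # single A100, as in the log
REF_POLICY_GB = 3.0    # reference policy size quoted in the log


def adam_state_gb(params_billion: float) -> float:
    """Adam state in GB, assuming two fp32 moments (8 bytes) per parameter."""
    return params_billion * 8.0


def vllm_reservation_gb(utilization: float, gpu_gb: float = GPU_GB) -> float:
    """GPU memory vLLM reserves for a given gpu_memory_utilization."""
    return utilization * gpu_gb


before = vllm_reservation_gb(0.6)
after = vllm_reservation_gb(0.4)
freed_gb = adam_state_gb(1.5) + REF_POLICY_GB + (before - after)

print(f"Adam state:       {adam_state_gb(1.5):.0f} GB")
print(f"vLLM reservation: {before:.0f} GB -> {after:.0f} GB")
print(f"Freed on GPU:     ~{freed_gb:.0f} GB")
```

The total comes to about 31 GB, in line with the "~30 GB" estimate, and the same `adam_state_gb` helper gives 56 GB for a 7B model.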
205+
206+
---
207+
208+
## 10 — CPU OOM at 32 GB host (Ray worker killed)
209+
210+
**Symptom**: `OutOfMemoryError: Task was killed due to the node running
211+
low on memory. Memory on the node was 30.91GB / 32.00GB`.
212+
213+
**Root cause**: previous fix moved Adam + ref to CPU. One Ray worker now
214+
holds ~16 GB. Plus Ray's object store reservation (~30% of host ≈ 10 GB),
215+
plus vLLM CPU side, plus 8 idle Ray workers. Easily 30+ GB. Submit file
216+
asked for 32 GB.
217+
218+
**Fix**: bump `request_memory` in submit file. After two iterations:
219+
220+
```
221+
request_cpus = 4 # was 8 — fewer Ray workers, less RAM
222+
request_memory = 96GB # was 32 → 64 → 96
223+
```
224+
225+
**Lesson**: when you turn on CPU offload, host memory budget needs to
226+
absorb everything you just freed from GPU. Account for Ray's object
227+
store reservation (default ~30% of total) on top.
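
A rough headroom estimate makes the three `request_memory` attempts easier to compare. This sketch uses only the figures quoted in this section (a ~16 GB worker and a ~30% object store reservation); `host_headroom_gb` is a hypothetical helper, and the vLLM CPU side and idle workers are deliberately left out because the log does not quantify them:

```python
def host_headroom_gb(
    request_gb: float,
    worker_gb: float = 16.0,
    object_store_frac: float = 0.30,
) -> float:
    """Host RAM left after the offloaded worker and Ray's object store."""
    return request_gb - worker_gb - object_store_frac * request_gb


for request in (32, 64, 96):
    print(f"request_memory={request}GB -> ~{host_headroom_gb(request):.1f} GB headroom")
```

At 32 GB only about 6.4 GB remains for everything else, which is consistent with the kill at 30.91 GB; at 96 GB the remainder is about 51 GB. Note that the object store reservation grows with the request, so headroom rises more slowly than `request_memory`.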
228+
229+
---
230+
231+
## 11 — Over-provisioned disk (200 GB)
232+
233+
**Symptom**: my first guess. User pushed back: "I'm only training 1.5B,
234+
why 200 GB?"
235+
236+
**Root cause**: I sized for full run (10 checkpoints × 15 GB each = 150 GB)
237+
without checking whether checkpoint pruning was on.
238+
239+
**Fix**: enable checkpoint pruning in verl config; lower disk request.
240+
241+
```yaml
242+
trainer:
243+
max_actor_ckpt_to_keep: 2 # only keep 2 most recent
244+
```
245+
246+
```
247+
request_disk = 80GB # was 200GB
248+
```
249+
250+
Disk math: base+merged (6) + HF cache (3) + 2 checkpoints capped (30) +
251+
ray spills (5) + buffer (10) ≈ 55 GB. 80 GB is enough.
252+
253+
**Lesson**: always ask "what does this resource ask buy me?" before
254+
defaulting upward. A user reviewing config is a sanity check; honor it.
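
The disk budget is simple enough to keep as a script next to the submit file. A sketch using the per-item figures from the disk math above; `checkpoint_disk_gb` and the dictionary layout are hypothetical:

```python
CHECKPOINT_GB = 15  # per-checkpoint size quoted in the root cause


def checkpoint_disk_gb(num_saves: int, keep=None) -> int:
    """Disk used by checkpoints; `keep` mirrors max_actor_ckpt_to_keep."""
    kept = num_saves if keep is None else min(num_saves, keep)
    return kept * CHECKPOINT_GB


# Non-checkpoint items from the disk math, in GB.
fixed_costs_gb = {"base+merged": 6, "HF cache": 3, "ray spills": 5, "buffer": 10}

pruned_total = sum(fixed_costs_gb.values()) + checkpoint_disk_gb(10, keep=2)
unpruned_total = sum(fixed_costs_gb.values()) + checkpoint_disk_gb(10)

print(f"with pruning:    {pruned_total} GB")    # 54 GB
print(f"without pruning: {unpruned_total} GB")  # 174 GB
```

With pruning the run fits in the 80 GB request with room to spare; without it, the same run would need roughly 174 GB.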

---

## Final working sizing (1.5B GRPO with offloading)

```
request_cpus = 4
request_memory = 96GB
request_disk = 80GB
request_gpus = 1
require_gpus = (GlobalMemoryMb >= 40000) && (Capability >= 8.0)
# "medium" allows up to 24h
+GPUJobLength = "medium"
```

```yaml
actor_rollout_ref:
  actor:
    fsdp_config:
      param_offload: true
      optimizer_offload: true
  rollout:
    gpu_memory_utilization: 0.4
  ref:
    fsdp_config:
      param_offload: true

trainer:
  max_actor_ckpt_to_keep: 2
```

Smoke run (50 steps): ~1.5 hours wall clock from job-start to checkpoint.

---

## Rules of thumb banked from this exercise

1. **Hydra `--config-path` is relative to the entry point's source.** Use
   absolute paths when invoking third-party Hydra apps.
2. **Inherit from upstream's defaults** rather than handcrafting
   strict-mode configs key-by-key. Two same-shape errors = pivot to inheritance.
3. **Absolute imports inside any module a third-party loads by path.**
4. **"Disabled" features can still need valid inputs.** Read the
   code path, not just the docstring.
5. **GPU OOM remedy is offload-then-grow-host-RAM.** Account for Ray's
   object store (~30% of host RAM by default) when sizing the host.
6. **Size resources from math, not guesses. User pushback is a real review.**
   Cite math when asking for compute, and revisit when challenged.
7. **Checkpoint pruning is non-optional** for any RL run that lasts more
   than a few save_freq cycles. `max_actor_ckpt_to_keep: 2` is a sane default.
8. **Smoke first, full second.** Every fix cycle is a test of the smoke
   pipeline. Burning a 10-hour full run on a misconfigured pipeline is
   the only kind of waste worth fearing.

## Generalizing for 7B (estimate, untested)

```
request_cpus = 8
# ~1.7x the 1.5B request; Adam state alone is ~56 GB
request_memory = 160GB
# checkpoints are ~50 GB each
request_disk = 160GB
request_gpus = 4
```

```yaml
actor_rollout_ref:
  rollout:
    gpu_memory_utilization: 0.5  # tighter at 7B
    tensor_model_parallel_size: 2
  actor:
    fsdp_config:
      optimizer_offload: true  # mandatory at 7B
```

Bump and verify when we get there in Week 5.
