|
| 1 | +# GRPO + verl + CHTC — debugging log |
| 2 | + |
| 3 | +> Iterative log of every issue we hit getting `verl.trainer.main_ppo` to |
| 4 | +> actually run a smoke GRPO step on CHTC, in order of occurrence. Read top |
| 5 | +> to bottom to see the failure cascade; each fix unlocks the next layer. |
| 6 | +> Spec: Qwen-2.5-Coder-1.5B + LoRA-merged SFT warm-start, single A100, |
| 7 | +> verlai/verl:vllm012.latest container. |
| 8 | +
|
| 9 | +## Setup that was already working before this list |
| 10 | + |
| 11 | +By Week 4 Task 4 we already had: |
| 12 | +- HTCondor submit + execute scripts (`chtc/train_grpo.{sub,sh}`) |
| 13 | +- `USER` / `LOGNAME` exported (no `/etc/passwd` entry in container) |
| 14 | +- UUID-form `CUDA_VISIBLE_DEVICES` remap to integer index |
| 15 | +- Tarball-of-repo flow from `/staging/` |
| 16 | +- Resilient tar pattern (always emit `checkpoint.tar.gz`, even on training failure, so HTCondor's output transfer never holds the job) |
| 17 | +- Merge LoRA adapter → fresh `merged_<name>` dir as starting policy |
| 18 | +- Pre-warm of evalplus datasets |
| 19 | + |
| 20 | +These came from Weeks 2 and 3 and are documented in their summary files. |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## 1 — `ModuleNotFoundError: No module named 'verl'` |
| 25 | + |
| 26 | +**Symptom**: container started, code transferred, `pip install -e .[dev,gpu]` |
| 27 | +ran fine, then `python -m verl.trainer.main_ppo` failed at import. |
| 28 | + |
| 29 | +**Root cause**: `verlai/verl:vllm012.latest` is a *base image* — it has |
| 30 | +verl's dependencies (torch, vllm, ray) but not verl itself. The image |
| 31 | +name is misleading. |
| 32 | + |
| 33 | +**Fix**: clone + pip install verl inside the job, pinned to the same |
| 34 | +version `llm-starter` uses. |
| 35 | + |
| 36 | +```bash |
| 37 | +git clone https://github.com/volcengine/verl.git -b v0.7.0 --depth 1 |
| 38 | +pip install -e verl --quiet |
| 39 | +``` |
| 40 | + |
| 41 | +**Lesson**: never assume an image with a project's name has the project |
| 42 | +*installed*. `docker run --rm <image> python -c "import <pkg>"` settles |
| 43 | +it in 30 seconds. |
| 44 | + |
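The same import check can also run at the top of the job, before any expensive setup. A minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, not something verl or the image provides:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names whose parent package is missing,
        # or for an already-loaded module with a broken __spec__.
        return False


if __name__ == "__main__":
    # Packages the job expects; report which ones the image really ships.
    for name in ("torch", "vllm", "ray", "verl"):
        print(f"{name}: {'ok' if is_importable(name) else 'MISSING'}")
```

Failing fast on a `MISSING` line costs a few seconds, compared with finding out only after the code transfer and the pip install.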
| 45 | +--- |
| 46 | + |
| 47 | +## 2 — Hydra: "Primary config directory not found" |
| 48 | + |
| 49 | +**Symptom**: `--config-path=configs --config-name=grpo_qwen1_5b` printed |
| 50 | +the error `Primary config directory not found. Check that the config |
| 51 | +directory '.../verl/verl/trainer/configs' exists`. |
| 52 | + |
| 53 | +**Root cause**: Hydra resolves `--config-path` **relative to the entry |
| 54 | +point's source dir**, not the cwd. Entry point was `verl.trainer.main_ppo`, |
| 55 | +so Hydra looked for `verl/trainer/configs/` — i.e., inside verl's source. |
| 56 | + |
| 57 | +**Fix**: pass an absolute path computed at runtime. |
| 58 | + |
| 59 | +```bash |
| 60 | +CONFIG_PATH_ABS="$(pwd)/configs" |
| 61 | +python -m verl.trainer.main_ppo --config-path="${CONFIG_PATH_ABS}" ... |
| 62 | +``` |
| 63 | + |
| 64 | +**Lesson**: Hydra's `--config-path` semantics differ from most CLI tools. |
| 65 | +For invocations of third-party Hydra apps, always pass absolute paths. |
| 66 | + |
| 67 | +--- |
| 68 | + |
| 69 | +## 3, 4, 5 — `omegaconf.errors.ConfigAttributeError: Key '...' is not in struct` |
| 70 | + |
| 71 | +**Symptoms** (three rounds, each blocking the next): |
| 72 | +- `Key 'ray_kwargs' is not in struct` |
| 73 | +- `Key 'transfer_queue' is not in struct` |
| 74 | +- `Key 'global_profiler' is not in struct` |
| 75 | + |
| 76 | +**Root cause**: verl 0.7's config is in struct mode (strict schema). It |
| 77 | +expects ~50 top-level keys. Our minimal YAML only had a handful. |
| 78 | + |
| 79 | +**Fix #1 (additive, then abandoned)**: add each missing key. After three |
| 80 | +rounds this was clearly heading toward N more. |
| 81 | + |
| 82 | +**Fix #2 (one-shot)**: Hydra inheritance from verl's full default config. |
| 83 | + |
| 84 | +```yaml |
| 85 | +# In configs/grpo_qwen1_5b.yaml: |
| 86 | +defaults: |
| 87 | + - ppo_trainer |
| 88 | + - _self_ |
| 89 | +``` |
| 90 | +
|
| 91 | +```bash |
| 92 | +# In train_grpo.sh: |
| 93 | +VERL_CONFIG_DIR="$(pwd)/verl/verl/trainer/config" |
| 94 | +python -m verl.trainer.main_ppo \ |
| 95 | + --config-path="${CONFIG_PATH_ABS}" \ |
| 96 | + --config-name="${CONFIG_NAME}" \ |
| 97 | + "hydra.searchpath=[${VERL_CONFIG_DIR}]" \ |
| 98 | + ... |
| 99 | +``` |
| 100 | + |
| 101 | +**Lesson**: when a third-party tool's strict-mode config has many fields, |
| 102 | +inherit its full default config rather than cherry-picking what looks |
| 103 | +relevant. Two errors of the same shape = signal to switch to inheritance. |
| 104 | + |
| 105 | +--- |
| 106 | + |
| 107 | +## 6 — `TypeError: 'NoneType' object is not subscriptable` on val data |
| 108 | + |
| 109 | +**Symptom**: verl tried to `copy_to_local(src=...)` on the val parquet |
| 110 | +where `src` was `None`. Our config had `data.val_files: null`. |
| 111 | + |
| 112 | +**Root cause**: verl always *loads* a val dataset, even when validation |
| 113 | +itself is disabled (`test_freq: -1`, `val_before_train: false`). |
| 114 | + |
| 115 | +**Fix**: point `data.val_files` at the train parquet. Verl loads it but |
| 116 | +never iterates because validation is disabled. |
| 117 | + |
| 118 | +```yaml |
| 119 | +data: |
| 120 | + train_files: results/grpo_dataset/v1/train.parquet |
| 121 | + val_files: results/grpo_dataset/v1/train.parquet # placeholder; never iterated |
| 122 | +``` |
| 123 | +
|
| 124 | +**Lesson**: with strict-mode tools, "disabled" features can still require |
| 125 | +non-null inputs. Read the code path, not just the docs. |
| 126 | +
|
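A cheap preflight can catch this class of problem before the job is submitted. The sketch below works on a plain nested dict; `find_null_keys` and the list of required keys are hypothetical, chosen for illustration, and are not part of verl:

```python
from typing import Any, Iterable


def find_null_keys(cfg: dict[str, Any], dotted_keys: Iterable[str]) -> list[str]:
    """Return the dotted keys that are missing or None in a nested dict."""
    missing = []
    for dotted in dotted_keys:
        node: Any = cfg
        for part in dotted.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            missing.append(dotted)
    return missing


# Config shaped like the failing one: validation disabled, val_files null.
failing_cfg = {
    "data": {
        "train_files": "results/grpo_dataset/v1/train.parquet",
        "val_files": None,
    },
    "trainer": {"test_freq": -1, "val_before_train": False},
}

# Keys we assume must be non-null even when validation is off.
REQUIRED = ["data.train_files", "data.val_files"]
print(find_null_keys(failing_cfg, REQUIRED))
```

For the failing config this reports `data.val_files`. Once that key points at the train parquet, the list comes back empty.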
| 127 | +--- |
| 128 | +
|
| 129 | +## 7 — `ValueError: Load format 'dummy_dtensor' is not supported` |
| 130 | + |
| 131 | +**Symptom**: vLLM rollout engine refused to start. |
| 132 | + |
| 133 | +**Root cause**: `load_format: dummy_dtensor` was a verl-specific name from |
| 134 | +older vLLM versions. Modern vLLM (in our image) doesn't recognize it. |
| 135 | + |
| 136 | +**Fix**: drop the line entirely. Let verl's default value (inherited from |
| 137 | +`ppo_trainer.yaml` via Hydra) pick a compatible value. |
| 138 | + |
| 139 | +```yaml |
| 140 | +rollout: |
| 141 | + # load_format inherited — modern vLLM rejects "dummy_dtensor" |
| 142 | + ... |
| 143 | +``` |
| 144 | + |
| 145 | +**Lesson**: when in doubt, omit explicit values and let inheritance from |
| 146 | +the upstream's defaults win. Override only what we deliberately customize. |
| 147 | + |
| 148 | +--- |
| 149 | + |
| 150 | +## 8 — `ImportError: attempted relative import with no known parent package` |
| 151 | + |
| 152 | +**Symptom**: verl imported our reward function and crashed at |
| 153 | +`from ..agents.verifier import SubprocessVerifier`. |
| 154 | + |
| 155 | +**Root cause**: verl's `load_module(path)` loads our `grpo_reward.py` *by |
| 156 | +file path*, outside any package context. Relative imports |
| 157 | +(`from ..agents...`) need a parent package. |
| 158 | + |
| 159 | +**Fix**: switch to absolute imports — they always resolve via `sys.path`, |
| 160 | +regardless of how the module was loaded. |
| 161 | + |
| 162 | +```python |
| 163 | +# Before |
| 164 | +from ..agents.verifier import SubprocessVerifier |
| 165 | +
|
| 166 | +# After |
| 167 | +from verifiable_rl_coder.agents.verifier import SubprocessVerifier |
| 168 | +``` |
| 169 | + |
| 170 | +**Lesson**: any module a third-party tool loads via `importlib.util.spec_from_file_location` |
| 171 | +or similar must use absolute imports. Relative imports are an internal |
| 172 | +package convention, not portable. |
| 173 | + |
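The failure is easy to reproduce outside verl. The sketch below builds a throwaway package (`demo_rl_pkg`, a hypothetical stand-in for `verifiable_rl_coder`) and loads two reward files by path with `importlib.util.spec_from_file_location`. Whether verl's `load_module` uses exactly this call is an assumption; the import behaviour is the same for any loader that gives the module no parent package.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_by_path(name: str, path: Path):
    """Load a file as a top-level module, i.e. with no parent package."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_demo() -> dict:
    """Load a relative-import and an absolute-import file by path."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = root / "demo_rl_pkg"
        (pkg / "agents").mkdir(parents=True)
        (pkg / "training").mkdir()
        for directory in (pkg, pkg / "agents", pkg / "training"):
            (directory / "__init__.py").write_text("")
        (pkg / "agents" / "verifier.py").write_text(
            "class SubprocessVerifier: pass\n"
        )
        rel = pkg / "training" / "reward_rel.py"
        rel.write_text("from ..agents.verifier import SubprocessVerifier\n")
        ab = pkg / "training" / "reward_abs.py"
        ab.write_text("from demo_rl_pkg.agents.verifier import SubprocessVerifier\n")

        sys.path.insert(0, str(root))  # stands in for `pip install -e .`
        try:
            for label, path in (("relative", rel), ("absolute", ab)):
                try:
                    load_by_path(f"loaded_{label}", path)
                    results[label] = "ok"
                except ImportError as exc:
                    results[label] = f"ImportError: {exc}"
        finally:
            # Undo the sys.path and sys.modules changes made by the demo.
            sys.path.remove(str(root))
            for mod in [m for m in sys.modules if m.startswith("demo_rl_pkg")]:
                del sys.modules[mod]
    return results


print(run_demo())
```

The relative variant fails with "attempted relative import with no known parent package", while the absolute variant loads because the package root is on `sys.path`.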
| 174 | +--- |
| 175 | + |
| 176 | +## 9 — GPU OOM (worker SIGKILLed mid-rollout) |
| 177 | + |
| 178 | +**Symptom**: `Worker unexpectedly exits with a connection error code 2. |
| 179 | +End of file. ... process killed by the OOM killer`. |
| 180 | + |
| 181 | +**Root cause**: 1.5B model + Adam optimizer state (12 GB) + reference |
| 182 | +policy (3 GB) + vLLM rollout cache (60% of 80 GB = 48 GB) + activations |
| 183 | +≈ 84 GB. A single A100 has 80 GB, so the estimate is ~4 GB over budget. |
| 184 | + |
| 185 | +**Fix**: enable CPU offloading for non-rollout-critical state. |
| 186 | + |
| 187 | +```yaml |
| 188 | +actor_rollout_ref: |
| 189 | + actor: |
| 190 | + fsdp_config: |
| 191 | + param_offload: true # was false |
| 192 | + optimizer_offload: true # was false |
| 193 | + rollout: |
| 194 | + gpu_memory_utilization: 0.4 # was 0.6 — gives back ~16 GB to training |
| 195 | + ref: |
| 196 | + fsdp_config: |
| 197 | + param_offload: true # was false |
| 198 | +``` |
| 199 | + |
| 200 | +This frees ~30 GB on the GPU (Adam state + ref policy go to CPU; vLLM |
| 201 | +gives back rollout-cache memory). |
| 202 | + |
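The budget arithmetic can be written down so it is easy to redo for other settings. This is a back-of-the-envelope sketch using only the figures quoted in this section. The "unitemized" remainder is derived from the ≈ 84 GB total and was not measured, and the after-fix peak is only a rough estimate.

```python
GPU_TOTAL_GB = 80        # single A100
ADAM_STATE_GB = 12       # from the root-cause estimate
REF_POLICY_GB = 3
ESTIMATED_PEAK_GB = 84   # rough total before the fix


def vllm_reservation_gb(gpu_memory_utilization: float) -> float:
    """Memory vLLM reserves, taken as a fraction of the whole card."""
    return gpu_memory_utilization * GPU_TOTAL_GB


before_vllm = vllm_reservation_gb(0.6)
after_vllm = vllm_reservation_gb(0.4)

# What the estimate does not itemize: policy weights and activations.
unitemized_gb = ESTIMATED_PEAK_GB - ADAM_STATE_GB - REF_POLICY_GB - before_vllm

# Adam state and ref policy move to CPU; vLLM returns the difference.
freed_gb = ADAM_STATE_GB + REF_POLICY_GB + (before_vllm - after_vllm)
peak_after_gb = ESTIMATED_PEAK_GB - freed_gb

print(f"unitemized (weights + activations): {unitemized_gb:.0f} GB")
print(f"freed by offload + smaller rollout cache: {freed_gb:.0f} GB")
print(f"estimated peak after fix: {peak_after_gb:.0f} GB of {GPU_TOTAL_GB} GB")
```

The freed total comes out at 12 + 3 + 16 = 31 GB, which matches the ~30 GB quoted above.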
| 203 | +**Lesson**: GPU OOM at the policy-training boundary is normal — every |
| 204 | +RL framework has this dance. Offloading is the standard remedy. |
| 205 | + |
| 206 | +--- |
| 207 | + |
| 208 | +## 10 — CPU OOM at 32 GB host (Ray worker killed) |
| 209 | + |
| 210 | +**Symptom**: `OutOfMemoryError: Task was killed due to the node running |
| 211 | +low on memory. Memory on the node was 30.91GB / 32.00GB`. |
| 212 | + |
| 213 | +**Root cause**: previous fix moved Adam + ref to CPU. One Ray worker now |
| 214 | +holds ~16 GB. Plus Ray's object store reservation (~30% of host ≈ 10 GB), |
| 215 | +plus vLLM CPU side, plus 8 idle Ray workers. Easily 30+ GB. Submit file |
| 216 | +asked for 32 GB. |
| 217 | + |
| 218 | +**Fix**: bump `request_memory` and lower `request_cpus` in the submit file. After two iterations: |
| 219 | + |
| 220 | +``` |
| 221 | +request_cpus = 4 # was 8 — fewer Ray workers, less RAM |
| 222 | +request_memory = 96GB # was 32 → 64 → 96 |
| 223 | +``` |
| 224 | +
|
| 225 | +**Lesson**: when you turn on CPU offload, host memory budget needs to |
| 226 | +absorb everything you just freed from GPU. Account for Ray's object |
| 227 | +store reservation (default ~30% of total) on top. |
| 228 | +
|
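A rough way to see why 32 GB could not work and why 96 GB leaves slack. The sketch assumes Ray's object store scales as ~30% of the requested memory and that the offload worker stays at ~16 GB. Both numbers come from this section; the model itself is a simplification, since vLLM's CPU side and the idle workers are not quantified here.

```python
OBJECT_STORE_FRACTION = 0.30  # approximate Ray default cited above
OFFLOAD_WORKER_GB = 16        # worker now holding Adam state + ref policy


def host_headroom_gb(request_memory_gb: float) -> float:
    """RAM left over for vLLM's CPU side, idle Ray workers, and the OS."""
    object_store_gb = OBJECT_STORE_FRACTION * request_memory_gb
    return request_memory_gb - OFFLOAD_WORKER_GB - object_store_gb


for request in (32, 64, 96):
    headroom = host_headroom_gb(request)
    print(f"request_memory={request}GB -> headroom ~{headroom:.1f} GB")
```

Under these assumptions a 32 GB request leaves only about 6 GB for everything else, which is consistent with the node sitting at 30.91 GB / 32 GB when the worker was killed.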
| 229 | +--- |
| 230 | +
|
| 231 | +## 11 — Over-provisioned disk (200 GB) |
| 232 | +
|
| 233 | +**Symptom**: my first `request_disk` guess was 200 GB. User pushed back: |
| 234 | +"I'm only training 1.5B, why 200 GB?" |
| 235 | +
|
| 236 | +**Root cause**: I sized for full run (10 checkpoints × 15 GB each = 150 GB) |
| 237 | +without checking whether checkpoint pruning was on. |
| 238 | +
|
| 239 | +**Fix**: enable checkpoint pruning in verl config; lower disk request. |
| 240 | +
|
| 241 | +```yaml |
| 242 | +trainer: |
| 243 | + max_actor_ckpt_to_keep: 2 # only keep 2 most recent |
| 244 | +``` |
| 245 | + |
| 246 | +``` |
| 247 | +request_disk = 80GB # was 200GB |
| 248 | +``` |
| 249 | + |
| 250 | +Disk math: base+merged (6) + HF cache (3) + 2 checkpoints capped (30) + |
| 251 | +ray spills (5) + buffer (10) ≈ 55 GB. 80 GB is enough. |
| 252 | + |
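The disk math as a small table in code, so the request can be re-derived if the checkpoint size or the keep count changes. All figures are the ones listed above; the variable names are illustrative.

```python
CHECKPOINT_GB = 15
CHECKPOINTS_KEPT = 2      # max_actor_ckpt_to_keep
REQUEST_DISK_GB = 80

disk_budget_gb = {
    "base + merged model": 6,
    "HF cache": 3,
    "checkpoints (pruned)": CHECKPOINTS_KEPT * CHECKPOINT_GB,
    "ray spills": 5,
    "buffer": 10,
}

total_gb = sum(disk_budget_gb.values())
unpruned_checkpoints_gb = 10 * CHECKPOINT_GB  # the original 200 GB sizing

for item, size in disk_budget_gb.items():
    print(f"{item:<24}{size:>4} GB")
print(f"{'total':<24}{total_gb:>4} GB (request: {REQUEST_DISK_GB} GB)")
print(f"checkpoints without pruning: {unpruned_checkpoints_gb} GB")
```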
| 253 | +**Lesson**: always ask "what does this resource ask buy me?" before |
| 254 | +defaulting upward. A user reviewing config is a sanity check; honor it. |
| 255 | + |
| 256 | +--- |
| 257 | + |
| 258 | +## Final working sizing (1.5B GRPO with offloading) |
| 259 | + |
| 260 | +``` |
| 261 | +request_cpus = 4 |
| 262 | +request_memory = 96GB |
| 263 | +request_disk = 80GB |
| 264 | +request_gpus = 1 |
| 265 | +require_gpus = (GlobalMemoryMb >= 40000) && (Capability >= 8.0) |
| 266 | ++GPUJobLength = "medium" # up to 24h |
| 267 | +``` |
| 268 | + |
| 269 | +```yaml |
| 270 | +actor_rollout_ref: |
| 271 | + actor: |
| 272 | + fsdp_config: |
| 273 | + param_offload: true |
| 274 | + optimizer_offload: true |
| 275 | + rollout: |
| 276 | + gpu_memory_utilization: 0.4 |
| 277 | + ref: |
| 278 | + fsdp_config: |
| 279 | + param_offload: true |
| 280 | + |
| 281 | +trainer: |
| 282 | + max_actor_ckpt_to_keep: 2 |
| 283 | +``` |
| 284 | +
|
| 285 | +Smoke run (50 steps): ~1.5 hours wall clock from job-start to checkpoint. |
| 286 | +
|
| 287 | +--- |
| 288 | +
|
| 289 | +## Rules of thumb banked from this exercise |
| 290 | +
|
| 291 | +1. **Hydra `--config-path` is relative to the entry point's source.** Use |
| 292 | + absolute paths when invoking third-party Hydra apps. |
| 293 | +2. **Inherit from upstream's defaults** rather than handcrafting strict- |
| 294 | + mode configs key-by-key. Two same-shape errors = pivot to inheritance. |
| 295 | +3. **Absolute imports inside any module a third-party tool loads by path.** |
| 296 | +4. **"Disabled" features can still need valid inputs.** Read the |
| 297 | + code path, not just the docstring. |
| 298 | +5. **GPU OOM remedy is offload-then-grow-host-RAM.** Account for Ray's |
| 299 | + object store (~30% of host RAM by default) when sizing the host. |
| 300 | +6. **Don't size resources up by default; user pushback is a real review.** Cite math when |
| 301 | + asking for compute, and revisit when challenged. |
| 302 | +7. **Checkpoint pruning is non-optional** for any RL run that lasts more |
| 303 | + than a few save_freq cycles. `max_actor_ckpt_to_keep: 2` is a sane default. |
| 304 | +8. **Smoke first, full second** — every fix cycle is a test of the smoke |
| 305 | + pipeline. Burning a 10-hour full run on a misconfigured pipeline is |
| 306 | + the only kind of waste worth fearing. |
| 307 | + |
| 308 | +## Generalizing for 7B (estimate, untested) |
| 309 | + |
| 310 | +``` |
| 311 | +request_cpus = 8 |
| 312 | +request_memory = 160GB # ~1.7x the 1.5B request (Adam state alone is ~56 GB) |
| 313 | +request_disk = 160GB # checkpoints are ~50 GB each |
| 314 | +request_gpus = 4 |
| 315 | +``` |
| 316 | +
|
| 317 | +```yaml |
| 318 | +actor_rollout_ref: |
| 319 | + rollout: |
| 320 | + gpu_memory_utilization: 0.5 # tighter at 7B |
| 321 | + tensor_model_parallel_size: 2 |
| 322 | + actor: |
| 323 | + fsdp_config: |
| 324 | + optimizer_offload: true # mandatory at 7B |
| 325 | +``` |
| 326 | + |
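Both Adam figures in this log (12 GB at 1.5B, ~56 GB at 7B) are consistent with 8 bytes per parameter, i.e. two fp32 moment buffers. That per-parameter cost is an assumption inferred from the numbers; it was not read out of verl. A sketch for re-deriving it at other model sizes:

```python
BYTES_PER_PARAM_ADAM = 8  # assumption: two fp32 moments, 4 + 4 bytes


def adam_state_gb(num_params: float) -> float:
    """Approximate Adam optimizer state in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM_ADAM / 1e9


for label, params in (("1.5B", 1.5e9), ("7B", 7e9)):
    print(f"{label}: ~{adam_state_gb(params):.0f} GB of Adam state")
```

With optimizer offload on, this is host RAM and not GPU memory, which is why it drives `request_memory` in the estimate above.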
| 327 | +Bump and verify when we get there in Week 5. |