docs: grpo-chtc-debugging-log — 11 issues hit during week 4 task 5, in order

Devesh-Maheshwari · Devesh-Maheshwari · commit 4378b8affb18 · 2026-04-25T14:36:54.000-05:00
diff --git a/docs/grpo-chtc-debugging-log.md b/docs/grpo-chtc-debugging-log.md
@@ -0,0 +1,327 @@
+# GRPO + verl + CHTC — debugging log
+
+> Iterative log of every issue we hit getting `verl.trainer.main_ppo` to
+> actually run a smoke GRPO step on CHTC, in order of occurrence. Read top
+> to bottom to see the failure cascade; each fix unlocks the next layer.
+> Spec: Qwen-2.5-Coder-1.5B + LoRA-merged SFT warm-start, single A100,
+> verlai/verl:vllm012.latest container.
+
+## Setup that was already working before this list
+
+By Week 4 Task 4 we already had:
+- HTCondor submit + execute scripts (`chtc/train_grpo.{sub,sh}`)
+- `USER` / `LOGNAME` exported (no `/etc/passwd` entry in container)
+- UUID-form `CUDA_VISIBLE_DEVICES` remap to integer index
+- Tarball-of-repo flow from `/staging/`
+- Resilient tar pattern (always emit `checkpoint.tar.gz`, even on training failure, so HTCondor's output transfer never holds the job)
+- Merge LoRA adapter → fresh `merged_<name>` dir as starting policy
+- Pre-warm of evalplus datasets
+
+These came from Weeks 2 and 3 and are documented in their summary files.
+
+---
+
+## 1 — `ModuleNotFoundError: No module named 'verl'`
+
+**Symptom**: container started, code transferred, `pip install -e .[dev,gpu]`
+ran fine, then `python -m verl.trainer.main_ppo` failed at import.
+
+**Root cause**: `verlai/verl:vllm012.latest` is a *base image* — it has
+verl's dependencies (torch, vllm, ray) but not verl itself. The image
+name is misleading.
+
+**Fix**: clone + pip install verl inside the job, pinned to the same
+version `llm-starter` uses.
+
+```bash
+git clone https://github.com/volcengine/verl.git -b v0.7.0 --depth 1
+pip install -e verl --quiet
+```
+
+**Lesson**: never assume an image with a project's name has the project
+*installed*. `docker run --rm <image> python -c "import <pkg>"` settles
+it in 30 seconds.
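+
+The same check can be run from Python inside the job before launching
+training. This is a minimal sketch using only the standard library; the
+helper name `is_importable` and the package list are illustrative
+assumptions, not part of verl or this repo.
+
+```python
+import importlib.util
+
+
+def is_importable(package: str) -> bool:
+    """Return True if a top-level `package` can be located on sys.path."""
+    # find_spec locates a top-level module without executing it.
+    return importlib.util.find_spec(package) is not None
+
+
+if __name__ == "__main__":
+    # Hypothetical pre-flight list: the image's dependencies plus verl itself.
+    for name in ("torch", "vllm", "ray", "verl"):
+        status = "ok" if is_importable(name) else "MISSING"
+        print(f"{name}: {status}")
+```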
+
+---
+
+## 2 — Hydra: "Primary config directory not found"
+
+**Symptom**: `--config-path=configs --config-name=grpo_qwen1_5b` printed
+the error `Primary config directory not found. Check that the config
+directory '.../verl/verl/trainer/configs' exists`.
+
+**Root cause**: Hydra resolves `--config-path` **relative to the entry
+point's source dir**, not the cwd. Entry point was `verl.trainer.main_ppo`,
+so Hydra looked for `verl/trainer/configs/` — i.e., inside verl's source.
+
+**Fix**: pass an absolute path computed at runtime.
+
+```bash
+CONFIG_PATH_ABS="$(pwd)/configs"
+python -m verl.trainer.main_ppo --config-path="${CONFIG_PATH_ABS}" ...
+```
+
+**Lesson**: Hydra's `--config-path` semantics differ from most CLI tools.
+For invocations of third-party Hydra apps, always pass absolute paths.
+
+---
+
+## 3, 4, 5 — `omegaconf.errors.ConfigAttributeError: Key '...' is not in struct`
+
+**Symptoms** (three rounds, each blocking the next):
+- `Key 'ray_kwargs' is not in struct`
+- `Key 'transfer_queue' is not in struct`
+- `Key 'global_profiler' is not in struct`
+
+**Root cause**: verl 0.7's config is in struct mode (strict schema). It
+expects ~50 top-level keys. Our minimal YAML only had a handful.
+
+**Fix #1 (additive, then abandoned)**: add each missing key by hand.
+After three rounds it was clear that more were coming.
+
+**Fix #2 (one-shot)**: Hydra inheritance from verl's full default config.
+
+```yaml
+# In configs/grpo_qwen1_5b.yaml:
+defaults:
+  - ppo_trainer
+  - _self_
+```
+
+```bash
+# In train_grpo.sh:
+VERL_CONFIG_DIR="$(pwd)/verl/verl/trainer/config"
+python -m verl.trainer.main_ppo \
+    --config-path="${CONFIG_PATH_ABS}" \
+    --config-name="${CONFIG_NAME}" \
+    "hydra.searchpath=[${VERL_CONFIG_DIR}]" \
+    ...
+```
+
+**Lesson**: when a third-party tool's strict-mode config has many fields,
+inherit its full default config rather than cherry-picking what looks
+relevant. Two errors of the same shape = signal to switch to inheritance.
+
+---
+
+## 6 — `TypeError: 'NoneType' object is not subscriptable` on val data
+
+**Symptom**: verl tried to `copy_to_local(src=...)` on the val parquet
+where `src` was `None`. Our config had `data.val_files: null`.
+
+**Root cause**: verl always *loads* a val dataset, even when validation
+itself is disabled (`test_freq: -1`, `val_before_train: false`).
+
+**Fix**: point `data.val_files` at the train parquet. Verl loads it but
+never iterates because validation is disabled.
+
+```yaml
+data:
+  train_files: results/grpo_dataset/v1/train.parquet
+  val_files: results/grpo_dataset/v1/train.parquet  # placeholder; never iterated
+```
+
+**Lesson**: with strict-mode tools, "disabled" features can still require
+non-null inputs. Read the code path, not just the docs.
+
+---
+
+## 7 — `ValueError: Load format 'dummy_dtensor' is not supported`
+
+**Symptom**: vLLM rollout engine refused to start.
+
+**Root cause**: `load_format: dummy_dtensor` is a verl-specific value
+that dates from older vLLM versions. The vLLM in our image rejects it.
+
+**Fix**: drop the line entirely and let verl's default (inherited from
+`ppo_trainer.yaml` via Hydra) supply a compatible value.
+
+```yaml
+rollout:
+  # load_format inherited — modern vLLM rejects "dummy_dtensor"
+  ...
+```
+
+**Lesson**: when in doubt, omit explicit values and let inheritance from
+the upstream's defaults win. Override only what we deliberately customize.
+
+---
+
+## 8 — `ImportError: attempted relative import with no known parent package`
+
+**Symptom**: verl imported our reward function and crashed at
+`from ..agents.verifier import SubprocessVerifier`.
+
+**Root cause**: verl's `load_module(path)` loads our `grpo_reward.py` *by
+file path*, outside any package context. Relative imports
+(`from ..agents...`) need a parent package.
+
+**Fix**: switch to absolute imports — they always resolve via `sys.path`,
+regardless of how the module was loaded.
+
+```python
+# Before
+from ..agents.verifier import SubprocessVerifier
+
+# After
+from verifiable_rl_coder.agents.verifier import SubprocessVerifier
+```
+
+**Lesson**: any module a third-party tool loads via `importlib.util.spec_from_file_location`
+or similar must use absolute imports. Relative imports are an internal
+package convention, not portable.
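+
+The failure can be reproduced without verl. A minimal sketch, assuming
+verl's loader behaves like the standard
+`importlib.util.spec_from_file_location` pattern; `load_by_path`,
+`try_load`, and the throwaway file name `reward_fn.py` are hypothetical.
+
+```python
+import importlib.util
+import tempfile
+from pathlib import Path
+
+
+def load_by_path(path: Path):
+    """Load a .py file by path, outside any package context."""
+    spec = importlib.util.spec_from_file_location(path.stem, path)
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module
+
+
+def try_load(source: str) -> str:
+    """Write `source` to a temp file, load it by path, report the outcome."""
+    with tempfile.TemporaryDirectory() as tmp:
+        path = Path(tmp) / "reward_fn.py"
+        path.write_text(source)
+        try:
+            load_by_path(path)
+        except ImportError as exc:
+            return f"ImportError: {exc}"
+        return "ok"
+
+
+if __name__ == "__main__":
+    # Relative import: there is no parent package to resolve ".." against.
+    print(try_load("from ..agents.verifier import SubprocessVerifier\n"))
+    # Absolute import: resolved through sys.path regardless of the loader.
+    print(try_load("from json import dumps\n"))
+```
+
+On CPython 3 the first call should report `attempted relative import
+with no known parent package`, and the second should load cleanly.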
+
+---
+
+## 9 — GPU OOM (worker SIGKILLed mid-rollout)
+
+**Symptom**: `Worker unexpectedly exits with a connection error code 2.
+End of file. ... process killed by the OOM killer`.
+
+**Root cause**: 1.5B model + Adam optimizer state (12 GB) + reference
+policy (3 GB) + vLLM rollout cache (60% of 80 GB = 48 GB) + activations
+≈ 84 GB. A single A100 has 80 GB, so the job was over budget.
+
+**Fix**: enable CPU offloading for non-rollout-critical state.
+
+```yaml
+actor_rollout_ref:
+  actor:
+    fsdp_config:
+      param_offload: true        # was false
+      optimizer_offload: true    # was false
+  rollout:
+    gpu_memory_utilization: 0.4  # was 0.6 — gives back ~16 GB to training
+  ref:
+    fsdp_config:
+      param_offload: true        # was false
+```
+
+This frees ~30 GB on the GPU (Adam state + ref policy go to CPU; vLLM
+gives back rollout-cache memory).
+
+**Lesson**: GPU OOM at the policy-training boundary is normal — every
+RL framework has this dance. Offloading is the standard remedy.
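+
+The budget in this section can be written out as arithmetic. A rough
+sketch using only the figures above: the 21 GB for weights plus
+activations is the residual implied by the ≈ 84 GB total, not a
+measurement, and the model ignores any actor weights that
+`param_offload` moves off the GPU.
+
+```python
+GPU_TOTAL_GB = 80  # single A100
+
+ADAM_STATE_GB = 12
+REF_POLICY_GB = 3
+# Assumption: residual implied by the ~84 GB total (weights + activations).
+WEIGHTS_AND_ACTIVATIONS_GB = 21
+
+
+def gpu_budget_gb(vllm_fraction: float, offload: bool) -> float:
+    """Sum the GPU-resident components for a given vLLM memory fraction."""
+    vllm_cache = vllm_fraction * GPU_TOTAL_GB
+    resident = WEIGHTS_AND_ACTIVATIONS_GB + vllm_cache
+    if not offload:
+        # Without offload, Adam state and the reference policy stay on GPU.
+        resident += ADAM_STATE_GB + REF_POLICY_GB
+    return resident
+
+
+if __name__ == "__main__":
+    before = gpu_budget_gb(0.6, offload=False)
+    after = gpu_budget_gb(0.4, offload=True)
+    print(f"before: {before:.0f} GB of {GPU_TOTAL_GB} GB")
+    print(f"after:  {after:.0f} GB of {GPU_TOTAL_GB} GB")
+    print(f"freed:  {before - after:.0f} GB")
+```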
+
+---
+
+## 10 — CPU OOM at 32 GB host (Ray worker killed)
+
+**Symptom**: `OutOfMemoryError: Task was killed due to the node running
+low on memory. Memory on the node was 30.91GB / 32.00GB`.
+
+**Root cause**: previous fix moved Adam + ref to CPU. One Ray worker now
+holds ~16 GB. Plus Ray's object store reservation (~30% of host ≈ 10 GB),
+plus vLLM CPU side, plus 8 idle Ray workers. Easily 30+ GB. Submit file
+asked for 32 GB.
+
+**Fix**: bump `request_memory` in submit file. After two iterations:
+
+```
+# was 8; fewer Ray workers, less RAM
+request_cpus   = 4
+# was 32GB, then 64GB, then 96GB
+request_memory = 96GB
+```
+
+**Lesson**: when you turn on CPU offload, host memory budget needs to
+absorb everything you just freed from GPU. Account for Ray's object
+store reservation (default ~30% of total) on top.
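+
+A quick way to see why 32 GB could not work is to subtract the two known
+costs from each candidate request. This is a sketch under stated
+assumptions: Ray sizes its object store from the `request_memory` limit,
+the offloaded worker stays at ~16 GB, and `host_headroom_gb` is a
+hypothetical helper. Whatever remains has to cover vLLM's CPU side and
+the idle Ray workers.
+
+```python
+RAY_OBJECT_STORE_FRACTION = 0.30  # Ray default cited in this section
+OFFLOADED_WORKER_GB = 16          # worker holding Adam state + ref policy
+
+
+def host_headroom_gb(request_memory_gb: float) -> float:
+    """GB left after the object store reservation and the offloaded worker."""
+    object_store = RAY_OBJECT_STORE_FRACTION * request_memory_gb
+    return request_memory_gb - object_store - OFFLOADED_WORKER_GB
+
+
+if __name__ == "__main__":
+    for request in (32, 64, 96):
+        headroom = host_headroom_gb(request)
+        print(f"request_memory={request}GB -> headroom {headroom:.1f} GB")
+```
+
+Note that the object store reservation grows with the request, so
+doubling `request_memory` does not double the usable headroom.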
+
+---
+
+## 11 — Over-provisioned disk (200 GB)
+
+**Symptom**: not a crash. My first guess for `request_disk` was 200 GB,
+and the user pushed back: "I'm only training 1.5B, why 200 GB?"
+
+**Root cause**: I sized for a full run (10 checkpoints × 15 GB each =
+150 GB) without checking whether checkpoint pruning was on.
+
+**Fix**: enable checkpoint pruning in verl config; lower disk request.
+
+```yaml
+trainer:
+  max_actor_ckpt_to_keep: 2  # only keep 2 most recent
+```
+
+```
+# was 200GB
+request_disk = 80GB
+```
+
+Disk math: base+merged (6) + HF cache (3) + 2 checkpoints capped (30) +
+ray spills (5) + buffer (10) ≈ 55 GB. 80 GB is enough.
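+
+The same disk math as a small script, which makes it easy to redo when a
+line item changes. The figures are the ones in this section; the
+dictionary keys and `total_disk_gb` are illustrative names.
+
+```python
+CHECKPOINT_GB = 15
+
+DISK_BUDGET_GB = {
+    "base + merged model": 6,
+    "HF cache": 3,
+    "checkpoints (2 kept)": 2 * CHECKPOINT_GB,
+    "ray spills": 5,
+    "buffer": 10,
+}
+
+
+def total_disk_gb(budget: dict) -> int:
+    """Sum every line item in the disk budget."""
+    return sum(budget.values())
+
+
+if __name__ == "__main__":
+    pruned = total_disk_gb(DISK_BUDGET_GB)
+    # Same budget with all 10 checkpoints kept instead of 2.
+    unpruned = pruned - 2 * CHECKPOINT_GB + 10 * CHECKPOINT_GB
+    print(f"with pruning:    {pruned} GB")
+    print(f"without pruning: {unpruned} GB")
+```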
+
+**Lesson**: always ask "what does this resource ask buy me?" before
+defaulting upward. A user reviewing config is a sanity check; honor it.
+
+---
+
+## Final working sizing (1.5B GRPO with offloading)
+
+```
+request_cpus   = 4
+request_memory = 96GB
+request_disk   = 80GB
+request_gpus   = 1
+require_gpus   = (GlobalMemoryMb >= 40000) && (Capability >= 8.0)
+# "medium" allows up to 24h
++GPUJobLength  = "medium"
+```
+
+```yaml
+actor_rollout_ref:
+  actor:
+    fsdp_config:
+      param_offload: true
+      optimizer_offload: true
+  rollout:
+    gpu_memory_utilization: 0.4
+  ref:
+    fsdp_config:
+      param_offload: true
+
+trainer:
+  max_actor_ckpt_to_keep: 2
+```
+
+Smoke run (50 steps): ~1.5 hours wall clock from job-start to checkpoint.
+
+---
+
+## Rules of thumb banked from this exercise
+
+1. **Hydra `--config-path` is relative to the entry point's source.** Use
+   absolute paths when invoking third-party Hydra apps.
+2. **Inherit from upstream's defaults** rather than handcrafting strict-
+   mode configs key-by-key. Two same-shape errors = pivot to inheritance.
+3. **Absolute imports inside any module a third-party loads by path.**
+4. **"Disabled" features can still need valid inputs.** Read the
+   code path, not just the docs.
+5. **GPU OOM remedy is offload-then-grow-host-RAM.** Account for Ray's
+   object store (~30% of host RAM by default) when sizing the host.
+6. **Don't size resources up by default; user pushback is a real
+   review.** Cite math when asking for compute, and revisit when
+   challenged.
+7. **Checkpoint pruning is non-optional** for any RL run that lasts more
+   than a few save_freq cycles. `max_actor_ckpt_to_keep: 2` is a sane
+   default.
+8. **Smoke first, full second** — every fix cycle is a test of the smoke
+   pipeline. Burning a 10-hour full run on a misconfigured pipeline is
+   the only kind of waste worth fearing.
+
+## Generalizing for 7B (estimate, untested)
+
+```
+request_cpus   = 8
+# ~1.7x the 1.5B request (Adam state alone is ~56 GB)
+request_memory = 160GB
+# checkpoints are ~50 GB each
+request_disk   = 160GB
+request_gpus   = 4
+```
+
+```yaml
+actor_rollout_ref:
+  rollout:
+    gpu_memory_utilization: 0.5       # tighter at 7B
+    tensor_model_parallel_size: 2
+  actor:
+    fsdp_config:
+      optimizer_offload: true         # mandatory at 7B
+```
+
+Bump and verify when we get there in Week 5.
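+
+The Adam figures in this log (12 GB at 1.5B, ~56 GB at 7B) are
+consistent with a rule of thumb of 8 bytes per parameter, i.e. two fp32
+moment tensors. The sketch below assumes that rule; it excludes weights,
+gradients, and any fp32 master copy, and `adam_state_gb` is a
+hypothetical helper.
+
+```python
+# Assumption: two fp32 moment tensors per parameter, 4 bytes each.
+BYTES_PER_PARAM_ADAM = 8
+
+
+def adam_state_gb(num_params: float) -> float:
+    """Approximate Adam optimizer state size in GB (1 GB = 1e9 bytes)."""
+    return num_params * BYTES_PER_PARAM_ADAM / 1e9
+
+
+if __name__ == "__main__":
+    for label, params in (("1.5B", 1.5e9), ("7B", 7e9)):
+        print(f"{label}: ~{adam_state_gb(params):.0f} GB of Adam state")
+```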