
Commit c10125c

EricMarcus-ai authored and Superjomn committed
[fsdp] fix: add aggressive_empty_cache at end of init_model to prevent vLLM OOM (verl-project#5384)
### What does this PR do?

Adds `aggressive_empty_cache(force_sync=True)` at the end of `ActorRolloutRefWorker.init_model()` to prevent vLLM from OOMing at startup when colocated on the same GPUs as FSDP.

Related: verl-project#4229, verl-project#4257 (stale)

After the removal of `ExternalZeroMQDistributedExecutor`, vLLM runs in separate MP worker processes instead of inside the FSDP worker process. During `init_model()`, PyTorch's CUDA allocator reserves large transient blocks for full-model loading before FSDP sharding and `sync_module_states` broadcasting. After init, these blocks are no longer needed but remain cached by the allocator (`cudaMalloc`'d, not `cudaFree`'d). Since vLLM now runs in a separate process with its own allocator, it cannot reuse these cached blocks — `cudaMemGetInfo` reports them as "used", and vLLM fails its `gpu_memory_utilization` check.

A previous attempt to fix this (verl-project#4257) went stale. This approach is simpler: one line, no guards needed, and it is a no-op when there is nothing to free.

### Checklist Before Starting

- [x] Search for similar PRs: [aggressive_empty_cache](https://github.com/verl-project/verl/pulls?q=aggressive_empty_cache), [OOM fsdp vllm init](https://github.com/verl-project/verl/pulls?q=OOM+fsdp+vllm+init)
- [x] Format the PR title as `[{modules}] {type}: {description}`

### Test

This cannot be tested in CI because the OOM is a cross-process CUDA memory visibility issue that requires colocated FSDP + vLLM on the same physical GPU to reproduce.

- The call site is exercised by the existing `tests/workers/test_fsdp_workers.py`
- Validated experimentally with a colocated FSDP + vLLM training run (8B VLM, 8x H200, hybrid mode)
- The fix is a no-op when there is no cached memory to free, so it is safe in all configurations

### API and Usage Example

No API changes. The fix is automatic.
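The allocator-vs-driver accounting gap described above can be observed directly. The following is a minimal sketch (the function name is illustrative, not part of this PR), assuming PyTorch is available; it degrades to an empty result when it is not:

```python
def cuda_memory_views() -> dict:
    """Compare the allocator's view of GPU memory with the driver's view.

    Bytes counted in `reserved` but not `allocated` are cached by this
    process's allocator; a separate process (e.g. a colocated vLLM worker)
    still sees them as used via cudaMemGetInfo.
    """
    try:
        import torch
    except ImportError:
        return {}  # PyTorch not installed; nothing to report
    if not torch.cuda.is_available():
        return {}
    free, total = torch.cuda.mem_get_info()  # driver level: cudaMemGetInfo
    return {
        "allocated": torch.cuda.memory_allocated(),  # live tensors
        "reserved": torch.cuda.memory_reserved(),    # cached by the allocator
        "driver_free": free,   # what vLLM's gpu_memory_utilization check sees
        "driver_total": total,
    }
```

On a colocated setup, a large `reserved` minus `allocated` gap right after `init_model()` is exactly the memory vLLM's startup check cannot see as free.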
### Design & Code Changes

One line added at the end of `ActorRolloutRefWorker.init_model()` in `verl/workers/fsdp_workers.py`:

```python
# Free cached GPU memory so colocated vLLM processes can see it via cudaMemGetInfo
aggressive_empty_cache(force_sync=True)
```

This pattern already exists in the codebase:

- `megatron_workers.py:677` — `empty_cache()` at the end of `init_model`
- `fsdp_workers.py:742` — `aggressive_empty_cache` during the `rollout_mode()` context switch
- `engine_workers.py:671` — `aggressive_empty_cache` during rollout mode

The FSDP worker was the only one missing it at init time.

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): all hooks pass.
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). — N/A, no user-facing changes.
- [x] Add unit or end-to-end test(s) — not feasible: requires a multi-process colocated GPU setup to reproduce; the fix is exercised by the existing `test_fsdp_workers.py`.
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1).
- [x] Not related to `recipe` submodule.
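For readers unfamiliar with the helper, here is a hedged sketch of what an "aggressive" cache flush has to do in this colocated setting. This is an illustration of the technique, not verl's actual `aggressive_empty_cache` implementation; the function name and fallback behavior here are assumptions:

```python
def aggressive_empty_cache_sketch(force_sync: bool = False) -> None:
    """Illustrative flush: return cached allocator blocks to the driver.

    After this runs, cudaMemGetInfo in *other* processes (e.g. colocated
    vLLM workers) reports those blocks as free again.
    """
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing to do
    if not torch.cuda.is_available():
        return  # no CUDA device; a no-op, matching the PR's safety claim
    if force_sync:
        # Wait for in-flight kernels so freed tensors have actually been
        # returned to the caching allocator before we flush it.
        torch.cuda.synchronize()
    # cudaFree cached-but-unused blocks, making them visible as free
    # memory to other processes' gpu_memory_utilization checks.
    torch.cuda.empty_cache()
    # Drop CUDA IPC handles kept alive for cross-process tensors.
    torch.cuda.ipc_collect()
```

The `force_sync` step matters because `empty_cache()` can only release blocks the allocator already considers free; synchronizing first ensures pending deallocations have landed.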
1 parent afa6b6c · commit c10125c

File tree

3 files changed: +8 −1 lines changed

verl/workers/engine_workers.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -583,6 +583,9 @@ def init_model(self):
             backend, is_master=(torch.distributed.get_rank() == 0), bucket_size=bucket_size, **engine_kwargs
         )
 
+        # Free cached GPU memory so colocated vLLM processes can see it via cudaMemGetInfo
+        aggressive_empty_cache(force_sync=True)
+
     @register(dispatch_mode=make_nd_compute_dataproto_dispatch_fn(mesh_name="ref"))
     @DistProfiler.annotate(color="olive", role="ref_compute_log_prob")
     @_with_routing_replay_flag(enabled=False)
```
verl/workers/fsdp_workers.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -981,6 +981,9 @@ def init_model(self):
             checkpoint_config=checkpoint_contents,
         )
 
+        # Free cached GPU memory so colocated vLLM processes can see it via cudaMemGetInfo
+        aggressive_empty_cache(force_sync=True)
+
     @register(dispatch_mode=make_nd_compute_dataproto_dispatch_fn(mesh_name="actor"))
     @DistProfiler.annotate(color="red", role="actor_update")
     def update_actor(self, data: DataProto):
```

verl/workers/megatron_workers.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -674,7 +674,8 @@ def init_model(self):
         if not self.config.actor.megatron.use_mbridge:
             self.weight_converter = get_mcore_weight_converter(self.actor_model_config, self.dtype)
 
-        get_torch_device().empty_cache()
+        # Free cached GPU memory so colocated vLLM processes can see it via cudaMemGetInfo
+        aggressive_empty_cache(force_sync=True)
         log_gpu_memory_usage("After init_model finish", logger=logger)
 
     async def rollout_mode(self):
```
