Goal
Pack more RL training jobs into the same GPU cluster. In our ROLL integration, RLix achieved ~3x rollout throughput gain on SWE-agent RL training by time-sharing GPUs across pipelines instead of exclusive allocation.
Summary
RLix is a Ray-based multi-pipeline GPU orchestration system. Instead of each researcher waiting for an exclusive GPU allocation, RLix lets multiple RL training pipelines share the same GPUs. More experiments run concurrently, less time spent waiting for resources. The scheduler dynamically assigns GPU time across pipelines based on priority and demand. Currently RLix supports ROLL as the training framework; this RFC proposes adding NeMo RL as a second supported framework.
We propose changes to NeMo RL that enable its async GRPO training to participate in RLix multi-pipeline scheduling. The key capabilities needed:
- Scheduler-driven shard resize — the scheduler controls when individual inference DP shards sleep/wake, rather than the training loop doing all-or-nothing sleep/wake
- CPU weight caching + selective sync — after training, weights are cached on CPU and synced only to the specific shards that need updating. In a multi-pipeline setting, we cannot afford to broadcast weights to all vLLM workers across all pipelines — each pipeline must sync independently to its own shards without blocking others
- Progress reporting — pipelines report generation demand so RLix can make informed allocation decisions
One mechanism this enables is partial overlap, where inference GPUs are a superset of training GPUs — overlap GPUs alternate between roles via vLLM sleep(level=2) / wake_up() while non-overlap GPUs continue inference.
Scope: vLLM inference + Megatron training + async GRPO only. All changes are gated behind RLIX_CONTROL_PLANE=rlix env var — standalone NeMo RL behavior is unaffected.
Review Asks
- API shape — Are the proposed new methods on VLLMGeneration, RayWorkerGroup, and MegatronPolicyWorker reasonable?
- Lifecycle hooks — Is the RLixHooks protocol (4 methods: before/after_training called from async_grpo_train(), report/clear_progress called from AsyncTrajectoryCollector) an acceptable integration pattern?
Design Overview
Why CPU weight cache + selective sync (not native refit)
In a multi-pipeline setting, each pipeline's training and inference workers share GPU time with other pipelines. Two problems make NeMo RL's native refit_policy_generation() insufficient:
- Full broadcast causes OOM. refit_policy_generation() broadcasts weights to all inference workers via NCCL collective, requiring the full model staged on GPU while receivers already have their weights loaded. With multiple pipelines sharing GPUs, this is not affordable — each pipeline must sync only to its own shards that actually need updating, not all workers.
- GPU VRAM conflict. Refit requires training weights on GPU when sending (CUDA IPC handles or NCCL broadcast). But when inference workers need to wake up on the same GPUs, both cannot coexist → OOM.
Solution: CPU weight cache + selective sync. After each training step, snapshot weights to CPU, release the GPU. Sync only to the specific DP shards that need it, in chunks (staged CPU → GPU, freed after send) to control peak VRAM. Other shards and other pipelines are unaffected. The routing layer reuses NeMo RL's existing ZMQ IPC and NCCL broadcast transport — no modifications to transport code.
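The chunking pattern above can be sketched in isolation. This is an illustrative sketch, not the proposed NeMo RL code: the CPU cache is a plain dict, `send_fn` stands in for the ZMQ/NCCL transport, and all names are hypothetical. The point it shows is that only the target DP shards receive weights, and that peak staged memory is bounded by the chunk size because each chunk is freed before the next one is staged.

```python
# Hypothetical sketch of chunked selective sync. In the real system each
# chunk would be staged CPU -> GPU before sending and freed after; here
# we just track how many tensors are staged at once.

def selective_sync(cpu_cache, target_dp_ranks, send_fn, chunk_size=2):
    """Send cached weights to only the given DP shards, chunk by chunk."""
    names = sorted(cpu_cache)
    peak_staged = 0
    for i in range(0, len(names), chunk_size):
        staged = {n: cpu_cache[n] for n in names[i:i + chunk_size]}
        peak_staged = max(peak_staged, len(staged))
        for rank in target_dp_ranks:     # other shards and pipelines untouched
            send_fn(rank, staged)
        staged.clear()                   # free before staging the next chunk
    return peak_staged
```

With a 3-tensor cache and `chunk_size=2`, at most 2 tensors are ever staged at once, regardless of model size.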
Abort-drain-sleep for shard preemption
When RLix reclaims GPUs from inference, we cannot simply call sleep() on workers with active requests — vLLM would crash accessing offloaded GPU memory. The safe sequence:
- Mark preempted — block new request dispatch to these shards
- Abort — worker calls engine.abort() for all running requests internally (no request ID plumbing needed from the generation layer)
- Drain — poll the existing vLLM engine metric vllm:num_requests_running until it reaches 0 (already tracked per worker at vllm_worker_async.py:241). A drain timeout with force-sleep fallback covers the case where the metric never reaches 0.
- Sleep — engine confirmed idle, GPU memory safely offloaded
Callers with in-flight requests on preempted shards receive errors, classified as ShardPreemptedError (via _preempted_shards flag check) and automatically re-dispatched to the next active shard.
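The four-step sequence can be sketched as a small driver against a mock worker. `abort_all_requests` and `is_idle` mirror the worker methods proposed below; `mark_preempted` and the drain timeout values are illustrative assumptions, not existing API.

```python
# Sketch of the abort -> drain -> sleep sequence. If the engine never
# reports idle before the deadline, we fall through and force-sleep.

import time

def preempt_shard(worker, drain_timeout_s=30.0, poll_interval_s=0.01):
    worker.mark_preempted()          # 1. block new request dispatch
    worker.abort_all_requests()      # 2. abort running requests in-engine
    deadline = time.monotonic() + drain_timeout_s
    while not worker.is_idle():      # 3. drain: poll num_requests_running
        if time.monotonic() > deadline:
            break                    # force-sleep fallback
        time.sleep(poll_interval_s)
    worker.sleep(level=2)            # 4. offload weights + KV cache
```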
Training cycle under RLix
```python
DO_TIME_SHARING = os.environ.get("RLIX_CONTROL_PLANE") == "rlix"

# After each training step:
if DO_TIME_SHARING:
    build_cpu_weight_cache(step)  # snapshot weights to CPU
    offload_training_gpu()        # free GPU VRAM
    destroy_nccl_groups()         # free NCCL communicator buffers
    hooks.after_training(step)    # notify RLix → triggers expand + selective sync
else:
    refit_policy_generation(...)  # existing standalone path (unchanged)
```
Proposed NeMo RL Changes
Ownership overview
| Location | Lines | Notes |
| --- | --- | --- |
| NeMo RL repo (detailed below) | ~490 | All RLix-gated or additive |
| RLix repo (pipeline adapter, model update service, config bridge) | ~720 | Does not affect NeMo RL |
Supporting API additions (~40 lines, 3 files)
Small additive changes, not RLix-gated:
vllm_worker.py (+5): Parameterize hardcoded sleep(level=1) → sleep(level=self._sleep_level) from config. Level 2 offloads weights + KV cache (needed for training to fit).
vllm_worker_async.py (+15): Same sleep level parameterization. New abort_all_requests() method — gets running request IDs from the engine internally and aborts them. New is_idle() -> bool — checks existing vllm:num_requests_running metric (already scraped at line 241). No request ID plumbing needed from callers.
worker_groups.py (+20): New run_on_dp_shard_leaders(dp_ranks, method, ...) — executes a method on the DP-leader worker of each specified rank. Subset variant of existing run_all_workers_single_data(..., run_rank_0_only_axes=...).
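The dispatch pattern behind run_on_dp_shard_leaders can be sketched without Ray. This is a hypothetical reduction: workers are plain objects keyed by (dp_rank, local_rank), whereas the real version issues remote calls on RayWorkerGroup; the leader-selection rule (local rank 0 of each shard) is the part being illustrated.

```python
# Hypothetical sketch: invoke `method` only on the DP-leader (local rank 0)
# of each requested DP shard, returning results keyed by dp_rank.

def run_on_dp_shard_leaders(workers, dp_ranks, method, *args, **kwargs):
    results = {}
    for dp_rank in dp_ranks:
        leader = workers[(dp_rank, 0)]  # DP-leader = local rank 0
        results[dp_rank] = getattr(leader, method)(*args, **kwargs)
    return results
```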
vllm_generation.py — Partial sleep/wake + routing + preemption (+150 lines)
New methods on VLLMGeneration: sleep_partial(dp_ranks), wake_up_partial(dp_ranks), activate_dp_ranks(), mark_dp_ranks_inactive().
New state: _active_dp_ranks: Set[int] (canonical routing set), _preempted_shards: Set[int] (abort window — used for error classification).
Modified _async_generate_base(): round-robin skips sleeping shards, blocks when all shards sleeping (does not raise), converts errors on preempted shards to ShardPreemptedError with automatic re-dispatch retry. No per-request tracking needed — drain uses engine-level idle check.
RLix-gated: Partially. New methods only called from RLix. Routing skip only activates when _active_dp_ranks != all_ranks (never happens in standalone).
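The routing skip can be sketched as a round-robin that only returns active shards. Names here are illustrative, not the actual _async_generate_base internals; the sketch returns None where the real code would block awaiting a wake-up.

```python
# Sketch of round-robin routing over active shards only. Sleeping shards
# are skipped; when every shard is asleep the caller blocks (here: None).

import itertools

def make_router(all_ranks, active_ranks):
    cycle = itertools.cycle(sorted(all_ranks))
    def next_active():
        for _ in range(len(all_ranks)):
            rank = next(cycle)
            if rank in active_ranks:
                return rank
        return None  # all shards sleeping: real code blocks, never raises
    return next_active
```

When `active_ranks` equals `all_ranks` (the standalone case), the skip never fires and dispatch is plain round-robin.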
megatron_policy_worker.py — CPU buffer cache (+60 lines)
New build_cpu_buffer_cache(step): all TP/PP/CP/EP ranks participate in collective gather, only cache owner (pp0/dp0/tp0/cp0) stores the complete model as CPU tensor buffers. Reuses existing gather_all_hf_weights with EP-aware gather.
RLix-gated: Yes.
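The "all ranks gather, one rank stores" semantics can be shown with collectives simulated as a plain dict of per-rank shards; this is a behavioral sketch only, with hypothetical names, not the Megatron gather path.

```python
# Sketch of build_cpu_buffer_cache semantics: every rank contributes its
# shard to the (simulated) gather, but only the cache owner keeps the
# assembled full weight dict; all other ranks keep nothing.

def build_cpu_buffer_cache(shards_by_rank, cache_owner=0):
    caches = {}
    for rank in shards_by_rank:
        if rank == cache_owner:
            full = {}
            for shard in shards_by_rank.values():  # gather from every rank
                full.update(shard)
            caches[rank] = full
        else:
            caches[rank] = None  # non-owners free their staging buffers
    return caches
```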
nccl_offload.py (new) — Megatron NCCL destroy/re-init (+90 lines)
destroy_megatron_nccl_groups(): collects all NCCL process groups from megatron.core.parallel_state, filters to NCCL backend (excluding Gloo), deduplicates handles, and destroys each via torch.distributed.destroy_process_group(). reinit_megatron_nccl_groups(): re-initializes with saved parallel config. Manual approach because destroy_model_parallel() is not designed for repeated use in long-lived workers.
Why: Long-lived training actors keep NCCL communicator buffers on GPU. Without explicit cleanup, inference wake_up() hits OOM.
RLix-gated: Yes. Repeated destroy/re-init cycles in long-lived workers need validation for VRAM leaks and correctness.
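The filter/dedup core of destroy_megatron_nccl_groups can be sketched with the group source and destroy call injected. In the real code, `groups` comes from megatron.core.parallel_state and `destroy` is torch.distributed.destroy_process_group; this isolation is an assumption made to show the logic on its own.

```python
# Sketch: keep only NCCL-backend groups (skip Gloo), deduplicate handles
# that appear under multiple parallel_state attributes, destroy each once.

def destroy_nccl_groups(groups, backend_of, destroy):
    seen = set()
    destroyed = []
    for g in groups:
        if backend_of(g) != "nccl":  # skip Gloo groups
            continue
        if id(g) in seen:            # same handle may be referenced twice
            continue
        seen.add(id(g))
        destroy(g)
        destroyed.append(g)
    return destroyed
```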
grpo.py — Training loop RLix branches (+60 lines)
DO_TIME_SHARING flag adds if/else branches in async_grpo_train() (see "Training cycle under RLix" above). Also skips ray.shutdown() (RLix manages Ray lifecycle) and prepare_for_generation() / finish_generation() (RLix drives sleep/wake).
RLix-gated: Yes — all branches behind DO_TIME_SHARING.
async_utils.py — Progress reporting (+60 lines)
AsyncTrajectoryCollector accepts optional rlix_hooks (defaults to NoOpRLixHooks()). After each trajectory push, reports generation demand as a point-in-time snapshot with 2% granularity. ReplayBuffer gains valid_count(current_weight_version, max_age_steps) -> int for progress calculation.
Why: RLix uses generation demand progress for planning when to reclaim GPUs for training.
RLix-gated: Partially. Hooks are no-ops in standalone. valid_count() is a pure query method.
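The proposed valid_count() query can be sketched as a pure function over buffered trajectories; the trajectory field name is an assumption, since the RFC specifies only the signature.

```python
# Sketch of ReplayBuffer.valid_count: count trajectories whose weight
# version is within max_age_steps of the current training version.

def valid_count(trajectories, current_weight_version, max_age_steps):
    return sum(
        1 for t in trajectories
        if current_weight_version - t["weight_version"] <= max_age_steps
    )
```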
rlix_hooks.py (new) — Hook protocol (+30 lines)
4-method protocol (before_training, after_training, report_progress, clear_progress) + NoOpRLixHooks default. Standalone file because both NeMo RL and RLix code import it.
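A minimal sketch of the protocol and its no-op default, assuming these signatures; the RFC fixes only the method names, so the argument shapes here are illustrative.

```python
# Sketch of the 4-method RLixHooks protocol plus the NoOpRLixHooks
# default used in standalone mode, where every hook does nothing.

from typing import Protocol

class RLixHooks(Protocol):
    def before_training(self, step: int) -> None: ...
    def after_training(self, step: int) -> None: ...
    def report_progress(self, progress: float) -> None: ...
    def clear_progress(self) -> None: ...

class NoOpRLixHooks:
    def before_training(self, step: int) -> None: pass
    def after_training(self, step: int) -> None: pass
    def report_progress(self, progress: float) -> None: pass
    def clear_progress(self) -> None: pass
```

Because it is a Protocol, the RLix-side implementation needs no import of NeMo RL classes beyond this one file, matching the stated reason for keeping it standalone.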
Non-Goals
- Standalone behavior — All changes behind DO_TIME_SHARING or NoOpRLixHooks. Standalone async_grpo_train() is identical to current code.
- Synchronous grpo_train() — Partial overlap provides no value for sync training.
- NeMo-Gym — HTTP-based shard preemption requires a different mechanism (503 middleware on NeMo RL's own FastAPI app). Planned as future work.
- Existing transports — ZMQ IPC and NCCL broadcast used as-is. No modifications.
- Existing refit_policy_generation() — Unchanged. Skipped in RLix mode, works normally in standalone.
- SGLang, DTensor/FSDP2, Multi-LoRA, DPO/SFT — Out of scope.
Validation gates (2-GPU environment)
| Gate | What it validates |
| --- | --- |
| 1: Single NeMo RL pipeline | Partial overlap e2e: sleep/wake, routing, NCCL lifecycle, selective sync |
| 2: Two NeMo RL pipelines | Multi-pipeline GPU time-sharing |
| 3: NeMo RL + ROLL mixed | Cross-framework scheduling |