Goal
Pack more RL training jobs into the same GPU cluster. In our ROLL integration, RLix achieved ~3x rollout throughput gain on SWE-agent RL training by time-sharing GPUs across pipelines instead of exclusive allocation.
Summary
RLix is a Ray-based multi-pipeline GPU orchestration system. Instead of each researcher waiting for an exclusive GPU allocation, RLix lets multiple RL training pipelines share the same GPUs. More experiments run concurrently, less time spent waiting for resources. The scheduler dynamically assigns GPU time across pipelines based on priority and demand. Currently RLix supports ROLL as the training framework; this RFC proposes adding NeMo RL as a second supported framework.
We propose changes to NeMo RL that enable its async GRPO training to participate in RLix multi-pipeline scheduling. The key capabilities needed:
- Scheduler-driven shard resize — the scheduler controls when individual inference DP shards sleep/wake, rather than the training loop doing all-or-nothing sleep/wake
- CPU weight caching + selective sync — after training, weights are cached on CPU and synced only to the specific shards that need updating. In a multi-pipeline setting, we cannot afford to broadcast weights to all vLLM workers across all pipelines — each pipeline must sync independently to its own shards without blocking others
- Progress reporting — pipelines report generation demand so RLix can make informed allocation decisions
One mechanism this enables is partial overlap, where inference GPUs are a superset of training GPUs — overlap GPUs alternate between roles via vLLM sleep(level=2) / wake_up() while non-overlap GPUs continue inference.
Scope: vLLM inference + Megatron training + async GRPO only. All changes are gated behind RLIX_CONTROL_PLANE=rlix env var — standalone NeMo RL behavior is unaffected.
Review Asks
- API shape — Are the proposed new methods on VLLMGeneration, RayWorkerGroup, and MegatronPolicyWorker reasonable?
- Lifecycle hooks — Is the RLixHooks protocol (4 methods: before/after_training called from async_grpo_train(), report/clear_progress called from AsyncTrajectoryCollector) an acceptable integration pattern?
Design Overview
Why CPU weight cache + selective sync (not native refit)
In a multi-pipeline setting, each pipeline's training and inference workers share GPU time with other pipelines. Two problems make NeMo RL's native refit_policy_generation() insufficient:
- Full broadcast causes OOM. refit_policy_generation() broadcasts weights to all inference workers via NCCL collective, requiring the full model staged on GPU while receivers already have their weights loaded. With multiple pipelines sharing GPUs, this is not affordable — each pipeline must sync only to its own shards that actually need updating, not all workers.
- GPU VRAM conflict. Refit requires training weights on GPU when sending (CUDA IPC handles or NCCL broadcast). But when inference workers need to wake up on the same GPUs, both cannot coexist → OOM.
Solution: CPU weight cache + selective sync. After each training step, snapshot weights to CPU, release the GPU. Sync only to the specific DP shards that need it, in chunks (staged CPU → GPU, freed after send) to control peak VRAM. Other shards and other pipelines are unaffected. The routing layer reuses NeMo RL's existing ZMQ IPC and NCCL broadcast transport — no modifications to transport code.
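The chunking pattern above can be sketched in isolation. This is an illustrative sketch, not the proposed NeMo RL code: the CPU cache is a plain dict, `send_fn` stands in for the ZMQ/NCCL transport, and all names are hypothetical. The point it shows is that only the target DP shards receive weights, and that peak staged memory is bounded by the chunk size because each chunk is freed before the next one is staged.

```python
# Hypothetical sketch of chunked selective sync. In the real system each
# chunk would be staged CPU -> GPU before sending and freed after; here
# we just track how many tensors are staged at once.

def selective_sync(cpu_cache, target_dp_ranks, send_fn, chunk_size=2):
    """Send cached weights to only the given DP shards, chunk by chunk."""
    names = sorted(cpu_cache)
    peak_staged = 0
    for i in range(0, len(names), chunk_size):
        staged = {n: cpu_cache[n] for n in names[i:i + chunk_size]}
        peak_staged = max(peak_staged, len(staged))
        for rank in target_dp_ranks:     # other shards and pipelines untouched
            send_fn(rank, staged)
        staged.clear()                   # free before staging the next chunk
    return peak_staged
```

With a 3-tensor cache and `chunk_size=2`, at most 2 tensors are ever staged at once, regardless of model size.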
Abort-drain-sleep for shard preemption
When RLix reclaims GPUs from inference, we cannot simply call sleep() on workers with active requests — vLLM would crash accessing offloaded GPU memory. The safe sequence:
- Mark preempted — block new request dispatch to these shards
- Abort — worker calls engine.abort() for all running requests internally (no request ID plumbing needed from the generation layer)
- Drain — poll the existing vLLM engine metric vllm:num_requests_running until it reaches 0 (already tracked per worker at vllm_worker_async.py:241). A drain timeout with force-sleep fallback covers the case where the metric never reaches 0.
- Sleep — engine confirmed idle, GPU memory safely offloaded
Callers with in-flight requests on preempted shards receive errors, classified as ShardPreemptedError (via _preempted_shards flag check) and automatically re-dispatched to the next active shard.
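The four-step sequence can be sketched as a small driver against a mock worker. `abort_all_requests` and `is_idle` mirror the worker methods proposed below; `mark_preempted` and the drain timeout values are illustrative assumptions, not existing API.

```python
# Sketch of the abort -> drain -> sleep sequence. If the engine never
# reports idle before the deadline, we fall through and force-sleep.

import time

def preempt_shard(worker, drain_timeout_s=30.0, poll_interval_s=0.01):
    worker.mark_preempted()          # 1. block new request dispatch
    worker.abort_all_requests()      # 2. abort running requests in-engine
    deadline = time.monotonic() + drain_timeout_s
    while not worker.is_idle():      # 3. drain: poll num_requests_running
        if time.monotonic() > deadline:
            break                    # force-sleep fallback
        time.sleep(poll_interval_s)
    worker.sleep(level=2)            # 4. offload weights + KV cache
```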
Training cycle under RLix
```python
DO_TIME_SHARING = os.environ.get("RLIX_CONTROL_PLANE") == "rlix"

# After each training step:
if DO_TIME_SHARING:
    build_cpu_weight_cache(step)  # snapshot weights to CPU
    offload_training_gpu()        # free GPU VRAM
    destroy_nccl_groups()         # free NCCL communicator buffers
    hooks.after_training(step)    # notify RLix → triggers expand + selective sync
else:
    refit_policy_generation(...)  # existing standalone path (unchanged)
```
Proposed NeMo RL Changes
Ownership overview
| Location | Lines | Notes |
| --- | --- | --- |
| NeMo RL repo (detailed below) | ~490 | All RLix-gated or additive |
| RLix repo (pipeline adapter, model update service, config bridge) | ~720 | Does not affect NeMo RL |
Supporting API additions (~40 lines, 3 files)
Small additive changes, not RLix-gated:
vllm_worker.py (+5): Parameterize hardcoded sleep(level=1) → sleep(level=self._sleep_level) from config. Level 2 offloads weights + KV cache (needed for training to fit).
vllm_worker_async.py (+15): Same sleep level parameterization. New abort_all_requests() method — gets running request IDs from the engine internally and aborts them. New is_idle() -> bool — checks existing vllm:num_requests_running metric (already scraped at line 241). No request ID plumbing needed from callers.
worker_groups.py (+20): New run_on_dp_shard_leaders(dp_ranks, method, ...) — executes a method on the DP-leader worker of each specified rank. Subset variant of existing run_all_workers_single_data(..., run_rank_0_only_axes=...).
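The dispatch pattern behind run_on_dp_shard_leaders can be sketched without Ray. This is a hypothetical reduction: workers are plain objects keyed by (dp_rank, local_rank), whereas the real version issues remote calls on RayWorkerGroup; the leader-selection rule (local rank 0 of each shard) is the part being illustrated.

```python
# Hypothetical sketch: invoke `method` only on the DP-leader (local rank 0)
# of each requested DP shard, returning results keyed by dp_rank.

def run_on_dp_shard_leaders(workers, dp_ranks, method, *args, **kwargs):
    results = {}
    for dp_rank in dp_ranks:
        leader = workers[(dp_rank, 0)]  # DP-leader = local rank 0
        results[dp_rank] = getattr(leader, method)(*args, **kwargs)
    return results
```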
vllm_generation.py — Partial sleep/wake + routing + preemption (+150 lines)
New methods on VLLMGeneration: sleep_partial(dp_ranks), wake_up_partial(dp_ranks), activate_dp_ranks(), mark_dp_ranks_inactive().
New state: _active_dp_ranks: Set[int] (canonical routing set), _preempted_shards: Set[int] (abort window — used for error classification).
Modified _async_generate_base(): round-robin skips sleeping shards, blocks when all shards sleeping (does not raise), converts errors on preempted shards to ShardPreemptedError with automatic re-dispatch retry. No per-request tracking needed — drain uses engine-level idle check.
RLix-gated: Partially. New methods only called from RLix. Routing skip only activates when _active_dp_ranks != all_ranks (never happens in standalone).
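The routing skip can be sketched as a round-robin that only returns active shards. Names here are illustrative, not the actual _async_generate_base internals; the sketch returns None where the real code would block awaiting a wake-up.

```python
# Sketch of round-robin routing over active shards only. Sleeping shards
# are skipped; when every shard is asleep the caller blocks (here: None).

import itertools

def make_router(all_ranks, active_ranks):
    cycle = itertools.cycle(sorted(all_ranks))
    def next_active():
        for _ in range(len(all_ranks)):
            rank = next(cycle)
            if rank in active_ranks:
                return rank
        return None  # all shards sleeping: real code blocks, never raises
    return next_active
```

When `active_ranks` equals `all_ranks` (the standalone case), the skip never fires and dispatch is plain round-robin.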
megatron_policy_worker.py — CPU buffer cache (+60 lines)
New build_cpu_buffer_cache(step): all TP/PP/CP/EP ranks participate in collective gather, only cache owner (pp0/dp0/tp0/cp0) stores the complete model as CPU tensor buffers. Reuses existing gather_all_hf_weights with EP-aware gather.
RLix-gated: Yes.
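The "all ranks gather, one rank stores" semantics can be shown with collectives simulated as a plain dict of per-rank shards; this is a behavioral sketch only, with hypothetical names, not the Megatron gather path.

```python
# Sketch of build_cpu_buffer_cache semantics: every rank contributes its
# shard to the (simulated) gather, but only the cache owner keeps the
# assembled full weight dict; all other ranks keep nothing.

def build_cpu_buffer_cache(shards_by_rank, cache_owner=0):
    caches = {}
    for rank in shards_by_rank:
        if rank == cache_owner:
            full = {}
            for shard in shards_by_rank.values():  # gather from every rank
                full.update(shard)
            caches[rank] = full
        else:
            caches[rank] = None  # non-owners free their staging buffers
    return caches
```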
nccl_offload.py (new) — Megatron NCCL destroy/re-init (+90 lines)
destroy_megatron_nccl_groups(): collects all NCCL process groups from megatron.core.parallel_state, filters to NCCL backend (excluding Gloo), deduplicates handles, and destroys each via torch.distributed.destroy_process_group(). reinit_megatron_nccl_groups(): re-initializes with saved parallel config. Manual approach because destroy_model_parallel() is not designed for repeated use in long-lived workers.
Why: Long-lived training actors keep NCCL communicator buffers on GPU. Without explicit cleanup, inference wake_up() hits OOM.
RLix-gated: Yes. Repeated destroy/re-init cycles in long-lived workers need validation for VRAM leaks and correctness.
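The filter/dedup core of destroy_megatron_nccl_groups can be sketched with the group source and destroy call injected. In the real code, `groups` comes from megatron.core.parallel_state and `destroy` is torch.distributed.destroy_process_group; this isolation is an assumption made to show the logic on its own.

```python
# Sketch: keep only NCCL-backend groups (skip Gloo), deduplicate handles
# that appear under multiple parallel_state attributes, destroy each once.

def destroy_nccl_groups(groups, backend_of, destroy):
    seen = set()
    destroyed = []
    for g in groups:
        if backend_of(g) != "nccl":  # skip Gloo groups
            continue
        if id(g) in seen:            # same handle may be referenced twice
            continue
        seen.add(id(g))
        destroy(g)
        destroyed.append(g)
    return destroyed
```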
grpo.py — Training loop RLix branches (+60 lines)
DO_TIME_SHARING flag adds if/else branches in async_grpo_train() (see "Training cycle under RLix" above). Also skips ray.shutdown() (RLix manages Ray lifecycle) and prepare_for_generation() / finish_generation() (RLix drives sleep/wake).
RLix-gated: Yes — all branches behind DO_TIME_SHARING.
async_utils.py — Progress reporting (+60 lines)
AsyncTrajectoryCollector accepts optional rlix_hooks (defaults to NoOpRLixHooks()). After each trajectory push, reports generation demand as a point-in-time snapshot with 2% granularity. ReplayBuffer gains valid_count(current_weight_version, max_age_steps) -> int for progress calculation.
Why: RLix uses generation demand progress for planning when to reclaim GPUs for training.
RLix-gated: Partially. Hooks are no-ops in standalone. valid_count() is a pure query method.
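The proposed valid_count() query can be sketched as a pure function over buffered trajectories; the trajectory field name is an assumption, since the RFC specifies only the signature.

```python
# Sketch of ReplayBuffer.valid_count: count trajectories whose weight
# version is within max_age_steps of the current training version.

def valid_count(trajectories, current_weight_version, max_age_steps):
    return sum(
        1 for t in trajectories
        if current_weight_version - t["weight_version"] <= max_age_steps
    )
```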
rlix_hooks.py (new) — Hook protocol (+30 lines)
4-method protocol (before_training, after_training, report_progress, clear_progress) + NoOpRLixHooks default. Standalone file because both NeMo RL and RLix code import it.
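A minimal sketch of the protocol and its no-op default, assuming these signatures; the RFC fixes only the method names, so the argument shapes here are illustrative.

```python
# Sketch of the 4-method RLixHooks protocol plus the NoOpRLixHooks
# default used in standalone mode, where every hook does nothing.

from typing import Protocol

class RLixHooks(Protocol):
    def before_training(self, step: int) -> None: ...
    def after_training(self, step: int) -> None: ...
    def report_progress(self, progress: float) -> None: ...
    def clear_progress(self) -> None: ...

class NoOpRLixHooks:
    def before_training(self, step: int) -> None: pass
    def after_training(self, step: int) -> None: pass
    def report_progress(self, progress: float) -> None: pass
    def clear_progress(self) -> None: pass
```

Because it is a Protocol, the RLix-side implementation needs no import of NeMo RL classes beyond this one file, matching the stated reason for keeping it standalone.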
Non-Goals
- Standalone behavior — All changes behind DO_TIME_SHARING or NoOpRLixHooks. Standalone async_grpo_train() is identical to current code.
- Synchronous grpo_train() — Partial overlap provides no value for sync training.
- NeMo-Gym — HTTP-based shard preemption requires a different mechanism (503 middleware on NeMo RL's own FastAPI app). Planned as future work.
- Existing transports — ZMQ IPC and NCCL broadcast used as-is. No modifications.
- Existing refit_policy_generation() — Unchanged. Skipped in RLix mode, works normally in standalone.
- SGLang, DTensor/FSDP2, Multi-LoRA, DPO/SFT — Out of scope.
Validation gates (2-GPU environment)
| Gate | What it validates |
| --- | --- |
| 1: Single NeMo RL pipeline | Partial overlap e2e: sleep/wake, routing, NCCL lifecycle, selective sync |
| 2: Two NeMo RL pipelines | Multi-pipeline GPU time-sharing |
| 3: NeMo RL + ROLL mixed | Cross-framework scheduling |