Commit 9f15e77
[2.7] Pass-Through: Zero Tensor Copy at CJ for Large-Model Federated Training (#4210)
This PR introduces the **pass-through architecture** for
`ClientAPILauncherExecutor`, eliminating tensor materialisation at the
CJ (Client Job) process when large models are exchanged between the FL
server and a subprocess agent.
In large-model federated learning (e.g., 7B–70B LLM fine-tuning), the CJ
process today acts as a blind relay that fully deserializes and
re-serializes every tensor it receives from the FL server before
forwarding to the subprocess. For a 70B float16 model, this consumes
~140 GB of CJ memory and requires two complete network transfers. The
B1 pass-through architecture introduced here removes both costs.
---
NVFlare's multi-hop execution path for `launch_external_process=True`
looks like:
```
FL Server ──serialize──▶ CJ process ──re-serialize──▶ Subprocess agent
```
Each tensor in the global model is handled as follows at CJ:
1. **Server** serializes the model and creates a download transaction
(tensor data lives on the server).
2. **CJ** fully *downloads* every tensor from the server into its own
heap, materialising the complete model in CJ memory.
3. **CJ** re-serializes the model for the subprocess, creating a *new*
download transaction — the subprocess then downloads from CJ.
For large models, this means:
- **CJ peak memory = full model size** (potentially 100s of GB).
- **Two full network transfers**: server → CJ, then CJ → subprocess.
- **CJ becomes a throughput bottleneck** and an OOM risk for any model
that doesn't fit in the CJ process's memory.
This is why such workflows are infeasible for any model larger than
what the CJ machine can hold.
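The relay cost quoted above can be sanity-checked with back-of-the-envelope arithmetic. This is an illustrative cost model, not NVFlare code:

```python
# Rough cost model of the pre-pass-through relay path.
# The 70B / fp16 figures mirror the example above.
BYTES_PER_PARAM_FP16 = 2

def relay_costs(num_params: int) -> dict:
    model_bytes = num_params * BYTES_PER_PARAM_FP16
    return {
        "cj_peak_memory_gb": model_bytes / 1e9,  # CJ materialises the full model
        "network_transfers": 2,                  # server -> CJ, then CJ -> subprocess
        "total_bytes_moved": 2 * model_bytes,
    }

costs = relay_costs(70_000_000_000)  # 70B-parameter fp16 model
print(costs["cj_peak_memory_gb"])    # 140.0
```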
---
With `FOBSContextKey.PASS_THROUGH` enabled on CJ's cell FOBS context,
the data path becomes:
```
FL Server ──stream──▶ CJ (LazyDownloadRef only, no tensor data)
└──forward ref──▶ Subprocess
└──download──▶ FL Server
```
CJ holds **only lightweight placeholders** (< 100 bytes per tensor). The
subprocess downloads each tensor directly from the FL server — CJ is
never involved in the tensor data path.
**`FOBSContextKey.PASS_THROUGH`** (`nvflare/fuel/utils/fobs/__init__.py`)

A new context key that signals `ViaDownloaderDecomposer` to skip the
download step and create lazy placeholders instead.
**`LazyDownloadRef`** (`nvflare/fuel/utils/fobs/decomposers/via_downloader.py`)

A small sentinel object (four fields: `fqcn`, `ref_id`, `item_id`,
`dot`) created by `recompose()` in PASS_THROUGH mode. It carries the
original FL server's FQCN, batch ref_id, intra-batch item ID, and Datum
Object Type — everything the subprocess needs to download the tensor
directly.
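The PR text does not reproduce the class body; a minimal sketch consistent with the description (four `__slots__` fields, tiny footprint) could look like this, with the constructor signature being an assumption:

```python
import sys

class LazyDownloadRef:
    """Lightweight placeholder held by CJ instead of tensor data (sketch)."""
    __slots__ = ("fqcn", "ref_id", "item_id", "dot")

    def __init__(self, fqcn: str, ref_id: str, item_id: int, dot: int):
        self.fqcn = fqcn        # FQCN of the original FL server cell
        self.ref_id = ref_id    # download batch reference on the server
        self.item_id = item_id  # position of this tensor within the batch
        self.dot = dot          # Datum Object Type, used to pick the handler

ref = LazyDownloadRef("server.cj1", "batch-42", 0, 1)
# __slots__ means no per-instance __dict__, so the footprint stays tiny
print(sys.getsizeof(ref))  # a few dozen bytes, not the tensor payload
```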
**`_LazyBatchInfo`** (`nvflare/fuel/utils/fobs/decomposers/via_downloader.py`)

A named sentinel stored in `fobs_ctx[items_key]` during PASS_THROUGH
receive. Using a typed class (rather than a plain tuple) makes the
PASS_THROUGH branch unambiguous and immune to accidental type collisions
with real item dicts.
**`LazyDownloadRefDecomposer`** (`nvflare/fuel/utils/fobs/decomposers/via_downloader.py`)

A new auto-registered FOBS decomposer for `LazyDownloadRef`. When CJ
re-serializes a task containing `LazyDownloadRef` objects:
- **`decompose()`** delegates to `get_dot_handler(lazy.dot)` — the
original `ViaDownloaderDecomposer` subclass (e.g., `TensorDecomposer`,
`NumpyArrayDecomposer`). That handler's `_finalize_lazy_batch`
post-callback re-emits the *original* server datum (fqcn + ref_id + DOT)
so the subprocess knows exactly where to download from. `lazy_dot` is
appended to the encoding dict for routing on the receive side.
- **`recompose()`** uses `lazy_dot` to look up the handler and delegates
to `handler.recompose()`, which retrieves the real tensor from
`fobs_ctx[handler.items_key]` (populated by `process_datum()` when the
subprocess received the forwarded datum).
The `dot` (Datum Object Type) field on both `LazyDownloadRef` and
`_LazyBatchInfo` ensures that numpy arrays stay with
`NumpyArrayDecomposer` and PyTorch tensors stay with `TensorDecomposer`,
preserving type safety through the full pass-through hop.
**`ClientAPILauncherExecutor.initialize()`** (`client_api_launcher_executor.py`)
On startup, the executor enables PASS_THROUGH on the engine cell's FOBS
context:
```python
cell.core_cell.update_fobs_context({FOBSContextKey.PASS_THROUGH: True})
```
This single line activates the full B1 architecture for every job that
uses `launch_external_process=True` — including `llm_hf` and any recipe
that calls `ScriptRunner(launch_external_process=True)`.
---
The pipe (CellPipe) operates on already-serialized bytes. Intercepting
at the pipe level would require parsing FOBS binary format, re-writing
datum references, and re-assembling the byte stream — fragile and
tightly coupled to the wire format.
Intercepting at the FOBS decomposer level is the natural extension
point: decomposers already control exactly when and how data is
materialised. PASS_THROUGH simply adds a "don't materialise" branch to
that existing mechanism.
The subprocess must know *which* `ViaDownloaderDecomposer` subclass owns
the downloaded data so it can store it in the correct
`fobs_ctx[items_key]` and route `recompose()` correctly. The `dot`
field, set when the server originally serialized the tensor, carries
this type information through the pass-through hop without any
type-switching logic.
| Stage | Before (tensor materialised) | After (B1 pass-through) |
|-------|------------------------------|-------------------------|
| CJ receive | Full model size (e.g., 140 GB) | ~100 bytes per tensor |
| CJ forward | Creates new download tx | Re-emits original server datum |
| Subprocess receive | Downloads from CJ | Downloads directly from FL server |
---
1. **Zero tensor copy at CJ** — CJ memory footprint is independent of
model size.
2. **One network transfer** instead of two — tensors travel server →
subprocess directly.
3. **No CJ OOM risk** for large models regardless of CJ machine memory
capacity.
4. **Transparent to job authors** — no changes to job configs, training
scripts, or recipe APIs; `launch_external_process=True` automatically
activates B1.
5. **Type-safe** — `dot` propagation preserves tensor type (numpy /
pytorch) through the hop without any if/elif type switching.
---
- All existing jobs using `launch_external_process=True` automatically
benefit. No config or script changes required.
- Jobs using `launch_external_process=False` (in-process executor) are
completely unaffected — `ClientAPILauncherExecutor.initialize()` is not
called.
- For models smaller than the ViaDownloaderDecomposer streaming
threshold (2 MB per array), FOBS uses native (inline) serialization
regardless of `PASS_THROUGH` — behaviour is identical to before.
- `LazyDownloadRefDecomposer` is auto-registered via the existing
`register_folder` mechanism; no explicit registration call is needed by
any caller.
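The size cutoff mentioned above can be modelled as a simple predicate. The 2 MB figure comes from the note; the function name and path labels are illustrative, not NVFlare code:

```python
STREAM_THRESHOLD_BYTES = 2 * 1024 * 1024  # 2 MB per array, per the note above

def serialization_path(array_nbytes: int, pass_through: bool) -> str:
    """Pick the FOBS path for one array (illustrative model)."""
    if array_nbytes < STREAM_THRESHOLD_BYTES:
        return "inline"  # native serialization; PASS_THROUGH is irrelevant here
    return "pass_through" if pass_through else "download_via_cj"

assert serialization_path(1024, pass_through=True) == "inline"
assert serialization_path(8 * 1024 * 1024, pass_through=True) == "pass_through"
assert serialization_path(8 * 1024 * 1024, pass_through=False) == "download_via_cj"
```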
---
| File | Change |
|------|--------|
| `nvflare/fuel/utils/fobs/__init__.py` | Add `FOBSContextKey.PASS_THROUGH` |
| `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | Add `LazyDownloadRef`, `_LazyBatchInfo`, PASS_THROUGH branches in `process_datum()` / `recompose()`, `LazyDownloadRefDecomposer`, `_finalize_lazy_batch` post-callback |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | `initialize()` enables PASS_THROUGH on engine cell |
---
(22 tests)
| Test class | What is verified |
|------------|------------------|
| `TestLazyDownloadRef` | Construction, `__slots__`, per-item distinctness |
| `TestLazyBatchInfo` | Construction, `__slots__`, `isinstance` reliability vs plain tuple |
| `TestProcessDatumPassThrough` | PASS_THROUGH stores `_LazyBatchInfo`, never calls `_download_from_remote_cell`; normal mode calls download |
| `TestRecomposePassThrough` | Returns `LazyDownloadRef` with correct `fqcn`, `ref_id`, `item_id` from `_LazyBatchInfo` |
| `TestDecomposeWithLazyDownloadRef` | Returns REF encoding; `_finalize_lazy_batch` post-callback registered once per batch regardless of item count; emitted datum has correct fqcn/ref_id/DOT |
| `TestNoMemoryAccumulation` | `_CtxKey.OBJECTS` absent after PASS_THROUGH (no download transaction opened); `DownloadService._tx_table` unchanged; 50-cycle repeat produces no state bleed |
`tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py` (5 tests,
real TCP Cells)
| Test | What is verified |
|------|------------------|
| `test_arrays_survive_pass_through_hop` | Full round-trip: server → CJ (PASS_THROUGH) → subprocess; arrays arrive bit-exact |
| `test_cj_holds_only_lazy_refs_not_tensor_data` | After PASS_THROUGH deserialization, CJ holds only `LazyDownloadRef`, never `np.ndarray` |
| `test_cj_creates_no_download_transaction` | `DownloadService._tx_table` is unchanged during PASS_THROUGH + re-serialization |
| `test_forwarded_payload_carries_original_server_ref` | Forwarded datum contains original server `fqcn` and `ref_id`; subprocess downloads from server, not CJ |
| `test_multiple_array_roundtrip` | All arrays in an 8-array batch survive with bit-exact values |
`tests/integration_test/data/jobs/pt_large_model_pass_through/`
Full-stack integration test using `PTClientAPILauncherExecutor` with
`launch_once=True` (the pattern used by `llm_hf`):
- **Model**: `LargeNet` — 3-layer MLP with ~8 MB of float32 parameters,
well above the 2 MB ViaDownloaderDecomposer streaming threshold. This
forces the real B1 code path (streaming + PASS_THROUGH) rather than the
native inline path used by small models.
- **Client script**: Mirrors `llm_hf/client.py` structure (`while
flare.is_running()` loop, receive / train / send). Uses CPU-only
synthetic data — no dataset download required in CI.
- **Added to** `client_api.yml` as `"run pt-large-model-pass-through"`.
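The ~8 MB figure is easy to reproduce with parameter-count arithmetic. The layer widths below are hypothetical, since the PR only states "3-layer MLP with ~8 MB of float32 parameters":

```python
# Hypothetical layer widths for a 3-layer MLP that lands near 8 MB of fp32.
BYTES_PER_PARAM_FP32 = 4

def mlp_param_bytes(widths):
    """Total parameter bytes for dense layers widths[i] -> widths[i+1] (weights + biases)."""
    params = sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))
    return params * BYTES_PER_PARAM_FP32

size = mlp_param_bytes([1000, 1000, 1000, 10])  # three Linear layers
print(size / 1e6)  # 8.04804 MB, comfortably above the 2 MB streaming threshold
```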
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
14 files changed (+1387, -6 lines)