Cherry-pick [2.7] Pass-Through: Zero Tensor Copy at CJ for Large-Model Federated Training (#4210) #4289

YuanTingHsieh wants to merge 1 commit into NVIDIA:main
Conversation
Greptile Summary

This PR introduces the B1 pass-through architecture for `ClientAPILauncherExecutor`.

Key observations:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Server as FL Server
    participant CJ as CJ Process<br/>(PASS_THROUGH=True)
    participant Sub as Subprocess Agent
    Note over Server: dump_to_bytes(model)<br/>creates ObjectDownloader tx<br/>ref_id = UUID, fqcn = server
    Server->>CJ: serialized bytes<br/>(datum: {fqcn, ref_id})
    Note over CJ: process_datum() detects PASS_THROUGH<br/>→ stores _LazyBatchInfo(fqcn, ref_id)<br/>→ NO download call
    Note over CJ: recompose() returns<br/>LazyDownloadRef(fqcn, ref_id, item_id)<br/>for each tensor item
    Note over CJ: LazyDownloadRefDecomposer.decompose()<br/>→ re-emits original server datum<br/>via _finalize_lazy_batch post-CB<br/>→ NO new ObjectDownloader tx
    CJ->>Sub: forwarded bytes<br/>(datum: original server {fqcn, ref_id})
    Note over Sub: process_datum() NOT in PASS_THROUGH<br/>→ calls _download_from_remote_cell
    Sub->>Server: download tensors directly<br/>(ref_id from original datum)
    Server-->>Sub: tensor data
    Note over Sub: recompose() returns real tensors<br/>from downloaded items dict
```

Last reviewed commit: bbb9526
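The flow in the diagram can be modeled with a small, self-contained sketch. It uses no NVFlare code; the store dict, function names, and the `"server"` FQCN string are illustrative stand-ins for the real ObjectDownloader transaction, CJ pass-through branch, and subprocess download:

```python
import uuid

# Toy model of the pass-through hop: the server keeps tensor bytes in a
# store keyed by ref_id; CJ forwards only the (fqcn, ref_id) reference;
# the subprocess resolves the reference directly against the server store.
SERVER_STORE = {}

def server_serialize(tensors):
    ref_id = str(uuid.uuid4())
    SERVER_STORE[ref_id] = tensors               # tensor data stays on the server
    return {"fqcn": "server", "ref_id": ref_id}  # lightweight datum

def cj_pass_through(datum):
    # PASS_THROUGH: store only a lazy reference, never download the tensors
    return {"lazy_ref": (datum["fqcn"], datum["ref_id"])}

def subprocess_receive(forwarded):
    fqcn, ref_id = forwarded["lazy_ref"]
    return SERVER_STORE[ref_id]                  # download directly from server

datum = server_serialize({"layer0.weight": b"\x00" * 16})
forwarded = cj_pass_through(datum)
tensors = subprocess_receive(forwarded)
```

Note that `cj_pass_through` never touches `SERVER_STORE`: that is the invariant the unit tests below check via `DownloadService._tx_table`.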
```python
        self._memory_gc_rounds = memory_gc_rounds
        self._cuda_empty_cache = cuda_empty_cache
```
Undefined names cause NameError at instantiation
memory_gc_rounds and cuda_empty_cache are referenced in the __init__ body but are not declared as parameters in the method signature. This will raise a NameError every time ClientAPILauncherExecutor(...) is instantiated, completely breaking the class.
Looking at the constructor signature (lines 30-54), neither memory_gc_rounds nor cuda_empty_cache appear, and they are also not assigned anywhere earlier in __init__. The LauncherExecutor parent class does not define them either.
This looks like a cherry-pick artifact: the 2.7 PR (#4210) likely added these parameters to the constructor, but the cherry-pick to main only included the body assignments, not the signature additions. Since these instance variables (self._memory_gc_rounds, self._cuda_empty_cache) are also never referenced anywhere else in this class, the simplest fix is to remove both lines:
```diff
-        self._memory_gc_rounds = memory_gc_rounds
-        self._cuda_empty_cache = cuda_empty_cache
         self._cell_with_pass_through = None
         self._prev_pass_through = None
```
The unit test fixture in client_api_launcher_executor_test.py that calls ClientAPILauncherExecutor(pipe_id="test_pipe") will also fail with this NameError before any test logic even runs.
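If the parameters were in fact intended to land with this cherry-pick, the alternative fix is to declare them in the signature. A minimal standalone sketch (the class name, `pipe_id` handling, and the defaults `0` / `False` are illustrative assumptions, not values from the 2.7 PR):

```python
# Sketch of the alternative fix: declare the missing parameters explicitly
# so the body assignments resolve. Defaults here are illustrative only.
class ClientAPILauncherExecutorSketch:
    def __init__(self, pipe_id: str, memory_gc_rounds: int = 0, cuda_empty_cache: bool = False):
        self._pipe_id = pipe_id
        self._memory_gc_rounds = memory_gc_rounds   # no longer a NameError
        self._cuda_empty_cache = cuda_empty_cache

# Mirrors the failing test fixture's call shape:
executor = ClientAPILauncherExecutorSketch(pipe_id="test_pipe")
```

Either fix (removal or declaration) would let `ClientAPILauncherExecutor(pipe_id="test_pipe")` instantiate again; removal is simpler given the attributes are otherwise unused.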
Pull request overview
This PR backports a “pass-through” serialization mode for external-process execution so the Client Job (CJ) process forwards large model tensors without materializing them in CJ memory, enabling the subprocess agent to download tensor payloads directly from the originating cell (typically the FL server).
Changes:
- Add `FOBSContextKey.PASS_THROUGH` and implement `LazyDownloadRef` / pass-through branches in `ViaDownloaderDecomposer`, plus an auto-registered `LazyDownloadRefDecomposer`.
- Enable/restore `PASS_THROUGH` on the engine cell in the `ClientAPILauncherExecutor` lifecycle.
- Add unit tests, TCP cell E2E tests, and a full integration job exercising the streaming threshold path; update memory-management docs and `.gitignore`.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `nvflare/fuel/utils/fobs/__init__.py` | Adds `FOBSContextKey.PASS_THROUGH` flag to control pass-through behavior. |
| `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | Implements `LazyDownloadRef`, pass-through processing, and a decomposer to forward original download refs. |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Turns on PASS_THROUGH for the engine cell during executor init and restores it on finalize/error. |
| `tests/unit_test/fuel/utils/fobs/test_pass_through.py` | Unit tests for pass-through logic, lazy refs, and "no download tx created" invariants. |
| `tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py` | E2E tests using real TCP Cells to validate the server→CJ→subprocess pass-through hop. |
| `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | Tests that PASS_THROUGH is restored correctly on finalize and init failure. |
| `tests/integration_test/data/test_configs/standalone_job/client_api.yml` | Adds a standalone integration test entry to run the new pass-through job. |
| `tests/integration_test/data/jobs/pt_large_model_pass_through/meta.conf` | Defines the integration test job metadata. |
| `tests/integration_test/data/jobs/pt_large_model_pass_through/app/custom/large_model_train.py` | Client-side training script that exercises external-process pass-through. |
| `tests/integration_test/data/jobs/pt_large_model_pass_through/app/custom/large_model_net.py` | Defines an ~8MB model to force the streaming/download path. |
| `tests/integration_test/data/jobs/pt_large_model_pass_through/app/config/config_fed_server.conf` | Server config for the integration job. |
| `tests/integration_test/data/jobs/pt_large_model_pass_through/app/config/config_fed_client.conf` | Client config using `PTClientAPILauncherExecutor` with `launch_once=True`. |
| `docs/programming_guide/memory_management.rst` | Updates recommended memory settings table and adds jemalloc preload guidance. |
| `.gitignore` | Ignores memory profiler `.dat` outputs under `tests/memory_profile/`. |
```python
        self._memory_gc_rounds = memory_gc_rounds
        self._cuda_empty_cache = cuda_empty_cache
```
__init__ assigns self._memory_gc_rounds = memory_gc_rounds and self._cuda_empty_cache = cuda_empty_cache, but neither memory_gc_rounds nor cuda_empty_cache is defined in this scope (they are not parameters and not module globals). This will raise NameError when instantiating ClientAPILauncherExecutor. Add these as explicit __init__ parameters with defaults (and document/use them), or remove the assignments if they’re not intended for this branch.
```diff
-        self._memory_gc_rounds = memory_gc_rounds
-        self._cuda_empty_cache = cuda_empty_cache
```
```python
    }


def _simulate_cj_pass_through(server_bytes: bytes) -> bytes:
```
_simulate_cj_pass_through is annotated/documented as returning bytes, but it actually returns a 2-tuple (cj_result, forwarded_bytes) and all call sites unpack it as such. Update the return type annotation (and docstring) to match the actual return value.
…Training (NVIDIA#4210)

This PR introduces the **pass-through architecture** for `ClientAPILauncherExecutor`, eliminating tensor materialisation at the CJ (Client Job) process when large models are exchanged between the FL server and a subprocess agent.

In large-model federated learning (e.g., 7B–70B LLM fine-tuning), the CJ process today acts as a blind relay that fully deserializes and re-serializes every tensor it receives from the FL server before forwarding to the subprocess. For a 70B float16 model, this consumes ~140 GB of CJ memory and requires two complete network transfers. B1 pass-through removes both costs.

---

NVFlare's multi-hop execution path for `launch_external_process=True` looks like:

```
FL Server ──serialize──▶ CJ process ──re-serialize──▶ Subprocess agent
```

Each tensor in the global model is handled as follows at CJ:

1. **Server** serializes the model and creates a download transaction (tensor data lives on the server).
2. **CJ** fully *downloads* every tensor from the server into its own heap, materialising the complete model in CJ memory.
3. **CJ** re-serializes the model for the subprocess, creating a *new* download transaction — the subprocess then downloads from CJ.

For large models, this means:

- **CJ peak memory = full model size** (potentially 100s of GB).
- **Two full network transfers**: server → CJ, then CJ → subprocess.
- **CJ becomes a throughput bottleneck** and an OOM risk for any model that doesn't fit in the CJ process's memory.

This makes such workflows infeasible for any model larger than what the CJ machine can hold.

---

With `FOBSContextKey.PASS_THROUGH` enabled on CJ's cell FOBS context, the data path becomes:

```
FL Server ──stream──▶ CJ (LazyDownloadRef only, no tensor data)
                       └──forward ref──▶ Subprocess
                                           └──download──▶ FL Server
```

CJ holds **only lightweight placeholders** (< 100 bytes per tensor). The subprocess downloads each tensor directly from the FL server — CJ is never involved in the tensor data path.

**`FOBSContextKey.PASS_THROUGH`** (`nvflare/fuel/utils/fobs/__init__.py`) — A new context key that signals `ViaDownloaderDecomposer` to skip the download step and create lazy placeholders instead.

**`LazyDownloadRef`** — A small sentinel object (four fields: `fqcn`, `ref_id`, `item_id`, `dot`) created by `recompose()` in PASS_THROUGH mode. It carries the original FL server's FQCN, batch ref_id, intra-batch item ID, and Datum Object Type — everything the subprocess needs to download the tensor directly.

**`_LazyBatchInfo`** — A named sentinel stored in `fobs_ctx[items_key]` during PASS_THROUGH receive. Using a typed class (rather than a plain tuple) makes the PASS_THROUGH branch unambiguous and immune to accidental type collisions with real item dicts.

**`LazyDownloadRefDecomposer`** — A new auto-registered FOBS decomposer for `LazyDownloadRef`. When CJ re-serializes a task containing `LazyDownloadRef` objects:

- **`decompose()`** delegates to `get_dot_handler(lazy.dot)` — the original `ViaDownloaderDecomposer` subclass (e.g., `TensorDecomposer`, `NumpyArrayDecomposer`). That handler's `_finalize_lazy_batch` post-callback re-emits the *original* server datum (fqcn + ref_id + DOT) so the subprocess knows exactly where to download from. `lazy_dot` is appended to the encoding dict for routing on the receive side.
- **`recompose()`** uses `lazy_dot` to look up the handler and delegates to `handler.recompose()`, which retrieves the real tensor from `fobs_ctx[handler.items_key]` (populated by `process_datum()` when the subprocess received the forwarded datum).

The `dot` (Datum Object Type) field on both `LazyDownloadRef` and `_LazyBatchInfo` ensures that numpy arrays stay with `NumpyArrayDecomposer` and PyTorch tensors stay with `TensorDecomposer`, preserving type safety through the full pass-through hop.

**`ClientAPILauncherExecutor`** (`client_api_launcher_executor.py`) — On startup, the executor enables PASS_THROUGH on the engine cell's FOBS context:

```python
cell.core_cell.update_fobs_context({FOBSContextKey.PASS_THROUGH: True})
```

This single line activates the full B1 architecture for every job that uses `launch_external_process=True` — including `llm_hf` and any recipe that calls `ScriptRunner(launch_external_process=True)`.

---

The pipe (CellPipe) operates on already-serialized bytes. Intercepting at the pipe level would require parsing the FOBS binary format, re-writing datum references, and re-assembling the byte stream — fragile and tightly coupled to the wire format. Intercepting at the FOBS decomposer level is the natural extension point: decomposers already control exactly when and how data is materialised. PASS_THROUGH simply adds a "don't materialise" branch to that existing mechanism.

The subprocess must know *which* `ViaDownloaderDecomposer` subclass owns the downloaded data so it can store it in the correct `fobs_ctx[items_key]` and route `recompose()` correctly. The `dot` field, set when the server originally serialized the tensor, carries this type information through the pass-through hop without any type-switching logic.

| Stage | Before (tensor materialised) | After (B1 pass-through) |
|-------|------------------------------|-------------------------|
| CJ receive | Full model size (e.g., 140 GB) | ~100 bytes per tensor |
| CJ forward | Creates new download tx | Re-emits original server datum |
| Subprocess receive | Downloads from CJ | Downloads directly from FL server |

---

1. **Zero tensor copy at CJ** — CJ memory footprint is independent of model size.
2. **One network transfer** instead of two — tensors travel server → subprocess directly.
3. **No CJ OOM risk** for large models regardless of CJ machine memory capacity.
4. **Transparent to job authors** — no changes to job configs, training scripts, or recipe APIs; `launch_external_process=True` automatically activates B1.
5. **Type-safe** — `dot` propagation preserves tensor type (numpy / pytorch) through the hop without any if/elif type switching.

---

- All existing jobs using `launch_external_process=True` automatically benefit. No config or script changes required.
- Jobs using `launch_external_process=False` (in-process executor) are completely unaffected — `ClientAPILauncherExecutor.initialize()` is not called.
- For models smaller than the ViaDownloaderDecomposer streaming threshold (2 MB per array), FOBS uses native (inline) serialization regardless of `PASS_THROUGH` — behaviour is identical to before.
- `LazyDownloadRefDecomposer` is auto-registered via the existing `register_folder` mechanism; no explicit registration call is needed by any caller.

---

| File | Change |
|------|--------|
| `nvflare/fuel/utils/fobs/__init__.py` | Add `FOBSContextKey.PASS_THROUGH` |
| `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | Add `LazyDownloadRef`, `_LazyBatchInfo`, PASS_THROUGH branches in `process_datum()` / `recompose()`, `LazyDownloadRefDecomposer`, `_finalize_lazy_batch` post-callback |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | `initialize()` enables PASS_THROUGH on engine cell |

---

`tests/unit_test/fuel/utils/fobs/test_pass_through.py` (22 tests)

| Test class | What is verified |
|------------|------------------|
| `TestLazyDownloadRef` | Construction, `__slots__`, per-item distinctness |
| `TestLazyBatchInfo` | Construction, `__slots__`, `isinstance` reliability vs plain tuple |
| `TestProcessDatumPassThrough` | PASS_THROUGH stores `_LazyBatchInfo`, never calls `_download_from_remote_cell`; normal mode calls download |
| `TestRecomposePassThrough` | Returns `LazyDownloadRef` with correct `fqcn`, `ref_id`, `item_id` from `_LazyBatchInfo` |
| `TestDecomposeWithLazyDownloadRef` | Returns REF encoding; `_finalize_lazy_batch` post-CB registered once per batch regardless of item count; emitted datum has correct fqcn/ref_id/DOT |
| `TestNoMemoryAccumulation` | `_CtxKey.OBJECTS` absent after PASS_THROUGH (no download transaction opened); `DownloadService._tx_table` unchanged; 50-cycle repeat produces no state bleed |

`tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py` (5 tests, real TCP Cells)

| Test | What is verified |
|------|------------------|
| `test_arrays_survive_pass_through_hop` | Full round-trip: server → CJ (PASS_THROUGH) → subprocess, arrays arrive bit-exact |
| `test_cj_holds_only_lazy_refs_not_tensor_data` | After PASS_THROUGH deserialization, CJ holds only `LazyDownloadRef`, never `np.ndarray` |
| `test_cj_creates_no_download_transaction` | `DownloadService._tx_table` is unchanged during PASS_THROUGH + re-serialization |
| `test_forwarded_payload_carries_original_server_ref` | Forwarded datum contains original server `fqcn` and `ref_id` — subprocess downloads from server, not CJ |
| `test_multiple_array_roundtrip` | 8-array batch all survive with bit-exact values |

`tests/integration_test/data/jobs/pt_large_model_pass_through/`

Full-stack integration test using `PTClientAPILauncherExecutor` with `launch_once=True` (the pattern used by `llm_hf`):

- **Model**: `LargeNet` — 3-layer MLP with ~8 MB of float32 parameters, well above the 2 MB ViaDownloaderDecomposer streaming threshold. This forces the real B1 code path (streaming + PASS_THROUGH) rather than the native inline path used by small models.
- **Client script**: Mirrors `llm_hf/client.py` structure (`while flare.is_running()` loop, receive / train / send). Uses CPU-only synthetic data — no dataset download required in CI.
- **Added to** `client_api.yml` as `"run pt-large-model-pass-through"`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
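The per-tensor placeholder the description calls out (`fqcn`, `ref_id`, `item_id`, `dot`, held in `__slots__`) can be sketched as a tiny class. This is an illustrative model of the idea, not the shipped `LazyDownloadRef` implementation:

```python
class LazyDownloadRefSketch:
    """Illustrative stand-in for the PR's LazyDownloadRef sentinel.

    __slots__ keeps per-tensor attribute storage tiny (consistent with the
    "< 100 bytes per tensor" claim) and blocks accidental extra attributes,
    which is why the tests verify __slots__ explicitly.
    """
    __slots__ = ("fqcn", "ref_id", "item_id", "dot")

    def __init__(self, fqcn: str, ref_id: str, item_id: int, dot: int):
        self.fqcn = fqcn        # originating cell (the FL server)
        self.ref_id = ref_id    # batch download-transaction id
        self.item_id = item_id  # intra-batch item index
        self.dot = dot          # Datum Object Type (routes to the right handler)

# Hypothetical values for illustration:
ref = LazyDownloadRefSketch(fqcn="server", ref_id="uuid-1234", item_id=0, dot=1)
```

Because `__slots__` suppresses the per-instance `__dict__`, CJ's memory cost per tensor is a handful of pointers regardless of tensor size — the core of the zero-copy claim.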
9f15e77 to bbb9526
```diff
@@ -126,6 +155,23 @@ def initialize(self, fl_ctx: FLContext) -> None:
         )
         self._external_pre_init_timeout = timeout_value
```
PASS_THROUGH not restored on partial initialize() failure
If super().initialize(fl_ctx) succeeds but the subsequent ValueError is raised at line 151 (invalid EXTERNAL_PRE_INIT_TIMEOUT), _restore_pass_through is never called in that code path. The try/except on lines 138-142 only guards super().initialize(), leaving the exception raised at line 151 without a restore call.
At that point, self._cell_with_pass_through and self._prev_pass_through are already set, and PASS_THROUGH=True is live on the cell. Recovery depends entirely on the NVFlare framework guaranteeing a finalize() call even after a partial initialize() failure — which is not enforced here.
The minimal fix is to extend the guarded block to cover the full post-setup phase:
```python
try:
    super().initialize(fl_ctx)
    # Check for top-level config override for external_pre_init_timeout
    config_timeout = get_client_config_value(fl_ctx, EXTERNAL_PRE_INIT_TIMEOUT)
    if config_timeout is not None:
        timeout_value = float(config_timeout)
        if timeout_value <= 0:
            self.log_error(fl_ctx, f"Invalid EXTERNAL_PRE_INIT_TIMEOUT: {timeout_value}s (must be positive)")
            raise ValueError(f"EXTERNAL_PRE_INIT_TIMEOUT must be positive, got {timeout_value}")
        self.log_info(
            fl_ctx,
            f"Overriding external_pre_init_timeout from config: {self._external_pre_init_timeout}s -> {timeout_value}s",
        )
        self._external_pre_init_timeout = timeout_value
except Exception:
    self._restore_pass_through(fl_ctx)
    raise
```

```python
if isinstance(target, LazyDownloadRef):
    fobs_ctx = manager.fobs_ctx
    lazy_batch_key = f"{self.prefix}{_LAZY_BATCH_CTX_SUFFIX}"
    if lazy_batch_key not in fobs_ctx:
        # First LazyDownloadRef of this batch: register a post-callback
        # that will add the single shared datum (fqcn + ref_id) after all
        # items have been serialised.
        fobs_ctx[lazy_batch_key] = {"fqcn": target.fqcn, "ref_id": target.ref_id}
        manager.register_post_cb(self._finalize_lazy_batch)

    self.logger.debug(
        f"ViaDownloader: re-emitting LazyDownloadRef {target.item_id=} "
        f"{target.fqcn=} {target.ref_id=}"
    )
    return {EncKey.TYPE: EncType.REF, EncKey.DATA: target.item_id}
```
Silent data corruption if two batches with different ref_ids are serialised in the same message
The check-then-set on lazy_batch_key (lines 250-255) records only the first LazyDownloadRef's fqcn and ref_id:
```python
if lazy_batch_key not in fobs_ctx:
    fobs_ctx[lazy_batch_key] = {"fqcn": target.fqcn, "ref_id": target.ref_id}
    manager.register_post_cb(self._finalize_lazy_batch)
```

If a later LazyDownloadRef in the same serialisation call carries a different ref_id (e.g., two independent server batches merged into one forwarded payload), its item_id is emitted with a REF encoding (line 260) but the single datum added by _finalize_lazy_batch still points to the first batch's ref_id. The subprocess would then attempt to resolve every item against that one ref, silently returning wrong tensors or a download error for items from the second batch.
The current architecture guarantees a single batch per type per message (one _DecomposeCtx per decomposer type), so this is not a live bug today. However, adding an assertion protects against future regressions and documents the invariant explicitly:
```python
if lazy_batch_key not in fobs_ctx:
    fobs_ctx[lazy_batch_key] = {"fqcn": target.fqcn, "ref_id": target.ref_id}
    manager.register_post_cb(self._finalize_lazy_batch)
else:
    # All LazyDownloadRefs in one message must belong to the same server batch.
    existing = fobs_ctx[lazy_batch_key]
    assert existing["ref_id"] == target.ref_id, (
        f"LazyDownloadRef ref_id mismatch: expected {existing['ref_id']!r}, "
        f"got {target.ref_id!r}. Multiple server batches in one message are not supported."
    )
```
This PR depends on the cherry-pick of #4211.