[2.7] Fix FOBS deserialization RCE via type_name whitelist (NVBug #5968367)#4294
Open
nvidianz wants to merge 190 commits into NVIDIA:main from
Conversation
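The whitelist approach named in the PR title can be sketched roughly as follows; the allowed-type set, function name, and error message are illustrative assumptions, not the actual NVFLARE FOBS code:

```python
# Hypothetical type_name whitelist guard: reject any type name that is not
# explicitly allowed, instead of dynamically importing attacker-controlled
# names during deserialization (the RCE vector a whitelist closes).
ALLOWED_TYPE_NAMES = {"builtins.dict", "builtins.list", "numpy.ndarray"}

def check_type_name(type_name: str) -> str:
    # Fail closed: anything outside the whitelist is refused before any
    # import or object construction happens.
    if type_name not in ALLOWED_TYPE_NAMES:
        raise ValueError(f"deserialization of type '{type_name}' is not allowed")
    return type_name
```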
NVIDIA#3896) Fixes # . ### Description Apply changes from NVIDIA#3878 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Allow only one project admin Disable patching role by creator ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [x] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
### Description Cherry pick fixes from main NVIDIA#3882 NVIDIA#3904 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>
### Description Changed provision to use http as default scheme. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
### Description Backported Downloader rework and bug fixes from main. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
### Description Cherry-pick of PR NVIDIA#3898 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [x] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description Changed all info level log to debug in ViaDownloader ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
…IDIA#3911) Convert PSI example from JSON configs to use recipe - Converts PSI example from JSON configs to use recipe - Adds DhPSIRecipe <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # . ### Description Updated CIFAR-10 PyTorch examples to use the Recipe API and Client API for running federated learning jobs, including easy setup of experiment tracking and homomorphic encryption with MLFlow in real-world setting. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Signed-off-by: Holger Roth <hroth@nvidia.com>
Fixes # . ### Description - Restructure CIFAR-10 PT example to separate the client training scripts for each algorithm. - Add additional documentation. - Rerun experiments with fixed model initialization seed. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Signed-off-by: Holger Roth <hroth@nvidia.com>
Fixes # . ### Description Refactor CIFAR-10 TensorFlow example to use recipe-style jobs. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Signed-off-by: Holger Roth <hroth@nvidia.com>
…BAnalyticsReceiver (NVIDIA#3918) Cherry pick: consolidate BaseFedJob and fedavg.py, and fix import error for TBAnalyticsReceiver. ### Description Cherry pick: consolidate BaseFedJob and fedavg.py, and fix import error for TBAnalyticsReceiver. Dependencies in the second one made it complicated to do separately, so cherry-picking them together. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Cherry-pick NVIDIA#3906 and NVIDIA#3915 and NVIDIA#3921 ### Description Cherry-pick NVIDIA#3906 and NVIDIA#3915 and NVIDIA#3921 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: nvkevlu <55759229+nvkevlu@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…#3929) Preflight check tool was hardcoded to test against GRPC communication scheme. We have added more schemes and now our default is HTTP so we should change accordingly - Update preflight check - Remove overseer test - Update preflight check tests <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # . ### Description Add missing `load_cifar10_with_retry` function. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description Apply same updates for KM example to 2.7 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description The documentation now correctly positions this as a testing/development example that demonstrates the limitations of running server and client together in a time-limited SLURM job, while hinting that production deployments would require a different architecture (standalone server with resident clients). ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # . ### Description Add missing tensorboard requirements to hello-world examples. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Cherry pick cross site eval NVIDIA#3923 ### Description Cherry pick cross site eval NVIDIA#3923 . ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description Improve error message in client API if `flare.init()` hasn't been called. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
### Description 1. Removed references to wss as a scheme/protocol. It's only used internally by aiohttp library. 2. Increased the downloader timeout in decomposers to 60s. It was causing download failures for large models. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Cherry pick NVIDIA#3930 NVIDIA#3945 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
These 2 examples are the same, consolidate them. ### Description These 2 examples are the same, consolidate them. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # . ### Description - Expose launch_once option in ScriptRunner to be used by recipes. - Adds new unit tests. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [x] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Cherry-picked from commit 7e87abc (PR NVIDIA#3942 ) ### Description Simplifies the cross-site evaluation (CSE) API by removing unnecessary parameters, adding framework auto-detection, and auto-configuring validators. This change dramatically reduces the cognitive load for users adding CSE to their federated learning workflows. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description Fix the central TensorBoard logging path to be consistent with federated examples ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description Convert bionemo examples to use FedAvgRecipe ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [x] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Signed-off-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # . ### Description Apply LLM updates to 2.7 branch ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
Fixes # . ### Description GNN updates to 2.7 ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
…IA#3970) Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
## Issues addressed

This PR addresses the reported recipe bug list:

- `BUG-8` (Critical): `CyclicRecipe` `model=None + initial_ckpt` created phantom persistor reference and runtime failure.
- `BUG-1` (Critical): `CyclicRecipe` dict model config caused `TypeError`.
- `BUG-7` (Critical): base `CyclicRecipe` silently ignored `initial_ckpt`.
- `BUG-9` (High): `FedAvgRecipe` `per_site_config["server"]` caused invalid target handling/crash.
- `BUG-4` (High): `Recipe.execute/export` mutated job state and made recipe reuse unsafe.
- `BUG-2` (Medium): `FedAvgRecipe` per-site `or` fallback ignored falsy overrides.
- `BUG-3` (Medium): `SimEnv` with `num_clients=0` and client list resolved `num_threads=0`.
- `BUG-5` (Low): base `FedAvgRecipe` accepted `initial_ckpt` + `model=None` but produced incomplete model source.
- `BUG-6` (Low): `add_cross_site_evaluation` idempotency guard depended only on transient `_cse_added` attribute.

## Changes

- Hardened base `CyclicRecipe` model/persistor setup:
  - fail fast when a valid persistor cannot be configured;
  - support dict model config for supported frameworks;
  - apply/validate `initial_ckpt` for wrapper and framework-specific paths.
- Hardened base `FedAvgRecipe`:
  - validate and reject reserved `per_site_config` targets (`server`, `@ALL`);
  - preserve falsy per-site override values via explicit `is not None` fallback;
  - fail fast when no persistor and no model params are available.
- Added shared framework persistor setup utility in recipe utils and wired both Cyclic/FedAvg paths to it.
- Made `Recipe.execute()` and `Recipe.export()` reusable-safe by snapshot/restore of temporary execution params.
- Fixed `SimEnv` default resolution for `num_clients`/`num_threads` when an explicit client list is provided.
- Hardened CSE idempotency with workflow-based detection in addition to the `_cse_added` fast path.
- Updated impacted examples and exports:
  - `hello-cyclic` now passes `min_clients` explicitly;
  - `hello-numpy` uses the transfer-type API;
  - recipe module exports include `add_cross_site_evaluation`.
- Expanded recipe unit tests with targeted regressions for all of the above behaviors.

## Reason for these changes

These updates enforce clear model-source contracts and remove silent/implicit fallback behavior that could generate invalid jobs or runtime crashes. The goal is deterministic recipe behavior, safe object reuse, and predictable per-site/CSE configuration semantics in 2.7.

## Affected files

Core implementation:

- `nvflare/recipe/cyclic.py`
- `nvflare/recipe/fedavg.py`
- `nvflare/recipe/spec.py`
- `nvflare/recipe/sim_env.py`
- `nvflare/recipe/utils.py`
- `nvflare/recipe/poc_env.py`
- `nvflare/recipe/__init__.py`

Framework wrappers / examples:

- `nvflare/app_opt/pt/recipes/cyclic.py`
- `nvflare/app_opt/tf/recipes/cyclic.py`
- `examples/hello-world/hello-cyclic/job.py`
- `examples/hello-world/hello-numpy/job.py`
- `examples/tutorials/job_recipe.ipynb`

Tests:

- `tests/unit_test/recipe/cyclic_recipe_test.py`
- `tests/unit_test/recipe/fedavg_recipe_test.py`
- `tests/unit_test/recipe/spec_test.py`
- `tests/unit_test/recipe/sim_env_test.py`
- `tests/unit_test/recipe/utils_test.py`

## Test strategy

- Targeted regression suites:
  - `pytest -q tests/unit_test/recipe/cyclic_recipe_test.py tests/unit_test/recipe/fedavg_recipe_test.py tests/unit_test/recipe/spec_test.py`
  - `pytest -q tests/unit_test/recipe/sim_env_test.py tests/unit_test/recipe/utils_test.py`
- Full recipe suite:
  - `pytest -q tests/unit_test/recipe`
- Result: recipe suite passed (`183 passed, 13 skipped`).

Made with [Cursor](https://cursor.com)
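The `BUG-2` fix above boils down to replacing an `or` fallback with an explicit `is not None` check when resolving per-site overrides. A minimal sketch with illustrative names (`resolve_site_value` is not the actual `nvflare.recipe` helper):

```python
def resolve_site_value(per_site_config: dict, site: str, key: str, default):
    """Return a per-site override if one was set, else the default.

    The buggy form `per_site_config.get(site, {}).get(key) or default`
    silently discards falsy overrides such as 0, 0.0, or "".
    """
    value = per_site_config.get(site, {}).get(key)
    # Explicit None check preserves intentional falsy values.
    return value if value is not None else default
```

With the `or` fallback, a site configured with `batch_size=0` would silently get the default; the explicit check keeps the override.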
…e enhancements (NVIDIA#4225)

## Summary

This PR addresses several related bugs and improvements in the Client API subprocess path, swarm learning, and the recipe interface. Addresses FLARE-2743 (bug/5921094) and (bug/5921708).

---

## Issues

### 1. LazyDownloadRef dtype error with pass-through streaming

When pass-through tensor streaming is enabled on the CJ, param converters (e.g. `NumpyToPTParamsConverter`) were running on the CJ *before* lazy references were materialized. This caused `"Could not infer dtype of LazyDownloadRef"` warnings and potential data corruption on large-model jobs using pass-through transfer.

### 2. Subprocess Client API had no converter wiring

The `ExProcessClientAPI` (subprocess path) created a `FlareAgentWithFLModel` with no converters attached. Format conversion was happening on the CJ side in `LauncherExecutor.execute()`, which triggered issue #1.

### 3. Missing `learn_task_check_interval` in `CCWFJob.add_swarm()`

`SwarmClientConfig.learn_task_check_interval` was not forwarded to the swarm controller, so any customization of that parameter was silently ignored.

### 4. `extract_participants()` did not support the `{"targets": [...]}` dict form

The deploy map validation only handled list-form and `{"sites": [...]}` dict-form entries. The `{"targets": [...]}` form (used by some job generators) raised an unhandled error.

### 5. `SimpleSwarmLearningRecipe` missing parameters

`min_clients`, `launch_external_process`, and `command` could not be set through the recipe interface, forcing users to construct jobs manually.

### 6. 3rd-party integration docs referenced deprecated `FlareAgentWithCellPipe`

The class has been superseded by constructing a `CellPipe` and `FlareAgent` directly, or using the higher-level Client API (`nvflare.client`).

---

## Issues fixed from review comments

### 7. `task_name` read from absent shareable header in `FlareAgentWithFLModel.shareable_to_task_data()`

The task name was read via `shareable.get_header(FLContextKey.TASK_NAME, "")`, but this header is never populated in the subprocess path — the task name is carried as the pipe `Message.topic` and stored in `self.current_task.task_name` by `FlareAgent._on_task()`. An absent header caused `task_name` to fall back to `""`, making `ParamsConverter.process()` silently skip conversion. Fixed to use `self.current_task.task_name if self.current_task else ""`, consistent with `task_result_to_shareable()`.

### 8. Double conversion when `from_nvflare_converter_id` / `to_nvflare_converter_id` set on launcher executor

`LauncherExecutor.execute()` still applied CJ-side converters if users explicitly set `from_nvflare_converter_id` / `to_nvflare_converter_id`, while the subprocess was also applying its own default converters — resulting in double conversion. Fixed by removing the converter calls from `LauncherExecutor.execute()` entirely. Since `LauncherExecutor` is only ever subclassed by `ClientAPILauncherExecutor` (subprocess path), this has no effect on the in-process path.

---

## Approach

**Converter logic removed from CJ launcher executors; moved to the subprocess agent side.**

Previously, `PTClientAPILauncherExecutor` and `TFClientAPILauncherExecutor` set up format converters in their `initialize()` methods, and `LauncherExecutor.execute()` applied them on the CJ before dispatching to the subprocess. This was the root cause of issue #1 — the CJ was touching tensor data (triggering dtype inference) before `LazyDownloadRef` placeholders were materialized in the subprocess. The converter setup blocks have been **completely removed** from both launcher executors. Converters for the subprocess path now live entirely inside the subprocess, wired into `FlareAgentWithFLModel` and executing after full payload materialization.

Key changes:

- `FlareAgentWithFLModel` gains optional `from_nvflare_converter` / `to_nvflare_converter` params. A lightweight `_ConverterContext` stub (duck-typing `FLContext.get_prop/set_prop`) allows `ParamsConverter.process()` to work without a real `FLContext` in the subprocess.
- `SERVER_EXPECTED_FORMAT` is added to `ConfigKey` and propagated in the subprocess launch config so the agent knows what format the server expects.
- A shared factory `converter_utils.create_default_params_converters()` is introduced, replacing four separate copies of the same format-conditional logic across PT and TF executors.
- In-process executors (`PTInProcessClientAPIExecutor`, `TFInProcessClientAPIExecutor`) are **unchanged in behavior** — they still apply converters on the CJ side, which is correct since there is no separate subprocess.

---

## Reason for this approach over alternatives

An earlier iteration kept the converter setup in `initialize()` but added a `ClientAPILauncherExecutor.execute()` override that nulled the converters out before calling `super().execute()` and restored them after. This was removed because it left converters alive on the object but always bypassed — confusing dead code. The current approach is explicit and honest: launcher executors do not set converters at all, and the subprocess handles its own conversion autonomously.

---

## Affected files

| Area | Files |
|------|-------|
| Core bug fixes | `ccwf_job.py`, `fed_utils.py` |
| Converter framework | `converter_utils.py` (new), `config.py`, `flare_agent_with_fl_model.py`, `ex_process/api.py`, `client_api_launcher_executor.py` (base) |
| PT launcher executor — converter removed | `pt/client_api_launcher_executor.py` |
| PT in-process executor — refactored to shared factory | `pt/in_process_client_api_executor.py` |
| TF launcher executor — converter removed + `initialize()` deleted | `tf/client_api_launcher_executor.py` |
| TF in-process executor — refactored to shared factory | `tf/in_process_client_api_executor.py` |
| Recipe | `pt/recipes/swarm.py` |
| Docs | `3rd_party_integration.rst`, `3rd_party_trainer.py` |
| Tests | 4 new test files, 4 modified test files, 2 integration YAMLs |

---

## Trade-offs and future work

- **TF/Keras converter path is not directly unit-tested** due to TF/PyTorch dependency conflicts in a shared environment. The `create_default_params_converters` Keras branch is structurally identical to the PT branch (which is fully tested). A dedicated TF-only test environment could add coverage in a future pass.
- **`_ConverterContext` is a duck-type shim.** If more `ParamsConverter` use cases arise in subprocess contexts, a formal lightweight interface (separate from `FLContext`) would be worth extracting.
- **Converter plugin (`from_nvflare_converter_id`)** for the subprocess path is currently not wired — only the auto-created default converters are passed to the agent. Explicit user-provided converter IDs are still only supported for the in-process path. This can be addressed in a follow-up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### Description Cherry pick of PR NVIDIA#4229. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [x] New tests added to cover the changes. - [x] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is:

1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job

PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort.

### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none).

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance.

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`.

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type.
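The strict-mode change in Fix 2 can be sketched as a small loop — a hedged illustration with a hypothetical name-keyed reply dict, not the actual `admin.py` signature:

```python
from typing import Dict, List, Optional

# Hedged sketch of Fix 2: a timeout (reply=None) is recorded and returned,
# while an explicit error reply (non-OK return code) still raises.
def check_client_replies(replies: Dict[str, Optional[dict]]) -> List[str]:
    timed_out = []
    for name, reply in replies.items():   # name-keyed lookup, not positional zip()
        if reply is None:
            timed_out.append(name)        # timeout: record instead of raising
        elif reply.get("return_code") != "OK":
            # explicit errors (non-OK code or error body) still raise
            raise RuntimeError(f"client {name} returned error: {reply}")
    return timed_out
```

The caller then decides what to do with the returned names, which is what enables the selective exclusion in Fix 3.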
---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging | Disconnects stalled relay connections |

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.
## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
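The `min_sites` re-evaluation from Fix 3 reduces to a small decision rule. A hedged sketch with illustrative names (not the actual `_start_run()` code):

```python
from typing import List

# Hedged sketch of Fix 3: stragglers are excluded as long as the remaining
# active clients still satisfy min_sites; only below that threshold does
# the job abort.
def decide_start_action(num_clients: int, timed_out: List[str], min_sites: int) -> str:
    if not timed_out:
        return "proceed"
    remaining = num_clients - len(timed_out)
    if remaining >= min_sites:
        return "proceed_without_stragglers"  # log a warning, exclude stragglers
    return "abort"                           # below tolerance: abort the job
```

For the 144-client Frontier scenario with `min_sites=140`, two stragglers are tolerated but five are not.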
…VIDIA#4231)

## Problem Statement

Three related issues were discovered in the NVFlare client-side stack:

### Issue 1: RSS Memory Growth Across FL Rounds

After each federated learning round, the client process RSS grew monotonically. Model parameters — both the received global model and the locally trained model — were kept alive long after `flare.send()` completed. The reference chain kept them live until the next `receive()` overwrote `self.fl_model`, meaning one full round's worth of tensors was always double-buffered in memory.

### Issue 2: Race Condition in `PipeHandler.send_to_peer()` — Premature `delete_msg_root`

The original `send_to_peer()` had:

```python
finally:
    # the msg_id is also used as msg_root_id
    delete_msg_root(msg.msg_id)
```

This was a race condition. `CellPipe.send_request()` returns as soon as the peer acknowledges receipt via `make_reply(ReturnCode.OK)` in `_receive_message` — which happens **before** the peer finishes downloading model params. The `finally` block fired `delete_msg_root(msg.msg_id)` and destroyed the active download transaction while the CJ was still pulling tensors from the subprocess, causing intermittent download failures.

### Issue 3: Silent Logging Gap in Subprocess Mode

When `launch_external_process=True`, `SubprocessLauncher` spawns `python3 -u client.py`. This new process starts with Python's unconfigured logging system — `lastResort` only emits `WARNING+`, so every `logger.info()` inside the subprocess is silently dropped. All INFO-level diagnostics (NVFlare internals, user training progress) were invisible in subprocess mode.

---

## Approach and Rationale

### Fix 1: Release Model Params Eagerly After `send()` — commit `39eb631d`

**Files:** `nvflare/client/in_process/api.py`, `nvflare/client/ex_process/api.py`, `nvflare/client/model_registry.py`

After `flare.send()`, both the sent model's params and the received global model's params are dead weight — already serialized and transmitted. We now explicitly null them immediately when `clear_cache=True` (the default):

- **`InProcessClientAPI`**: Nulls `model.params`, `model.optimizer_params`, `self.fl_model.params`, `self.fl_model.optimizer_params` before clearing `self.fl_model`.
- **`ExProcessClientAPI`**: Calls `model_registry.release_params(model)` which nulls both the sent and received model's params.

**Why not `gc.collect()`?** CPython's reference counting reclaims objects as soon as their refcount drops to zero — `gc.collect()` only helps with reference *cycles*. PyTorch tensors don't form cycles. Once the params dicts are nulled, tensor memory is freed immediately. Confirmed experimentally: RSS stays flat without `gc.collect()`.

**Why null params specifically?** Params (the large tensor dicts) are the dominant memory consumer. Nulling them frees the tensors while keeping the `FLModel` shell (metadata, metrics, round) alive for any code that reads those fields after `send()`.

### Fix 2: Remove Premature `delete_msg_root` from `PipeHandler.send_to_peer()` — commit `39eb631d`

**File:** `nvflare/fuel/utils/pipe/pipe_handler.py`

Removed the `finally: delete_msg_root(msg.msg_id)` block and its import. The `msg_id` doubles as the `msg_root_id` used by `ViaDownloaderDecomposer` to track download transaction lifecycle. Calling `delete_msg_root` in a `finally` block races against the peer's download: the ack is immediate but the actual tensor download is asynchronous.

Download transaction cleanup now correctly happens in `wf_comm_server.py` and `task_controller.py`, which call `delete_msg_root` only after the task is fully processed server-side.

### Fix 3: Configure Subprocess Python Logging on `flare.init()` — commit `bc6d87847`

**File:** `nvflare/client/ex_process/api.py` — added `_configure_subprocess_logging()`

Immediately after loading `client_api_config.json`, `init()` calls `_configure_subprocess_logging(client_config)`, which:

1. Reads `workspace_dir` from `client_api_config.json` under `TASK_EXCHANGE.pipe.ARG.workspace_dir`
2. Loads `{workspace_dir}/local/log_config.json` (the site's standard NVFlare log config)
3. Calls `apply_log_config(dict_config, workspace_dir)` to invoke `logging.config.dictConfig()`

This wires up `consoleHandler -> sys.stdout` (captured by `SubprocessLauncher`), `logFileHandler -> log.txt`, and `jsonFileHandler -> log.json` — exactly as the main NVFlare process configures them.

The entire method is wrapped in `try/except Exception: pass` — a logging setup failure must never crash the training script. Called before the `try` block in `init()` so that even errors during `flare_agent.start()` are properly logged.

### Fix 4: RSS Memory Profiling Utility — commit `bc6d87847`

**New file:** `nvflare/fuel/utils/mem_utils.py`

Zero-cost diagnostic to measure client-side RSS per round:

- Gated by `NVFLARE_CLIENT_MEMORY_PROFILE=1` (off by default). Name is client-scoped so server/relay processes are unaffected.
- `_ENABLED` evaluated once at import time — single boolean check per call, zero overhead when disabled.
- Uses `logger.info()`: after Fix 3, works correctly in both modes. RSS lines land in `log.txt`/`log.json` alongside all other NVFlare diagnostics.
- `psutil` imported lazily; entire call is `try/except`-wrapped so missing install is a silent no-op.
- Instrumented at `receive()` and `send()` in both `InProcessClientAPI` and `ExProcessClientAPI`.
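The eager release in Fix 1 relies purely on CPython reference counting. A minimal sketch with a stand-in `FLModel` (hypothetical shape — the real class lives in `nvflare.app_common` and its `release_params()` signature may differ):

```python
# Hedged sketch of Fix 1: nulling the params dicts drops the last references
# to the tensors, so CPython's reference counting frees them immediately.
# No gc.collect() is needed — plain tensor dicts form no reference cycles.

class FLModel:
    """Minimal stand-in for NVFlare's FLModel."""

    def __init__(self, params, optimizer_params=None, meta=None):
        self.params = params                    # large tensor dict
        self.optimizer_params = optimizer_params
        self.meta = meta or {}                  # small metadata, kept alive

def release_params(model: FLModel) -> None:
    # Free the dominant memory consumers but keep the FLModel shell
    # (metadata, metrics, round) readable after send().
    model.params = None
    model.optimizer_params = None
```

The shell-vs-payload split is the key design choice: callers that read `meta` or metrics after `send()` keep working, while the tensors are reclaimed at once.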
---

## Files Changed

| File | Change |
|------|--------|
| `nvflare/client/api_spec.py` | `_maybe_cleanup_memory()` base implementation and memory management attributes |
| `nvflare/client/in_process/api.py` | Eager param release in `send()`, `log_rss()` instrumentation, `configure_memory_management()` |
| `nvflare/client/ex_process/api.py` | Eager param release, `_configure_subprocess_logging()`, `log_rss()` instrumentation, memory GC config |
| `nvflare/client/model_registry.py` | Added `release_params()` method |
| `nvflare/fuel/utils/pipe/pipe_handler.py` | Removed racy `finally: delete_msg_root(msg.msg_id)` and its import |
| `nvflare/fuel/utils/mem_utils.py` | **New file** — `log_rss(tag)` RSS profiling utility |
| `docs/client_api.rst` | Clarified param lifecycle and `clear_cache` semantics |

---

## Verification

Tested with `hello-pt` (CIFAR-10, 2 clients, 2 rounds) using `NVFLARE_CLIENT_MEMORY_PROFILE=1`:

**In-process mode:**

```
round=0 after_receive: 629.2 MB
round=0 after_send:    270.9 MB  <- params freed immediately
round=1 after_receive: 283.4 MB
round=1 after_send:    448.5 MB  <- flat, no accumulation
```

**Subprocess mode (`launch_external_process=True`):**

```
round=0 after_receive: 623.5 MB
round=0 after_send:    129.4 MB  <- params freed immediately
round=1 after_receive: 136.1 MB
round=1 after_send:    121.5 MB  <- flat, no accumulation
```

RSS stabilizes after round 0 in both modes. Subprocess logging confirmed: `logger.info()` lines now appear in `site-N/log.txt` (written directly by the subprocess) and captured via `SubprocessLauncher`.
## Test plan

- [x] In-process RSS profiling with `NVFLARE_CLIENT_MEMORY_PROFILE=1` — `[RSS]` lines in `site-N/log.txt`
- [x] Subprocess RSS profiling with `launch_external_process=True` — `[RSS]` lines in both subprocess log and `SubprocessLauncher` capture
- [x] Subprocess logging gap fixed — `logger.info()` calls now visible (previously silently dropped)
- [x] Memory flat across rounds in both modes (no monotonic growth)
- [x] `flare.send(clear_cache=False)` path untouched — params only released when `clear_cache=True`
- [x] `delete_msg_root` race removed — download transactions no longer prematurely destroyed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
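The profiling utility from Fix 4 follows a common env-gated pattern. A hedged sketch of how `mem_utils.log_rss(tag)` could be structured per the description above (the actual implementation may differ):

```python
import logging
import os

logger = logging.getLogger(__name__)

# Evaluated once at import time: a single boolean check per call,
# zero overhead when profiling is disabled (the default).
_ENABLED = os.environ.get("NVFLARE_CLIENT_MEMORY_PROFILE") == "1"

def log_rss(tag: str) -> None:
    """Log the current process RSS under `tag` when profiling is enabled."""
    if not _ENABLED:
        return
    try:
        import psutil  # lazy import: a missing install is a silent no-op
        rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
        logger.info("[RSS] %s: %.1f MB", tag, rss_mb)
    except Exception:
        pass  # a diagnostics failure must never affect training
```

Because it uses `logger.info()`, the output only becomes visible in subprocess mode once Fix 3 has configured logging.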
…ent, hierarchical startup stability [skip ci] (NVIDIA#4218)

## Merge Dependency

> ⚠️ **Depends on NVIDIA#4209** — The *Hierarchical FL Startup Stability* section documents changes introduced by PR NVIDIA#4209 (currently open). **Merge PR NVIDIA#4209 into `2.7` before merging this PR.** All other sections cover already-merged PRs.

---

## Summary

This PR updates `docs/release_notes/flare_272.rst` to reflect all major changes merged into the 2.7.x line after the initial 2.7.2 draft, covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor Copy at CJ (PR NVIDIA#4210) and client-side memory lifecycle management (PR NVIDIA#4211)
- **F3 Streaming Reliability and Performance** — new section covering HOL stall mitigation (PR NVIDIA#4206), stream pool starvation fix (PR NVIDIA#4171/NVIDIA#4172), streaming download retry (PR NVIDIA#4167), RxTask self-deadlock fix (PR NVIDIA#4204), and lock contention reduction (PR NVIDIA#4174)
- **Hierarchical FL Startup Stability** — new section covering deployment timeout classification, startup grace period, selective client exclusion, and metadata hardening (PR NVIDIA#4209 — pending merge), with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly. A source-level RST comment has been added above the Hierarchical FL section in the file to alert future maintainers to the merge dependency.
## Merged PRs Documented

| PR | Area | Status |
|---|---|---|
| NVIDIA#4171 / NVIDIA#4172 | Stream pool starvation fix | Merged |
| NVIDIA#4174 | Lock contention reduction | Merged |
| NVIDIA#4167 | Streaming download retry | Merged |
| NVIDIA#4204 | RxTask self-deadlock fix | Merged |
| NVIDIA#4206 | HOL stall mitigation | Merged |
| NVIDIA#4210 | Zero tensor copy at CJ | Merged |
| NVIDIA#4211 | Client-side memory management | Merged |
| NVIDIA#4209 | Hierarchical FL startup stability | **Open — merge before this PR** |

## Changes

### Memory Management (restructured)

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds `LazyDownloadRef` placeholders instead of materializing full tensors, eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` / jemalloc purge / `torch.cuda.empty_cache()` injected after every `flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.

### F3 Streaming Reliability and Performance (new section)

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress watchdog, and stall detection/recovery. Includes recommended environment variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a dedicated `callback_thread_pool`, keeping stream workers free for concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3 attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock` released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside `self.lock`; compare-and-store for cache write. Reduces model-download latency under high client concurrency.
### Hierarchical FL Startup Stability (new section — pending PR NVIDIA#4209)

- **Deployment Timeout as Failure**: `reply=None` correctly counted against `min_sites`; timed-out clients excluded before `start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — a client must be observed once before its absence triggers dead-job notification. Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json` snippets for HPC (Frontier/ORNL) scale.

## Test plan

- [ ] Sphinx build (`make html`) passes without RST warnings on the updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent throughout the file
- [ ] Confirm PR NVIDIA#4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes # .

### Description

## FedOpt: default config_type for optimizer/scheduler, support class_path

### Why

FedOptRecipe failed at config load when `optimizer_args` / `lr_scheduler_args` omitted `config_type`: the component builder tried to instantiate the optimizer (e.g. `torch.optim.SGD`) before the model existed, so `params` were missing. Users had to remember to set `config_type: "dict"`. Optimizer/scheduler configs also didn't support `class_path`, unlike model_config.

### What

- **Recipe utils:** Added `ensure_config_type_dict()` — sets `config_type: "dict"` when missing and, if only `class_path` is set, copies it to `path` so the config layer keeps working.
- **FedOptRecipe (PT & TF):** Run optimizer_args and lr_scheduler_args through `ensure_config_type_dict()` so `config_type` is no longer required; accept both `path` and `class_path`.
- **PT shareable generator:** Resolve display name from `path` or `class_path`.
- **Examples/docs:** Use `class_path` for optimizer/scheduler; remove explicit `config_type` from FedOpt examples; fix cifar10_fedopt README to use FedOptRecipe and the correct API.
- **Tests:** Cover both `path` and `class_path` for PT and TF; skip PT tests when PyTorch is missing and TF tests when TensorFlow is missing or fails to load, so either venv can run the suite without failures.

### Types of changes

<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [x] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [x] In-line docstrings updated.
- [x] Documentation updated.
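The `ensure_config_type_dict()` behaviour described above can be sketched roughly as follows (a hedged illustration — the actual recipe-utils implementation may differ in detail):

```python
# Hedged sketch: default config_type to "dict" so the optimizer is not
# instantiated before the model exists, and accept class_path as an alias
# for path so the existing config layer keeps working.
def ensure_config_type_dict(args: dict) -> dict:
    args = dict(args)  # don't mutate the caller's config
    args.setdefault("config_type", "dict")
    if "path" not in args and "class_path" in args:
        args["path"] = args["class_path"]
    return args
```

A FedOpt user can then write `optimizer_args={"class_path": "torch.optim.SGD", "args": {"lr": 0.1}}` without remembering `config_type`.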
…NVIDIA#4239)

Fixes # .

### Description

Cherry-pick NVIDIA#4237. Enables consistent use of class_path for all objects configured through the job API.

- `ComponentBuilder.get_class_path()`: use 'path' or 'class_path' (path takes precedence)
- `ComponentBuilder.is_class_config()`: treat 'class_path' like 'path'
- wfconf `get_class_path()`: same behavior for workflow/config parsing
- Add tests: class_path-only config, path overrides class_path when both present

### Types of changes

<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
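The lookup rule is small enough to sketch — an illustrative reduction of the described precedence, not the exact `ComponentBuilder` code:

```python
# Hedged sketch of the precedence rule: 'path' wins over 'class_path';
# either key marks a class-style component config.
def get_class_path(config: dict) -> str:
    path = config.get("path") or config.get("class_path")
    if not path:
        raise ValueError("config has neither 'path' nor 'class_path'")
    return path

def is_class_config(config: dict) -> bool:
    return "path" in config or "class_path" in config
```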
Fixes # .

### Description

Additional enhancements from NVIDIA#4240

### Types of changes

<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [x] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [x] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…ders (NVIDIA#4243)

Fixes # .

### Description

Cherry-pick NVIDIA#4241

## Summary

- replace the line-based license grep in runtest.sh with ci/check_license_header.py to validate the full Apache 2.0 header (year-flexible)
- keep the intended exception for public-domain files and keep existing exclusions for protos and modeling_roberta.py
- normalize inconsistent license headers in affected files (missing full block and minor text typos)

## Testing

- python3 ci/check_license_header.py nvflare examples tests integration research

### Types of changes

<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
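A year-flexible header check of the kind described can be sketched in a few lines — purely illustrative; the real validator is `ci/check_license_header.py` and may differ in detail:

```python
import re

# Hedged sketch: accept any copyright year (or year range) and require the
# Apache 2.0 license line, both near the top of the file.
def has_license_header(text: str) -> bool:
    head = text[:2000]  # the header must appear near the top
    has_copyright = re.search(r"Copyright \(c\) \d{4}(-\d{4})?, NVIDIA CORPORATION", head)
    has_apache = "Apache License, Version 2.0" in head
    return bool(has_copyright) and has_apache
```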
…HROUGH, download gating, MSG_ROOT_TTL, LazyRef local-aggr, logging, min_clients, diagnostics (NVIDIA#4247)

## Summary

Comprehensive fix series for large-model federated training with subprocess clients. Addresses memory accumulation root causes, correctness crashes, timeout misconfiguration, and operational diagnostics across 18 numbered fixes.

**Data-flow context** (subprocess-mode per round):

```
Forward path: FL Server / Aggregator → [scatter] → Trainer CJ → [task delivery] → Subprocess
Reverse path: Subprocess → [result] → Trainer CJ → [relay] → FL Server
```

---

## Problems & Approaches

### Fix 1 (Critical) — Serialize once, cache result on `Message` object

**Problem**: `CellPipe.send()` re-serialized the payload on every retry, creating a new `ArrayDownloadable` download transaction per attempt. With unbounded retries on a 5 GiB model, each failed send accumulated a fresh ~5 GiB transaction, leading to OOM.

**Approach**: Cache the serialized `CellMessage` on the `Message` object (e.g. `msg._cached_cell_msg`) on the first send. Retries reuse the same bytes and the same download transaction — no re-serialization, no transaction accumulation.

**Files**: `nvflare/fuel/utils/pipe/cell_pipe.py`

### Fix 2 — Explicit cache teardown after retry loop

**Problem**: The cached `CellMessage` held references to encoded bytes and the FOBS download transaction long after the retry loop exited, delaying garbage collection.

**Approach**: `PipeHandler._send_to_pipe()` calls `pipe.release_send_cache(msg)` unconditionally in a `try/finally` after the loop. The transaction is not cancelled on each retry (the peer may be mid-download); it is released exactly once after the loop terminates.

**Files**: `nvflare/fuel/utils/pipe/pipe_handler.py`

### Fix 3 — `submit_result_timeout` configurable, sized for large-model transfer

**Problem**: `FlareAgent.submit_result_timeout` defaulted to 60 s — far too short for a 5 GiB transfer. There was no way to set it from the job config.

**Approach**: Add `ConfigKey.SUBMIT_RESULT_TIMEOUT` to `ClientConfig`; wire through `ClientAPILauncherExecutor` → `prepare_config_for_launch()` → `FlareAgentWithFLModel` → `FlareAgent`. Operator sets it per job via `recipe.add_client_config({"submit_result_timeout": 1800})`.

**Files**: `nvflare/client/config.py`, `nvflare/app_common/executors/client_api_launcher_executor.py`, `nvflare/client/ex_process/api.py`

### Fix 4 — Release `base_obj` in `clear_cache()` after download completes

**Problem**: `CacheableObject.transaction_done()` cleared `self.cache` but not `self.base_obj`, keeping the numpy source array alive after all receivers had finished downloading from it.

**Approach**: Set `self.base_obj = None` in `transaction_done()` so the reference is dropped deterministically at transaction completion, not deferred to GC.

**Files**: `nvflare/fuel/f3/streaming/cacheable.py`

### Fix 5 — `peer_read_timeout` / `submit_result_timeout` alignment

**Problem**: `peer_read_timeout` (CJ waiting for subprocess to read the task) was 1800 s, but `submit_result_timeout` (subprocess waiting for CJ to ACK result) defaulted to 60 s — structurally inconsistent. A large-model send could exhaust `submit_result_timeout` multiple times before CJ finished downloading.

**Approach**: With Fix 3 making `submit_result_timeout` configurable, both timeouts are now set consistently at job-config level. Fix 13 adds a startup warning when they are mismatched.

### Fix 6 — CJ ACKs task receipt before downloading

**Problem**: `CellPipe._receive_message()` returned `ReturnCode.OK` only after FOBS deserialization completed inline — meaning the subprocess waited for CJ to finish downloading before receiving the ACK. On slow networks or large models, the subprocess timed out waiting for ACK while CJ was still downloading.

**Approach**: Return the pipe-level ACK immediately on receipt, then decode the payload asynchronously. This decouples the acknowledgment latency from the download transfer time.
**Files**: `nvflare/fuel/utils/pipe/cell_pipe.py`

### Fix 7 — Gate PASS_THROUGH on `CellPipe` only

**Problem**: `ClientAPILauncherExecutor.initialize()` enabled PASS_THROUGH regardless of pipe type. For `FilePipe` subprocesses (third-party integrations), the subprocess cannot resolve cell FQCNs to pull from the server's `DownloadService` directly — enabling PASS_THROUGH in that mode silently breaks deserialization.

**Approach**: Check `isinstance(self.pipe, CellPipe)` before enabling PASS_THROUGH. `FilePipe` mode keeps PASS_THROUGH disabled and CJ continues to download and re-serialize into the file.

**Files**: `nvflare/app_common/executors/client_api_launcher_executor.py`

### Fix 8 — Enable reverse PASS_THROUGH (subprocess → CJ → server)

**Problem**: On the result path, CJ decoded the subprocess's result (downloading ~1 GB of tensors), then re-encoded and re-sent to the server (another ~1 GB upload). CJ materialized the full model twice with no benefit — it was a pure relay.

**Approach**: Enable `PASS_THROUGH=True` on the subprocess↔CJ pipe cell. CJ creates `LazyDownloadRef` objects instead of downloading tensors. When CJ re-encodes for the server, `LazyDownloadRefDecomposer` re-emits the subprocess's original `fqcn + ref_id`. The server pulls directly from the subprocess's `DownloadService` — CJ never materializes the result tensors. `MSG_ROOT_TTL` is stamped on the outgoing cell message to keep the subprocess transaction alive for the full download.

**Files**: `nvflare/app_common/executors/client_api_launcher_executor.py`, `nvflare/fuel/utils/pipe/cell_pipe.py`

### Fix 9 — `_MIN_DOWNLOAD_TIMEOUT` configurable via job config

**Problem**: `_MIN_DOWNLOAD_TIMEOUT` was a hardcoded 60 s constant. For 70 B models the inter-chunk gap can exceed 60 s on any realistic network, causing `_monitor_tx` to kill an active transaction mid-transfer.

**Approach**: Add `ConfigVarName.MIN_DOWNLOAD_TIMEOUT`; read via `acu.get_positive_float_var()` in `_create_downloader()`, following the same pattern as `STREAMING_PER_REQUEST_TIMEOUT`. Operator sets `np_min_download_timeout` or `tensor_min_download_timeout` in job config. 60 s remains the fallback default.

**Files**: `nvflare/apis/fl_constant.py`, `nvflare/fuel/utils/fobs/decomposers/via_downloader.py`

### Fix 10 — Bound `max_resends`, wire via job config

**Problem**: `FlareAgent` accepted `max_resends` but it defaulted to `None` (unlimited) and was not exposed through the job config system. Unlimited retries with Fix 1's caching are safe, but still produce an indefinite wait on persistent failures.

**Approach**: Add `ConfigKey.MAX_RESENDS` to `ClientConfig`; wire through `ClientAPILauncherExecutor` → `prepare_config_for_launch()` → `FlareAgentWithFLModel` → `FlareAgent` → `PipeHandler`. Default: 3.

**Files**: `nvflare/client/config.py`, `nvflare/app_common/executors/client_api_launcher_executor.py`, `nvflare/client/ex_process/api.py`

### Fix 12 — Client-job-side RSS logging and memory GC

**Problem**: Subprocess had `log_rss()` + `cleanup_memory()` via `APISpec._maybe_cleanup_memory()`, but the client-job (CJ) process had neither. Per-round RSS growth in CJ was invisible.

**Approach**: Override `check_output_shareable()` in `ClientAPILauncherExecutor` to call `log_rss()` (gated on `NVFLARE_CLIENT_MEMORY_PROFILE=1`) and `_maybe_cleanup_cj_memory()` after each result relay. Reuses the existing `memory_gc_rounds` and `cuda_empty_cache` config params.

**Files**: `nvflare/app_common/executors/client_api_launcher_executor.py`

### Fix 12a — Harden subprocess logging config loading, prevent duplicate file writes

**Problem**: `_configure_subprocess_logging()` hardcoded `local/log_config.json`, missing `.default`/`.conf`/`.yml` variants common in provisioned deployments. Subprocess file handlers duplicated every log line with the parent's rotating log files.
**Approach**: Use `ConfigFactory.load_config(WorkspaceConstants.LOGGING_CONFIG, search_dirs=[local_dir])` so all variants are resolved. Strip all non-console handlers before applying the config — subprocess logs flow exclusively through `SubprocessLauncher`'s stdout pipe to the parent. **Files**: `nvflare/client/ex_process/api.py` **Tests**: `tests/unit_test/client/test_logging_and_rss_tags.py` (6 tests) ### Fix 12b — Compact RSS role tags **Problem**: RSS markers (`after_receive`, `after_send`, `client_job`) were verbose and did not identify which process emitted them. **Approach**: Standardize to `"CA s={site} r={round} recv/send"` (ClientAgent/subprocess) and `"CJ s={site} t={task} r={round} relay"` (ClientJob). Short, greppable, and role-attributable. **Files**: `nvflare/client/ex_process/api.py`, `nvflare/client/in_process/api.py`, `nvflare/app_common/executors/client_api_launcher_executor.py` **Tests**: `tests/unit_test/client/test_logging_and_rss_tags.py` (10 tests) ### Fix 13 — Validate timeout consistency at job start **Problem**: Timeout mismatches (e.g. `min_download_timeout < streaming_per_request_timeout`) were silent until a download failed mid-transfer; the operator had no actionable warning. **Approach**: `_validate_timeout_config()` called at end of `initialize()` logs warnings for: `min_download_timeout < streaming_per_request_timeout`, `submit_result_timeout > min_download_timeout`, and `max_resends is None`. Uses `log_warning` (not `raise`) so the job still runs. **Files**: `nvflare/app_common/executors/client_api_launcher_executor.py` ### Fix 14 (Critical) — Reverse PASS_THROUGH: per-message header, per-call FOBS decode context **Problem**: PASS_THROUGH was enabled globally on the engine cell's FOBS context. The engine cell handles all CJ messages — not just task-result messages from the subprocess, but also Swarm P2P aggregation results from other trainers. 
Swarm P2P messages decoded with `PASS_THROUGH=True` produced `LazyDownloadRef` objects that were passed into aggregation math (`weights[k] += v`), crashing with a dtype error.

**Approach**:
- `CellPipe.pass_through_on_send=True` (set in `ExProcessClientAPI.init()`) stamps `MessageHeaderKey.PASS_THROUGH=True` on outgoing task-result messages from the subprocess.
- `Adapter.call()` reads the header **per-message** and builds a per-call FOBS decode context. Only messages explicitly stamped with `PASS_THROUGH=True` are decoded as `LazyDownloadRef`.
- Swarm P2P messages arrive without the header → `PASS_THROUGH=False` in the decode ctx → tensors decoded normally, no crash.

**Reason**: A per-message header is the only safe granularity; a shared cell FOBS context cannot distinguish result-relay traffic from P2P aggregation traffic.

**Files**: `nvflare/fuel/f3/cellnet/cell.py`, `nvflare/fuel/f3/cellnet/defs.py`, `nvflare/fuel/utils/pipe/cell_pipe.py`, `nvflare/client/ex_process/api.py`
**Tests**: `tests/unit_test/fuel/f3/cellnet/test_pass_through_header.py` (11 tests)

### Fix 15 — MSG_ROOT_TTL on Swarm P2P learn-task messages

**Problem**: `broadcast_and_wait()` unconditionally overwrote any pre-set `MSG_ROOT_TTL` with the ACK timeout (~seconds). Swarm P2P scatter messages therefore carried this short TTL instead of the full `learn_task_timeout` (hours), causing `_monitor_tx` to kill large-model download transactions mid-transfer.

**Approach**:
- `SwarmClientController._scatter()` stamps `ReservedHeaderKey.MSG_ROOT_TTL = float(learn_task_timeout)` on task data **before** `deepcopy` so the TTL reaches the cell router.
- `task_controller.broadcast_and_wait()` uses an explicit `is not None` guard to preserve a pre-set TTL instead of overwriting it.

**Files**: `nvflare/apis/impl/task_controller.py`, `nvflare/app_common/ccwf/swarm_client_ctl.py`
**Tests**: `tests/unit_test/app_common/ccwf/test_msg_root_ttl.py` (6 tests)

### Fix 16 (Critical) — `DOWNLOAD_COMPLETE_CB` gates subprocess exit on server download

**Problem**: The subprocess exited as soon as `send_to_peer()` received the CJ ACK (a sub-second event after Fix 6). Exiting tore down the subprocess's `DownloadService` while the server was still mid-download from it — corrupting or truncating the transfer.

**Approach**:
- `FlareAgent._do_submit_result()` registers a `threading.Event` as `FOBSContextKey.DOWNLOAD_COMPLETE_CB` in the subprocess cell's FOBS context before serialization, then waits up to `download_complete_timeout` seconds for the event to fire.
- `via_downloader._create_downloader()` wires `DOWNLOAD_COMPLETE_CB` as `transaction_done_cb` on `ObjectDownloader`; the event is set when the server's final chunk ACKs.

**Files**: `nvflare/client/flare_agent.py`, `nvflare/fuel/utils/fobs/decomposers/via_downloader.py`, `nvflare/client/config.py`, `nvflare/app_common/executors/client_api_launcher_executor.py`, `nvflare/client/ex_process/api.py`
**Tests**: `tests/unit_test/client/test_download_complete_gating.py` (12 tests)

### Fix 17 — Resolve LazyDownloadRefs before local swarm aggregation

**Problem**: When the swarm aggregator and a trainer ran on the same CJ site, `_local_submit()` bypassed the network path. The result tensors remained `LazyDownloadRef` objects and reached `shareable_to_learnable()` unevaluated. The aggregation function inside `shareable_to_learnable()` expected real numpy arrays and failed with a runtime error when it encountered the opaque ref objects.

**Approach**: Force-resolve all `LazyDownloadRef` objects in the local result before passing it to `_local_submit()`. This mirrors the implicit resolution that occurs via cell deserialization on the remote path.
**Files**: `nvflare/app_common/ccwf/swarm_client_ctl.py`

### Fix 18 — Forward-path PASS_THROUGH for swarm scatter (feature-flagged, off by default)

**Problem**: When the aggregator scattered the global model to trainer CJs, each CJ downloaded the full model (copy #1), deep-copied it for the local learn thread (copy #2), then streamed it to the subprocess. Peak CJ RSS ≈ 2× model size during scatter with no algorithmic necessity.

**Approach**: A feature flag enables `PASS_THROUGH=True` on the forward scatter path. With the flag on, the CJ acts as a pure relay — it never materializes the model tensors, forwarding streaming ref-IDs directly to the subprocess, which pulls the model directly from the aggregator's `DownloadService`. Eliminates both copies from CJ memory.

**Reason**: Feature-flagged to allow gradual rollout; correctness on all Swarm topologies requires validation before enabling by default.

**Files**: `nvflare/app_common/ccwf/swarm_client_ctl.py`, `nvflare/fuel/f3/cellnet/cell.py`

---

## Infrastructure improvements (committed alongside Fix 18)

| # | Change | Files |
|---|--------|-------|
| A | `min_clients: int = 0` on `ServerSideController` — workflow tolerates partial dropout; default 0 = all clients required (backward compatible) | `nvflare/app_common/ccwf/server_ctl.py` |
| B | Subprocess stdout passed through directly via `print()` — no re-logging wrapper, no added timestamp or prefix (subprocess lines already carry their own log formatting) | `nvflare/app_common/launchers/subprocess_launcher.py` |
| C | Split download-gate timing into two log lines: CJ ACK latency + server download duration | `nvflare/client/flare_agent.py` |
| D | `conn_manager.py` shutdown race: set `stopped=True` before executor shutdown; wrap `executor.submit()` in `try/except RuntimeError` | `nvflare/fuel/f3/sfm/conn_manager.py` |
| E | `_Transaction` tracks `start_time` + `total_bytes`; `transaction_done()` logs elapsed time and MB transferred | `nvflare/fuel/f3/streaming/download_service.py` |
| F | `logging.captureWarnings(True)` in `apply_log_config()` so `ResourceWarning` / `DeprecationWarning` reach file handlers | `nvflare/fuel/utils/log_utils.py` |
| G | `try/finally: sess.close()` in `internal_submit_job()` — fix gRPC session resource leak | `nvflare/tool/job/job_cli.py` |

---

## Post-Review Fixes

The following issues were identified during local PR review and resolved in follow-up commits.

### R1 — `min_clients` fault tolerance: simplified to warning-only (`server_ctl.py`)

**Original approach**: After the configure phase, failed clients were removed from `participating_clients`, `result_clients`, and `client_statuses`, a membership-update broadcast was sent, and `starting_client` was reselected if the original starter had failed (see R6).

**Rolled back**: The pruning logic was reverted because removing clients from `participating_clients` mid-startup introduced round-deadlock scenarios (the Gatherer's expected-response set diverged from the live client set across rounds) and the `starting_client` reselection added complexity that masked the root fault. The `progress_timeout` sentinel bug (R5) was also a direct consequence of the pruning code.

**Current behavior**: If `configured_count >= min_clients`, the workflow proceeds. Failed clients that did not configure are logged as a warning — they remain in `participating_clients` and may rejoin in a later round. If `configured_count < min_clients`, the workflow panics as before.

**Files**: `nvflare/app_common/ccwf/server_ctl.py`

### R2 (Must Fix) — Guard `_validate_timeout_config` imports against `ImportError` (`client_api_launcher_executor.py`)

**Problem**: Both lazy imports inside `_validate_timeout_config` could raise `ImportError` in constrained environments, crashing `initialize()` and aborting the job start.

**Fix**: Imports wrapped in `try/except ImportError` with a `log_warning` fallback.
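The R2 pattern (degrade a failed lazy import to a warning instead of crashing `initialize()`) can be sketched as follows. This is an illustrative helper, not NVFlare's actual code; the function name and logging are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


def validate_timeout_config_safely(validate_fn):
    """Run a validation step whose lazy imports may be unavailable.

    Hypothetical helper illustrating the R2 pattern: an ImportError raised
    inside the validation (e.g. in a constrained environment) is downgraded
    to a warning so job initialization can continue.
    """
    try:
        return validate_fn()
    except ImportError as e:
        logger.warning("skipping timeout validation, import failed: %s", e)
        return None
```

The key point is that only `ImportError` is swallowed; genuine configuration errors still propagate.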
### R3 (Should Fix) — Gate defensive `LazyDownloadRef` guard behind `forward_pass_through` (`swarm_client_ctl.py`)

**Problem**: The guard in `_end_gather()` called `_from_shareable(aggr_result)` unconditionally on every aggregation round, deserializing the full aggregated DXO (with all weight data) just to inspect types — wasted work in the common non-PASS_THROUGH case.

**Fix**: The guard is now gated behind `if self.forward_pass_through`. Also removed the redundant local `ReservedHeaderKey` import (already at module level).

### R4 (Nit) — Remove dead `finalize()` pass-through; rename `_round_count` → `_cj_round_count` (`client_api_launcher_executor.py`)

`finalize()` only called `super().finalize()` and was deleted. `_round_count` was renamed to `_cj_round_count` to avoid a naming collision with `APISpec._round_count`.

### R5 — Rolled back (was: `progress_timeout` never fires when `min_clients > 0`)

This issue was a direct consequence of the R1 pruning logic (`overall_last_progress_time = now` made the sentinel equal to the current time). Resolved by rolling back the pruning in R1 — the `progress_timeout` sentinel is unaffected by the current warning-only approach.

### R6 — Rolled back (was: pruned `starting_client` still used)

This issue existed only when failed clients were pruned from `participating_clients`. Since pruning was rolled back in R1, the `starting_client` is always one of the configured participants and no reselection is needed.

### R7 (CI failure) — `test_swarm_memory_gc` stub missing `forward_pass_through` attribute

**Problem**: The R3 fix gated the `LazyDownloadRef` guard in `_end_gather()` behind `if self.forward_pass_through`, but `_make_controller()` in `test_swarm_memory_gc.py` used `__new__` and never set the attribute — 7 tests failed with `AttributeError`.

**Fix**: Added `ctrl.forward_pass_through = False` to the stub.
### R8 — `min_clients=None` default in `SwarmServerController` crashes job init (`swarm_server_ctl.py`)

**Problem**: `SwarmServerController.__init__` defaulted `min_clients=None` and passed it directly to `ServerSideController.__init__()`, which performs `if min_clients < 0:`. In Python 3, `None < 0` raises `TypeError` — crashing job initialization whenever `SwarmServerController` is instantiated without an explicit `min_clients` (e.g. from a JSON config that omits the field).

**Fix**: Changed the default to `min_clients: int = 0` to match `ServerSideController`'s own default.

**Files**: `nvflare/app_common/ccwf/swarm_server_ctl.py`

## Test Strategy

| Fix | Test file | Tests |
|-----|-----------|-------|
| Fix 14 (reverse PASS_THROUGH header) | `tests/unit_test/fuel/f3/cellnet/test_pass_through_header.py` | 11 |
| Fix 15 (MSG_ROOT_TTL) | `tests/unit_test/app_common/ccwf/test_msg_root_ttl.py` | 6 |
| Fix 16 (download gating) | `tests/unit_test/client/test_download_complete_gating.py` | 12 |
| Fix 12a/b (logging + RSS tags) | `tests/unit_test/client/test_logging_and_rss_tags.py` | 16 |
| Infra B (subprocess log passthrough) | `tests/unit_test/app_common/launchers/subprocess_launcher_test.py` | 2 |
| Infra A (min_clients fault tolerance) | `tests/unit_test/app_common/ccwf/test_server_ctl_min_clients.py` | 30 |

All existing unit tests pass; no regressions.

### Clarification: PASS_THROUGH Scope in Swarm P2P

1. The reverse path mirrors the forward path: `subprocess -> CJ -> aggregator`.
2. The subprocess sets PASS_THROUGH on its outgoing pipe message, so the CJ receives proxy/lazy refs first.
3. The CJ then forwards/submits toward the aggregator in Swarm.
4. In Swarm, the aggregator may be a remote CJ, a same-node CJ, or the same-process CJ logic path, but the subprocess is always a different process from the CJ.
5. Aggregation must operate on resolved arrays (not unresolved lazy refs).
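The R8 crash reduces to a one-line Python 3 fact: `None` is not orderable against an `int`. A minimal reproduction, where the hypothetical `check_min_clients` stands in for `ServerSideController`'s guard (the real signature differs):

```python
def check_min_clients(min_clients):
    # Stand-in for ServerSideController's validation; illustrative only.
    if min_clients < 0:
        raise ValueError("min_clients must be >= 0")
    return min_clients


try:
    check_min_clients(None)  # the old SwarmServerController default
    crashed = False
except TypeError:  # Python 3: '<' not supported between NoneType and int
    crashed = True
```

With the fixed default of `0`, the comparison is well-defined and the guard passes.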
---

## Latest Commit: Review Feedback (cb85eba)

### Addressed
- **Style**: Rename `pt` → `passthrough` variable in `cell.py` (YuanTingHsieh review)
- **Style**: Remove `_` prefix on `from_shareable` import in `swarm_client_ctl.py` (nvidianz review)
- **Lint**: Fix all 19 flake8 F401 unused-import warnings across integration and unit test files
- **Naming**: Rename `test_via_downloader_fix9.py` → `test_via_downloader_download_timeout.py`

### Open review items (tracked in `docs/fix_list.md`)
- **YT1**: `LazyDownloadRef` guard in `_end_gather()` — change from `log_error` + recover to `system_panic` + return (agreed, pending implementation)
- **YT2**: `MSG_ROOT_TTL > task.timeout` leaves a DownloadService transaction alive after a send failure — bounded resource leak, tracked as a follow-up
- **YT**: `min_clients` naming collision (job-scheduler vs fault-tolerance quorum) — needs a design decision before H4/H5 wiring
- **YT**: `progress_timeout` interaction with `min_clients` dropout tolerance — follow-up item
- **nvidianz**: `_decomposer_prefix()` in the PT executor — confirmed used (base-class override for `_validate_timeout_config`), keeping
- **nvidianz**: `transaction_done()` info-log verbosity — one log per completed model download; discussing a gating strategy
- **nvidianz**: Consolidate timeout params — review which can be combined or derived from each other

**31 non-test production files** changed in this PR:

**Core APIs (4)**
- `nvflare/apis/fl_constant.py` — `ConfigVarName.MIN_DOWNLOAD_TIMEOUT`
- `nvflare/apis/impl/task_controller.py` — pre-set `MSG_ROOT_TTL` preservation
- `nvflare/apis/shareable.py` — `ReservedHeaderKey.PASS_THROUGH`
- `nvflare/fuel/f3/cellnet/defs.py` — `MessageHeaderKey.PASS_THROUGH`

**Swarm / CCWF (5)**
- `nvflare/app_common/ccwf/ccwf_job.py` — `min_clients` wiring
- `nvflare/app_common/ccwf/common.py` — membership update topic
- `nvflare/app_common/ccwf/server_ctl.py` — `min_clients` fault tolerance (warning-only on partial dropout; pruning rolled back)
- `nvflare/app_common/ccwf/swarm_client_ctl.py` — Fix 15/17/18, `_resolve_lazy_refs`, forward pass-through
- `nvflare/app_common/ccwf/swarm_server_ctl.py` — `min_clients` forwarding; default fixed to `int = 0`

**Executors / Launchers (3)**
- `nvflare/app_common/executors/client_api_launcher_executor.py` — Fix 3/7/8/10/12/13/16
- `nvflare/app_common/launchers/subprocess_launcher.py` — direct stdout passthrough (no re-logging)
- `nvflare/app_opt/pt/client_api_launcher_executor.py` — H2 param forwarding

**Recipe (1)**
- `nvflare/app_opt/pt/recipes/swarm.py` — recipe params

**Client (5)**
- `nvflare/client/config.py` — `submit_result_timeout`, `max_resends`, `download_complete_timeout`
- `nvflare/client/constants.py` — config key constants
- `nvflare/client/ex_process/api.py` — Fix 12a/12b/14/16
- `nvflare/client/flare_agent.py` — Fix 16 download gating
- `nvflare/client/in_process/api.py` — Fix 12b RSS tags

**Fuel / Streaming (5)**
- `nvflare/fuel/f3/streaming/cacheable.py` — Fix 4 `base_obj` release
- `nvflare/fuel/f3/streaming/download_service.py` — byte accounting, transaction logging
- `nvflare/fuel/utils/fobs/__init__.py` — `FOBSContextKey.DOWNLOAD_COMPLETE_CB`
- `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` — Fix 9/16
- `nvflare/fuel/utils/memory_utils.py` — GC logging

**Fuel / Pipe (3)**
- `nvflare/fuel/utils/pipe/cell_pipe.py` — Fix 1/6/8/14
- `nvflare/fuel/utils/pipe/pipe.py` — `release_send_cache` interface
- `nvflare/fuel/utils/pipe/pipe_handler.py` — Fix 2 cache teardown

**Fuel / Cell & Infra (3)**
- `nvflare/fuel/f3/cellnet/cell.py` — Fix 14 per-message PASS_THROUGH
- `nvflare/fuel/f3/sfm/conn_manager.py` — shutdown race fix
- `nvflare/fuel/utils/log_utils.py` — `captureWarnings`

**Private / Other (2)**
- `nvflare/private/aux_runner.py` — PASS_THROUGH header propagation
- `nvflare/tool/job/job_cli.py` — session leak fix

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add missing submit_model executor.

### Description
Add the missing submit_model executor to fix FLARE-2758. Added a test and ensured that it passes, to prevent missing this same path in the future.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes FLARE-2757.

### Description
Cyclic error with serialization; the root cause was not using the proper controller.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…IA#4250)

### Fix SimEnv to be compatible with the strict simulator_run API

The simulator_run API originally rejected any call that provided both named clients and n_clients, since they are redundant — if you've named your clients, the count is already known. This PR fixes SimEnv properly instead:

- sim_env.py: Fix SimEnv.deploy() to pass n_clients only when self.clients is None (i.e., for ALL_SITES expansion). When explicit client names are given, n_clients is omitted.
- fed_job_test.py: Update unit tests to reflect the restored strict behavior.

The root cause was that SimEnv always passed both n_clients and clients to simulator_run. The correct fix is in SimEnv, not in relaxing the API contract.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
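The argument-selection rule the SimEnv fix describes can be sketched as a small helper. The function name `build_simulator_args` is hypothetical and not the literal NVFlare code; it only illustrates "pass `n_clients` only when `clients` is `None`":

```python
def build_simulator_args(clients, n_clients):
    """Select which of clients / n_clients to forward to a strict API.

    Illustrative sketch of the SimEnv.deploy() fix: when explicit client
    names are given, the count is redundant and must be omitted so the
    strict simulator_run contract is satisfied.
    """
    if clients is not None:
        # Named clients: drop n_clients entirely (ALL_SITES expansion
        # is the only case that needs the count).
        return {"clients": clients}
    args = {}
    if n_clients is not None:
        args["n_clients"] = n_clients
    return args
```

Under the old behavior both keys were always forwarded, which the strict API rejects.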
…4252)

Cherry-pick of NVIDIA#4224 from `main` into `2.7`. Includes the metric filtering, warning de-duplication, and related unit tests.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Fixes NVIDIA#4244.

### Description

## Problem
When using a dict model config with `SimpleSwarmLearningRecipe`, the exported `config_fed_client.json` contained an empty `args: {}` for the persistor's model, losing all constructor arguments (e.g. `model_name_or_path` for HuggingFace models).

Root cause: introduced in NVIDIA#4130, `_instantiate_model_from_dict` eagerly instantiated the model from the dict and passed the live `nn.Module` to `PTFileModelPersistor`. The job serializer (`_get_args`) then introspected the live object and could not recover the original constructor args (especially for HuggingFace models, where args are buried in internal config rather than stored as plain instance attributes).

## Fix
Remove `_instantiate_model_from_dict` and pass the normalized dict `{"path": "...", "args": {...}}` directly to `PTFileModelPersistor`. The persistor already supports dict config and resolves it to an `nn.Module` at runtime via `instantiate_class()` — no eager instantiation is needed at recipe construction time.

Side benefit: large models (e.g. LLaMA-8B) are no longer loaded into memory just to export a job config.

## Changes
- `nvflare/app_opt/pt/recipes/swarm.py`: remove `_instantiate_model_from_dict`, the unused `importlib` import, and the `model_instance` indirection — pass the `model` dict directly to `PTFileModelPersistor`

## Test
`TestSimpleSwarmLearningRecipeExport.test_export_preserves_dict_model_args_in_client_config` — exports the job and asserts the persistor's model args are present in the JSON config. This test would have failed before this fix.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes FLARE-2760, bug/5949101, bug/5948918.

### Description
- Add ProdEnv info to the edge example readme (already present in docs/user_guide/data_scientist_guide/job_recipe.rst)
- Add a model_state check to buff_model_manager
- Add a timeout arg to buff_recipe

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…NVIDIA#4253)

## Summary
- add a FlowerRecipe version check that enforces `flwr>=1.16,<1.26` with clear runtime errors
- add unit tests for missing/incompatible/compatible flwr versions
- update Flower docs and hello-flower dependency files to document and install a compatible flwr range
- align the Flower executable error hint with the same requirement

## Testing
- `PYTHONDONTWRITEBYTECODE=1 pytest tests/unit_test/recipe/spec_test.py tests/unit_test/recipe/flower_recipe_test.py -q`

## Issue
Fixes Flower (flwr) CLI version incompatibility on `2.7` (`--federation-config` is not available in newer Flower CLI versions).
…ls [docs] (NVIDIA#4264)

## Summary
- clarify in the TensorFlow recipe docs that a class-instance model input should be a user-defined subclassed Keras model (or dict config)
- add matching caveats for initial_ckpt guidance when a model is provided
- fix the examples/hello-world/hello-tf/README.md model snippet (it incorrectly showed PyTorch code)

## Scope
- docs-only changes
- no runtime code changes

## Testing
- not run (documentation updates only)
…idation (NVIDIA#4258)

## Summary
- add prominent TenSEAL provisioning guidance to the `FedAvgRecipeWithHE` docstring, including a concrete code example to generate `server_context.tenseal` and `client_context.tenseal`
- add a SimEnv preflight validation in `FedAvgRecipeWithHE.process_env()` that checks for the required context files and raises a descriptive error instead of failing later with an opaque missing-file error
- include a direct docs link in the error message: https://nvflare.readthedocs.io/en/2.7/programming_guide/provisioning_system.html
- add unit tests covering the missing-context failure and successful validation when both files exist

## Testing
- `TMPDIR=$PWD/.tmp_test PYTHONDONTWRITEBYTECODE=1 pytest -q -p no:cacheprovider tests/unit_test/recipe/fedavg_he_context_provisioning_test.py` (passes with skips when `tenseal` is not installed in the environment)

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
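A preflight check in the spirit of the `process_env()` validation above can be sketched as follows. The helper name `validate_he_context_files` is an assumption; the file names and docs link are taken from the PR text:

```python
import os


def validate_he_context_files(workspace_dir):
    """Fail fast if TenSEAL context files are missing (illustrative sketch).

    Mirrors the idea described in the PR: raise a descriptive error up
    front rather than an opaque missing-file error deep in the run.
    """
    required = ["server_context.tenseal", "client_context.tenseal"]
    missing = [f for f in required if not os.path.isfile(os.path.join(workspace_dir, f))]
    if missing:
        raise FileNotFoundError(
            f"HE context files missing from {workspace_dir}: {missing}. "
            "See https://nvflare.readthedocs.io/en/2.7/programming_guide/provisioning_system.html"
        )
```

The error message names every missing file at once, so the operator fixes provisioning in a single pass.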
Fix: Graceful handling when `executorch` is not installed (ET FedBuff)
## Problem
`ETFedBuffRecipe` and `ETTaskProcessor` imported `executorch` at module
level. Any environment without `executorch` installed would get a raw
`ModuleNotFoundError` deep in the import chain — with no indication of
what to install or why.
## Changes
**`nvflare/edge/models/model.py`**
- Replace top-level `from executorch.exir import to_edge` with
`optional_import(...)` — defers the error to call time so importing
`DeviceModel` no longer requires `executorch`
**`nvflare/edge/simulation/et_task_processor.py`**
- Same pattern for `_load_for_executorch_for_training_from_buffer` and
`get_sgd_optimizer` from `executorch.extension.training`
**`nvflare/edge/tools/et_fed_buff_recipe.py`**
- Add explicit guard in `ETFedBuffRecipe.__init__()`: checks
`importlib.util.find_spec("executorch")` and raises
`ImportError("ETFedBuffRecipe requires executorch. See installation
instructions:
https://pytorch.org/executorch/stable/getting-started-setup.html")` —
fires early with actionable guidance instead of a cryptic traceback
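The guard generalizes to any optional dependency: probe `importlib.util.find_spec()` and raise an `ImportError` with install guidance before any deep import chain runs. A generic sketch (the helper name `require_module` is an assumption, not NVFlare or executorch API):

```python
import importlib.util


def require_module(name, hint):
    """Raise an actionable ImportError if an optional dependency is absent.

    find_spec() checks availability without importing the module, so the
    check is cheap and cannot itself trigger a deep import failure.
    """
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"{name} is required. {hint}")
```

Calling it first thing in `__init__` is what moves the failure from a cryptic mid-import traceback to a clear constructor-time error.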
**`tests/unit_test/recipe/edge_recipe_test.py`**
- Add `TestETFedBuffRecipeSimBasic` — basic initialization test,
`@pytest.mark.skipif` when `executorch` is absent
- Add
`TestETFedBuffRecipeWithoutExecutorch.test_raises_import_error_without_executorch`
— mocks `find_spec` to return `None` so the negative test always runs
regardless of environment; asserts `ImportError` with `"executorch"` in
the message
## Notes
- HE is intentionally not supported in `SimEnv` — `PocEnv(use_he=True)`
is the correct local environment for HE workflows (it has full
provisioning that generates TenSEAL context files)
### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Update example readme.

### Description
Update the example readme; remove non-existent links.

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… validate hang, round_timeout (NVIDIA#4270)

## Summary
Fixes four bugs in NVFlare 2.7.2 RC12 affecting Swarm Learning with `launch_external_process=True`.

Implementation reference: @GeorgeWang-nv's original PR NVIDIA#4263
- Bug 1 -- different implementation from PR 4263
- Bug 2 -- accept the implementation approach, with small changes, from PR 4263
- Bug 3 -- use a different approach
- Bug 4 -- add a timeout to the swarm API, but consolidate the timeouts from 4 to 2 and expose them to the Swarm Recipes
- Bug 5 -- restore `self._max_resends` (private convention); subprocess logging fixed (see below)
- Bug 6 -- guard ConnManager against RuntimeError after shutdown; 4 regression tests added

**Additional (review feedback):**
- Rename `SimpleSwarmLearningRecipe` → `SwarmLearningRecipe` everywhere; backward-compat alias kept
- Fix docs: add the missing required `min_clients` arg to all `SwarmLearningRecipe` examples (TypeError for users copying doc examples)
- CSE inventory scan: take the LAST "best" key, not the first — prevents an initial checkpoint named "best_*" from shadowing the trained best model
- Add `pipe_type` / `pipe_root_path` to `SwarmLearningRecipe` so users can select `FilePipe` without dropping to the low-level API

### Bug 1 — Download ref deleted before tensor receivers finish (`via_downloader.py`)

**Root cause**: `_create_downloader()` subscribed to msg_root deletion to clean up download transactions. The msg_root is deleted as soon as blob envelopes are delivered, but `blob_cb` fires asynchronously — secondary tensor downloads were still in flight when `delete_transaction()` removed the ref from `_ref_table`, causing a `"no ref found"` FATAL_SYSTEM_ERROR on large models (≥1B parameters).

**Fix**: Remove `subscribe_to_msg_root` from `_create_downloader()` entirely. Transaction lifecycle is now managed solely by `_monitor_tx()` in `download_service.py`, which polls `is_finished()` every 5 s and cleans up only after all chunk downloads complete.
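The lifecycle rule Bug 1 restores (clean up only once `is_finished()` reports all chunk downloads complete, never on msg_root deletion) can be sketched as a simple polling monitor. All names here are illustrative stand-ins, not NVFlare's `_monitor_tx` API:

```python
import time


def monitor_transaction(tx, delete_transaction, poll_interval=5.0, timeout=None):
    """Poll a transaction until it reports finished, then clean it up.

    Sketch of the Bug 1 fix's ownership model: the monitor (not a
    msg_root-deletion callback) is the single owner of transaction teardown,
    so asynchronous secondary downloads can never race the cleanup.
    """
    start = time.time()
    while not tx.is_finished():
        if timeout is not None and time.time() - start > timeout:
            break  # give up; real code would log and abort the transaction
        time.sleep(poll_interval)
    delete_transaction(tx)
```

The single-owner shape is the point: with exactly one code path allowed to call `delete_transaction()`, the "no ref found" race cannot occur.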
### Bug 2 — CSE fails to load local model after ext-process training (`cse_client_ctl.py`)

**Root cause**: `_prepare_local_model()` called `submit_model_executor.execute()`, which, for `ClientAPILauncherExecutor`, launches a fresh subprocess with no trained model state. The training subprocess had already exited; the model had already been saved to disk by `PTFileModelPersistor`.

**Fix**: Try the persistor first. The inventory key scan prefers the LAST key containing `"best"` when `model_name` contains `"best"` — taking the last "best" key avoids mistaking an initial checkpoint named `"best_*"` (added first by `PTFileModelPersistor`) for the trained best model. An `isinstance` guard replaces `assert isinstance` (safe under `python -O`). Falls back to the executor for in-process-mode compatibility.

### Bug 3 — Validate results hang 1800 s on CSE round 2+ (`flare_agent.py` + `via_downloader.py`)

**Root cause**: `FlareAgent._do_submit_result()` unconditionally waited 1800 s for `DOWNLOAD_COMPLETE_CB` after every result send when `pass_through_on_send=True`. For validate results (metrics only, no tensors), `_finalize_download_tx()` creates no download transaction and the callback never fires — the subprocess blocks indefinitely.

**Fix**: A thread-local `was_download_initiated()` flag, set by `_finalize_download_tx()` only when downloadable objects exist. The agent returns immediately if it is `False` (a validate result). Thread-local storage is required because the task pipe and metric pipe share the same `CoreCell` (same `site_name + token + mode` → same FQCN → same `_CellInfo` cache entry → same `fobs_ctx`); a plain flag would be clobbered by concurrent metric serialization from a different thread.

### Bug 4 — P2P model ACK timeout too short for large models (`swarm.py`)

**Root cause**: `SwarmClientConfig` hardcodes `learn_task_ack_timeout=10s` and `final_result_ack_timeout=10s`. For large models (≥2 GB), P2P tensor streaming takes minutes — the ACK times out before the download completes.

**Fix**: Add `round_timeout: float = 3600` to `SwarmLearningRecipe`. It wires both `learn_task_ack_timeout` and `final_result_ack_timeout`; `learn_task_timeout` is intentionally left `None` (unbounded) to avoid capping per-round training time on slow hardware or 70B+ models.

### Bug 5 — `_max_resends` attribute shadowing (`client_api_launcher_executor.py`)

**Root cause**: `ClientAPILauncherExecutor.__init__()` stored `max_resends` as `self._max_resends`, but `TaskExchanger` (the parent) uses `self.max_resends`. `PipeHandler` reads `self.max_resends` and saw `None` (the parent default) instead of the configured value.

**Fix**: Restore `self._max_resends` as the private attribute throughout `ClientAPILauncherExecutor` (consistent with the private-member convention); the executor builds its own config dict explicitly so it does not rely on the inherited attribute.

### Subprocess logging fix (`subprocess_launcher.py` + `ex_process/api.py`)

**Problem**: The subprocess had no logging configuration — `logger.info()` calls were silently dropped. Wrapping all stdout via `logger.info()` in the parent caused double-prefixed entries in `log.txt` for NVFlare-formatted log lines.

**Fix**: `_configure_subprocess_logging()` in `ex_process/api.py` loads the site's `log_config.json` unchanged, giving the subprocess loggers identical to the parent's (both consoleHandler and file handlers). The parent's `log_subprocess_output()` now calls `_route_subprocess_line()`, which strips ANSI codes and checks for a `YYYY-MM-DD HH:MM:SS` timestamp:

- Formatted NVFlare log line → `print()` to the terminal only (the file handler already wrote it to `log.txt`)
- Raw `print()` from the user training script → `logger.info()` so it reaches `log.txt`

Regression tests: `test_log_subprocess_output_formatted_lines_not_double_logged` + 5 `TestRouteSubprocessLine` tests.
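The routing rule in the subprocess logging fix can be sketched as follows. This is an illustrative reimplementation, not the actual `_route_subprocess_line()`; it returns a routing decision instead of performing the `print()`/`logger.info()` side effects:

```python
import re

# ANSI color escape sequences that terminals embed in subprocess output.
_ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# A line already formatted by a logger starts with a YYYY-MM-DD HH:MM:SS stamp.
_TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def route_subprocess_line(line):
    """Decide where a captured subprocess stdout line should go (sketch).

    'terminal' = print() only, because the subprocess file handler already
    wrote the line to log.txt; 'log' = re-log via logger.info() so a raw
    print() from the training script also reaches log.txt.
    """
    stripped = _ANSI_RE.sub("", line)
    return "terminal" if _TS_RE.match(stripped) else "log"
```

Stripping ANSI codes before the timestamp check matters: a color-prefixed but otherwise formatted line would otherwise be misrouted and double-logged.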
### Bug 6 — ConnManager crashes on frame arrival after shutdown (`conn_manager.py`)

**Root cause**: `process_frame_task()` and `start_connector()` could raise an unhandled `RuntimeError` when called after the executor was shut down, causing noisy tracebacks during job teardown.

**Fix**: Guard `conn_mgr_executor.submit()` with `try/except RuntimeError` (log debug, skip). Add an `if self.stopped: return` early exit in `process_frame_task()`.

**Regression tests**: 4 new tests in `test_conn_manager_shutdown_race.py`.

### `pipe_type` / `pipe_root_path` for `SwarmLearningRecipe`

`SwarmLearningRecipe` always created a `CellPipe`; users needing `FilePipe` (restricted networks, third-party integrations) had to drop to the low-level `CCWFJob` + `SwarmClientConfig` API.

**Change**: Add `pipe_type: str = "cell_pipe"` and `pipe_root_path: Optional[str] = None`:

- `"cell_pipe"` (default): unchanged — zero-copy PASS_THROUGH forwarding, ~1 GB RAM overhead
- `"file_pipe"`: `FilePipe` created with `root_path` defaulting to `{WORKSPACE}/pipe` (resolved at runtime); a custom absolute path is validated to exist at recipe construction time
- Warnings when `pipe_root_path` is ignored with `cell_pipe`, and for `file_pipe` with `launch_external_process=False`
- `ScriptRunner.__init__` gains a `task_pipe` parameter forwarded to `BaseScriptRunner`

9 new tests in `TestSwarmLearningRecipePipeType`.
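The two recipe-time warnings described for the pipe configuration can be sketched as a small validator. The function name and returned warning strings are illustrative, not the recipe's actual messages:

```python
def validate_pipe_config(pipe_type, pipe_root_path, launch_external_process):
    """Return warnings for pipe configurations that are legal but suspect.

    Sketch of the recipe-construction-time checks described above: neither
    combination is an error, but each deserves a warning so the user knows
    a parameter is being ignored or a pipe choice is unusual.
    """
    warnings = []
    if pipe_type == "cell_pipe" and pipe_root_path is not None:
        warnings.append("pipe_root_path is ignored with cell_pipe")
    if pipe_type == "file_pipe" and not launch_external_process:
        warnings.append("file_pipe selected with launch_external_process=False")
    return warnings
```

Returning the warnings (rather than logging them) keeps the sketch testable; the real recipe would log them at construction time.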
## Files Changed | File | Bug | Change | |------|-----|--------| | `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | 1 + 3 | Remove `subscribe_to_msg_root`; add thread-local `_tls`, `was_download_initiated()`, `clear_download_initiated()` | | `nvflare/client/flare_agent.py` | 3 | Check `was_download_initiated()` after `send_to_peer()`; return immediately for validate results | | `nvflare/app_common/ccwf/cse_client_ctl.py` | 2 | Persistor-first `_prepare_local_model()` with last-"best"-key inventory scan and isinstance guard | | `nvflare/app_opt/pt/recipes/swarm.py` | 4 + rename + pipe | Add `round_timeout`, `pipe_type`, `pipe_root_path`; rename to `SwarmLearningRecipe`; backward-compat alias kept | | `nvflare/job_config/script_runner.py` | pipe | Add `task_pipe` param to `ScriptRunner`, forwarded to `BaseScriptRunner` | | `nvflare/app_common/ccwf/recipes/swarm.py` | rename | Export `SwarmLearningRecipe` alongside `SimpleSwarmLearningRecipe` | | `nvflare/app_opt/pt/recipes/__init__.py` | rename | Add `SwarmLearningRecipe` to lazy imports and `__all__` | | `nvflare/app_common/executors/client_api_launcher_executor.py` | 5 | Restore `self._max_resends` (private convention) | | `nvflare/app_common/launchers/subprocess_launcher.py` | logging | `_route_subprocess_line()`: formatted lines → `print()`, raw lines → `logger.info()` | | `nvflare/client/ex_process/api.py` | logging | `_configure_subprocess_logging()`: apply site log config unchanged (same handlers as parent) | | `nvflare/fuel/f3/sfm/conn_manager.py` | 6 | Guard executor submit + frame processing against post-shutdown RuntimeError | | `docs/programming_guide/memory_management.rst` | docs | Add missing `min_clients` to example | | `docs/user_guide/data_scientist_guide/available_recipes.rst` | docs | Add missing `min_clients`; rename to `SwarmLearningRecipe` | | `docs/programming_guide/controllers/client_controlled_workflows.rst` | docs | Add missing `min_clients` (x2) | | 
`docs/programming_guide/timeouts.rst` | docs | Rename to `SwarmLearningRecipe` | | `examples/advanced/swarm_learning/swarm_pt/README.md` | 4 + rename | Add `round_timeout` to config table; rename to `SwarmLearningRecipe` | | `examples/advanced/swarm_learning/swarm_pt/job.py` | rename | Update import to `SwarmLearningRecipe` | | `tests/unit_test/fuel/utils/fobs/test_via_downloader_msg_root.py` | 1 | 4 new tests | | `tests/unit_test/client/test_download_initiated_gating.py` | 3 | 7 new tests | | `tests/unit_test/app_common/ccwf/test_cse_persistor_fallback.py` | 2 | 10 new tests (incl. initial-ckpt-shadowing regression) | | `tests/unit_test/fuel/f3/sfm/test_conn_manager_shutdown_race.py` | 6 | 4 new regression tests | | `tests/unit_test/app_common/launchers/subprocess_launcher_test.py` | logging | 6 new tests for `_route_subprocess_line` and double-logging prevention | | `tests/unit_test/recipe/swarm_recipe_test.py` | rename + pipe | Updated to `SwarmLearningRecipe`; 9 new `TestSwarmLearningRecipePipeType` tests | | `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | 5 | Updated: `_max_resends` assertions | | `tests/unit_test/recipe/component_config_verification_test.py` | rename | Updated to `SwarmLearningRecipe` | ## Test Plan - [x] 250+ unit tests pass across all affected modules - [x] Bug 1: `test_via_downloader_msg_root.py` — verifies `subscribe_to_msg_root` never called, method removed, `DOWNLOAD_COMPLETE_CB` unaffected - [x] Bug 2: `test_cse_persistor_fallback.py` — verifies persistor-first path, last-"best"-key preference, initial-ckpt-shadowing regression, isinstance guard, all fallback paths - [x] Bug 3: `test_download_initiated_gating.py` — verifies thread isolation, validate returns immediately (<1s), train waits, clear-before-send - [x] Bug 4: covered by existing recipe and ccwf tests - [x] Bug 5: `client_api_launcher_executor_test.py` updated (`_max_resends` assertions) - [x] Bug 6: `test_conn_manager_shutdown_race.py` — 4 
tests: executor-shutdown no-raise, stop() no-raise, stopped early-return, Prefix.from_bytes not called - [x] Logging: `subprocess_launcher_test.py` — double-logging prevention, ANSI stripping, plain-line capture, partial-timestamp detection - [x] Pipe type: `swarm_recipe_test.py` — 9 tests: default cell_pipe, file_pipe instance, workspace template, custom path, invalid type, warnings, path validation - [x] Rename: backward-compat `SimpleSwarmLearningRecipe` import test passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes 5948918 . ### Description Jobs created via an ET subclass need to be able to set and pass this parameter. ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
### Description Make integration tests' install_requirements more robust ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. --------- Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>
Fixes # . ### Description Summary of changes for the ScaffoldRecipe + Client API scaffold_c_diff KeyError: 1. Defensive check in scaffold_aggregate_fn File: nvflare/app_common/workflows/scaffold.py Before using _result.meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF], the code now checks that the key exists. If it is missing, it raises a ValueError that: Names the client (if present in meta) States that scaffold_c_diff is required in FLModel.meta Explains that the client must use PTScaffoldHelper: init(model), model_update() during training, terms_update() after training, and send get_delta_controls() in meta Points to nvflare.app_opt.pt.scaffold.PTScaffoldHelper So instead of an opaque KeyError: 'scaffold_c_diff', users get a clear message and next steps. 2. Stronger docstring for ScaffoldRecipe File: nvflare/app_opt/pt/recipes/scaffold.py The Client script requirement section now explicitly states that: Scaffold is not like FedAvgRecipe: the client must use PTScaffoldHelper The client must set meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF] = scaffold_helper.get_delta_controls() in the returned FLModel A plain flare.receive/send loop without PTScaffoldHelper will cause aggregation to fail ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
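The defensive check described in item 1 can be sketched as follows. This is a hypothetical reconstruction: the constant value, the `client_name` meta key, and the helper function name are assumptions; only the error-message content follows the description.

```python
# Assumed to mirror AlgorithmConstants.SCAFFOLD_CTRL_DIFF.
SCAFFOLD_CTRL_DIFF = "scaffold_c_diff"

def check_scaffold_meta(meta: dict) -> None:
    """Raise a descriptive ValueError instead of an opaque KeyError."""
    if SCAFFOLD_CTRL_DIFF not in meta:
        client = meta.get("client_name", "<unknown>")
        raise ValueError(
            f"Result from client '{client}' is missing required meta key "
            f"'{SCAFFOLD_CTRL_DIFF}'. Scaffold clients must use "
            "PTScaffoldHelper: call init(model), model_update() during "
            "training, terms_update() after training, and send "
            "get_delta_controls() in FLModel.meta. See "
            "nvflare.app_opt.pt.scaffold.PTScaffoldHelper."
        )
```

With this guard, a client script that uses a plain `flare.receive`/`send` loop fails fast with actionable next steps instead of a bare `KeyError: 'scaffold_c_diff'` deep inside aggregation.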
## Summary ### Original documentation updates - **Release notes**: large-model subprocess reliability section, 7 bug fix entries, edge recipe improvements - **available_recipes**: remove `SimpleSwarmLearningRecipe` alias, add `round_timeout`, subprocess tuning notes - **timeout_troubleshooting**: subprocess result timeout and Swarm P2P transfer scenarios, new quick-reference rows - **timeouts.rst**: `max_resends`, `np/tensor_min_download_timeout`, `round_timeout` cross-refs, updated Swarm example - **memory_management.rst**: client training process memory cleanup section, `NVFLARE_CLIENT_MEMORY_PROFILE` env var - **client_controlled_workflows.rst**: `round_timeout` in Swarm examples, Client Dropout Tolerance section - **industry_use_cases.rst**: FAITE, JP Morgan/BNY/RBC, ORNL OLCF, AV, Holoscan, Data Federation Mesh; FLARE Day talk links added ### Broken link and consistency fixes - **examples/README.md**: remove Section 2 step-by-step examples (notebooks no longer exist), fix CIFAR-10 paths (`cifar10/` → `cifar10/pt/`), fix `lr-newton-raphson` → `hello-world/hello-lr`, remove dead `code-pre-install` entry, renumber sections - **examples/advanced/README.md**: fix CIFAR-10 paths, remove dead `random_forest`, `brats18`, `prostate`, `fl_hub` entries - **examples/hello-world/README.md**: remove broken `hello-ccwf` link - **docs/examples/medical_image_analysis.rst**: fix `brats18`/`prostate` paths (`examples/advanced/` → `research/`) - **timeout_troubleshooting.rst**: fix LLM example `submit_task_result_timeout` (1200 → 1800) to be ≥ `submit_result_timeout`; clarify note on setting both timeouts consistently ### Web tutorial catalog updates (web/src/components/) - **gettingStarted.astro**: fix two broken `getting_started/` links — Step 3 now points to `job_api/pt/src/cifar10_fl.py`, Step 4 to readthedocs quickstart - **series.astro**: fix `getting_started` → `hello-world`, `hello-fedavg` → `hello-pt`, CIFAR-10
paths, `brats18`/`prostate` → `research/`, `xgboost_secure` path - **tutorials.astro**: add 17 missing examples — Job Recipe, Logging Tutorial, Hello PyTorch Lightning, Hello PyTorch Lightning Eval, Hello Differential Privacy, Hello Flower, Federated Logistic Regression with Newton-Raphson, Split Learning, PSI, Tensor Streaming for LLMs, AMPLIFY Protein Model, Confidential Computing Provision, Cross-Edge FL, BraTS18, FedHCA2, FedOBD, Federated Prostate Segmentation ## Test plan - [ ] Verify RST renders correctly - [ ] Check all external links resolve - [ ] Verify web tutorial catalog links are accessible 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Fix: subprocess cell torn down before server tensor download (large-model deadlock) ### Problem `big_model_4g` CI test reliably failed with: ``` RuntimeError: failed to download from site-X_active ``` after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download retries). **Root cause 1 — ordering bug in `_finalize_external_execution()`:** The subprocess client API works by keeping the subprocess's CellNet connection alive after it sends its result, so the server can pull large tensors directly from the subprocess's `DownloadService` (the "reverse PASS_THROUGH" path). To prevent the subprocess from exiting before the download completes, `_do_submit_result()` blocks on a `download_done.wait(1800 s)` event that fires only after the server has finished downloading all tensors. However, `_finalize_external_execution()` called `stop_task()` **synchronously**, which sends `SIGTERM` to the subprocess immediately after `execute()` receives the result — before `ClientRunner` had even sent `SubmitUpdate` to the server, let alone before the server had started downloading. This tore down the subprocess cell, causing every server download attempt to hit `"no connection to site-X_active"` / `"cannot forward req: no path"`. The subprocess-side wait was therefore unreachable: the process was killed externally before it could block. **Root cause 2 — `launch_once=True` subprocess exits after round 1:** `_do_submit_result()` called `os._exit(0)` unconditionally at the end of the `CellPipe + pass_through_on_send` path. For `launch_once=False` (one subprocess per round) this is correct — the process should exit immediately after the download so the deferred-stop poller unblocks. But for `launch_once=True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`, `pt_client_api_launch_once`), the subprocess was killed after round 1, leaving rounds 2–N unhandled and the job hanging. --- ### Fix **1. 
Deferred `stop_task()` (`launcher_executor.py`)** Instead of calling `stop_task()` synchronously, `_finalize_external_execution()` now starts a background thread (when `_stop_task_wait_timeout > 0`) that polls `check_run_status()` until the subprocess exits naturally (i.e. after `download_done` fires and the subprocess unblocks), then calls `stop_task()`. `execute()` returns immediately, so `ClientRunner` can send `SubmitUpdate` and the server can connect to the still-alive subprocess cell to download tensors. **2. Round-boundary coordination (`_deferred_stop_event`)** Without additional synchronization, the deferred thread from round N could fire *after* round N+1's `launch_task()` call. Since `SubprocessLauncher` guards `_start_external_process()` with `if self._process is None`, seeing a not-yet-cleared reference to the exited round-N process causes it to skip starting a new subprocess. Round N+1 then sees `COMPLETE_SUCCESS` immediately and fails with `"External process has not called flare.init and run status becomes success"`. A `threading.Event` (`_deferred_stop_event`, initially set) coordinates the two: - **Cleared** in `_finalize_external_execution()` just before the deferred thread starts. - **Set** unconditionally in a `finally` block when the deferred thread completes. - `_initialize_external_execution()` **waits** on this event before calling `launch_task()`, with a forced `stop_task()` fallback if it times out. In practice the wait is 0–1 s (subprocess exits naturally after download, well before the server completes global aggregation and dispatches the next round's task), so there is no meaningful latency impact. **3. `ClientAPILauncherExecutor` opt-in (`client_api_launcher_executor.py`)** `_stop_task_wait_timeout` is set to `download_complete_timeout` (default 1800 s) in `ClientAPILauncherExecutor.__init__()`, enabling the deferred path only for the subprocess client API where large-tensor downloads are expected. 
The base `LauncherExecutor` defaults to `0.0` (original synchronous behaviour, no change). **4. `launch_once`-aware subprocess exit (`flare_agent.py`, `config.py`, `ex_process/api.py`)** `prepare_config_for_launch()` now writes `launch_once` (derived from `launcher.needs_deferred_stop()`) into the subprocess config file. The subprocess reads it via `ClientConfig.get_launch_once()` and passes it to `FlareAgent`. `_do_submit_result()` branches on `_launch_once`: | `launch_once` | Behaviour after download gate | |---|---| | `False` (one subprocess per round, e.g. `pt-client-api`) | `os._exit(0)` called directly — deferred-stop poller unblocks immediately (original behaviour preserved) | | `True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`) | `atexit.register(os._exit, 0)` registered once — subprocess continues to next round; `os._exit` fires only when `main()` finally returns, bypassing non-daemon CoreCell thread cleanup | `_resolve_launch_once()` safely fetches the launcher even when `self.launcher` is still `None` at `initialize()` time (resolved directly from the engine component registry). **5. Pipe handler identity guard and safe close (`task_exchanger.py`)** Pipe handler creation is refactored into `_create_pipe_handler()`, which binds a per-handler status callback that checks `self.pipe_handler is _h` before acting. This prevents a late `PEER_GONE` from round N's (now-stale) handler from stopping round N+1's handler. `stop(close_pipe=False)` is used because `CellPipe.close()` is irreversible — closing it in the status callback would prevent the next round from communicating. An explicit `self.pipe.close()` is added in `END_RUN` instead. **6. Defensive logging and heartbeat error handling (`pipe_handler.py`)** `_send_to_pipe` now logs `asked_to_stop` and `abort_triggered` status when a send is suppressed. Peer-gone detection logs the elapsed time since the last heartbeat. 
Heartbeat sends are wrapped in `try/except` so a broken pipe sets `asked_to_stop` and breaks the heartbeat loop cleanly instead of propagating an unhandled exception. **7. New unit tests (`test_download_complete_gating.py`, `test_download_initiated_gating.py`)** Two new test files covering the subprocess download-gating behaviour (the `download_done` wait, the validate-path fast return, and `launch_once`). Both use `FlareAgent.__new__()` to construct minimal agent stubs. To prevent `os._exit(0)` from killing the pytest-xdist worker: - An `autouse` `_no_os_exit` fixture patches `nvflare.client.flare_agent.os._exit` to a no-op for every test in the file. - `_make_agent()` sets `agent._launch_once = False` (the per-round path that calls `os._exit` directly, making the fixture's patch the active guard). --- ### Files changed | File | Change | |------|--------| | `nvflare/app_common/executors/launcher_executor.py` | Deferred `stop_task()` background thread + `_deferred_stop_event` round-boundary coordination | | `nvflare/app_common/executors/client_api_launcher_executor.py` | Set `_stop_task_wait_timeout = download_complete_timeout`; add `_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config | | `nvflare/app_common/abstract/launcher.py` | `needs_deferred_stop()` abstract method + idempotency/thread-safety note on `stop_task()` | | `nvflare/app_common/launchers/subprocess_launcher.py` | Implement `needs_deferred_stop()`; add info logging for process start/stop | | `nvflare/app_common/executors/task_exchanger.py` | Refactor pipe handler creation into `_create_pipe_handler()` with identity-checking callback; use `close_pipe=False` to prevent irreversible `CellPipe.close()` | | `nvflare/app_common/widgets/metric_relay.py` | Include `msg.data` in pipe status log message | | `nvflare/fuel/utils/pipe/pipe_handler.py` | Enhanced logging (send failures, peer-gone elapsed time); heartbeat send error handling | | `nvflare/client/config.py` | Add `LAUNCH_ONCE` config 
key + `get_launch_once()` | | `nvflare/client/flare_agent.py` | `launch_once`-aware `_do_submit_result()`: direct `os._exit` vs `atexit` | | `nvflare/client/ex_process/api.py` | Pass `launch_once` to `FlareAgentWithFLModel` | | `tests/unit_test/client/test_download_complete_gating.py` | **New** — tests for DOWNLOAD_COMPLETE_CB registration, download-done wait, timeout, status logging, and cleanup | | `tests/unit_test/client/test_download_initiated_gating.py` | **New** — tests for thread-local download-initiation detection (validate-path fast return, no spurious 1800 s wait) | | `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | New tests for deferred stop, `_deferred_stop_event`, `_stop_task_wait_timeout` | ### Types of changes <!--- Put an `x` in all the boxes that apply, and remove the not applicable items --> - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.
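The round-boundary coordination from items 1 and 2 can be modeled with a small sketch. This is a toy illustration of the `threading.Event` pattern, not the real `LauncherExecutor`: the method names and callback wiring are assumptions, and the real code polls `check_run_status()` where this sketch takes a callable.

```python
import threading

class DeferredStopSketch:
    """Toy model of the deferred-stop/next-launch handshake (assumed names)."""

    def __init__(self, wait_timeout: float = 5.0):
        # Initially set so the very first round may launch immediately.
        self._deferred_stop_event = threading.Event()
        self._deferred_stop_event.set()
        self._wait_timeout = wait_timeout

    def finalize(self, wait_for_exit, stop_task):
        """Round N: defer stop_task() until the subprocess exits naturally."""
        self._deferred_stop_event.clear()

        def _poll():
            try:
                wait_for_exit()  # e.g. poll check_run_status() until exit
                stop_task()
            finally:
                # Always release round N+1, even if stop_task() raises.
                self._deferred_stop_event.set()

        threading.Thread(target=_poll, daemon=True).start()

    def initialize_next_round(self, launch_task, force_stop):
        """Round N+1: wait for round N's deferred stop before launching."""
        if not self._deferred_stop_event.wait(self._wait_timeout):
            force_stop()  # fallback: force the stale subprocess down
        launch_task()
```

The `finally` block is the load-bearing piece: without it, an exception in `stop_task()` would leave the event cleared and deadlock every subsequent round.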
…68367) The FOBS Packer.unpack() path validated the decomposer name against BUILTIN_DECOMPOSERS but never validated the attacker-controlled fobs_type field before passing it to load_class(). An authenticated FL participant could set fobs_dc to a trusted builtin decomposer (e.g. DataClassDecomposer) and fobs_type to an arbitrary Python class path, causing importlib.import_module() to execute import-time code on the server (full RCE, CVSS 8.8). Fix: - Add BUILTIN_TYPES allowlist in builtin_decomposers.py containing all NVFlare data-classes and enum types that are legitimately deserialized via generic decomposers (DataClassDecomposer / EnumTypeDecomposer). - Introduce _type_name_whitelist in fobs.py, pre-seeded from BUILTIN_TYPES, that gates load_class(type_name) in Packer.unpack(). - register_data_classes() and register_enum_types() automatically add to the whitelist so app-registered types continue to work. - Add public add_type_name_whitelist() API for custom job types that need to be explicitly allowed before deserialization. - reset() intentionally does not clear the whitelist so that types auto-registered during pack() in the same process remain accessible after a reset (preserves existing unit-test round-trip patterns). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
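The allowlist gate described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function and variable names follow the description (`BUILTIN_TYPES`, `add_type_name_whitelist`), but the actual NVFlare code paths and the seed contents differ; the sample entry is only a placeholder.

```python
import importlib

# Pre-seeded with NVFlare data-class and enum type names that are
# legitimately deserialized via the generic decomposers (sample entry only).
BUILTIN_TYPES = {
    "nvflare.apis.dxo.DXO",
}

_type_name_whitelist = set(BUILTIN_TYPES)

def add_type_name_whitelist(*type_names: str) -> None:
    """Public API: allow custom job types before deserialization."""
    _type_name_whitelist.update(type_names)

def load_class_checked(type_name: str):
    """Gate importlib-based class loading on the allowlist.

    importlib.import_module() executes import-time code, so an unchecked,
    attacker-controlled type_name is an RCE primitive; the whitelist check
    must come before any import happens.
    """
    if type_name not in _type_name_whitelist:
        raise ValueError(f"type name '{type_name}' is not in the FOBS whitelist")
    module_name, _, class_name = type_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

The ordering matters: rejecting the name before `import_module()` runs means even the import-time side effects of a malicious module are never triggered.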
Still messed up, closing
Summary
Fixes a Remote Code Execution (RCE) vulnerability in the FOBS deserialization subsystem reported in NVBug #5968367.
`Packer.unpack()` validated the decomposer name (`__fobs_dc__`) against `BUILTIN_DECOMPOSERS` but never validated the attacker-controlled type name (`__fobs_type__`) before passing it to `load_class()`, which calls `importlib.import_module()`. An authenticated FL participant could craft a malicious payload with a trusted builtin decomposer and an arbitrary Python class path to achieve full RCE on the aggregation server. The fix adds a `BUILTIN_TYPES` allowlist in `builtin_decomposers.py` covering all types legitimately used across the NVFlare codebase and examples. The `Packer.unpack()` method now validates `type_name` against this whitelist (and the runtime-registered `_decomposers`) before calling `load_class()`. A public `add_type_name_whitelist(*type_names)` API lets users extend the whitelist with custom types at runtime. ### Security Impact
- **Severity**: CVSS 8.8 (`AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H`) - **Attack vector**: a crafted payload with `__fobs_dc__` set to a trusted builtin decomposer and `__fobs_type__` set to any importable Python class path ### Test plan
- Deserializing a payload with a non-whitelisted `__fobs_type__` raises `ValueError` - Exploit-style payload is blocked (`__fobs_type__`: `"subprocess.Popen"`) 🤖 Generated with Claude Code