
[2.7] Fix FOBS deserialization RCE via type_name whitelist (NVBug #5968367)#4294

Open
nvidianz wants to merge 190 commits into NVIDIA:main from nvidianz:fix_rce_bug_in_fobs

Conversation

@nvidianz
Collaborator

Summary

Fixes a Remote Code Execution (RCE) vulnerability in the FOBS deserialization subsystem reported in NVBug #5968367.

  • Root cause: Packer.unpack() validated the decomposer name (__fobs_dc__) against BUILTIN_DECOMPOSERS but never validated the attacker-controlled type name (__fobs_type__) before passing it to load_class(), which calls importlib.import_module(). An authenticated FL participant could craft a malicious payload with a trusted builtin decomposer and an arbitrary Python class path to achieve full RCE on the aggregation server.
  • Fix: Added a BUILTIN_TYPES whitelist in builtin_decomposers.py covering all types legitimately used across the NVFlare codebase and examples. The Packer.unpack() method now validates type_name against this whitelist (and the runtime-registered _decomposers) before calling load_class().
  • Public API: Added add_type_name_whitelist(*type_names) for users to extend the whitelist with custom types at runtime.
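The validation flow described above can be sketched as follows; the allowlist contents and helper names here are illustrative stand-ins, not the actual NVFlare implementation:

```python
import importlib

# Illustrative allowlist standing in for BUILTIN_TYPES in builtin_decomposers.py.
BUILTIN_TYPES = {"collections.OrderedDict", "datetime.datetime"}
_runtime_whitelist = set()

def add_type_name_whitelist(*type_names):
    # Mirrors the new public API: extend the whitelist with custom types at runtime.
    _runtime_whitelist.update(type_names)

def load_class_checked(type_name):
    # Reject any type name not explicitly whitelisted, closing the
    # arbitrary-import path through load_class()/importlib.import_module().
    if type_name not in BUILTIN_TYPES and type_name not in _runtime_whitelist:
        raise ValueError(f"FOBS type not whitelisted: {type_name}")
    module_name, _, class_name = type_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)
```

With a check of this shape in place, a payload carrying `__fobs_type__: "subprocess.Popen"` fails before any import occurs.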

Security Impact

  • CVE severity: CVSS v3.1 8.8 High (AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H)
  • Attack vector: Authenticated FL participant sends crafted FOBS message with __fobs_dc__ set to a trusted builtin decomposer and __fobs_type__ set to any importable Python class path
  • Impact: Full RCE on the NVFlare aggregation server

Test plan

  • Verify that deserializing a legitimate FOBS message with a whitelisted type succeeds
  • Verify that deserializing a crafted FOBS message with a non-whitelisted __fobs_type__ raises ValueError
  • Verify the PoC from NVBug #5968367 is blocked (payload with __fobs_type__: "subprocess.Popen")
  • Run existing FOBS unit tests to confirm no regressions
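A unit test for the rejection case might look like the sketch below; `unpack` here is a minimal local stand-in for `Packer.unpack()` (the real method decodes msgpack data and dispatches to decomposers), and the allowlist entry is illustrative:

```python
ALLOWED_TYPES = {"nvflare.app_common.abstract.fl_model.FLModel"}  # illustrative entry

def unpack(payload: dict):
    # Stand-in for Packer.unpack(): validate the attacker-controlled
    # type name before any class loading happens.
    type_name = payload.get("__fobs_type__")
    if type_name not in ALLOWED_TYPES:
        raise ValueError(f"non-whitelisted __fobs_type__: {type_name}")
    return type_name  # real code would reconstruct the object here

def test_poc_payload_blocked():
    # Crafted payload shape from the PoC: trusted decomposer, arbitrary class path.
    crafted = {"__fobs_dc__": "DictDecomposer", "__fobs_type__": "subprocess.Popen"}
    try:
        unpack(crafted)
        raise AssertionError("crafted payload was not rejected")
    except ValueError:
        pass
```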

🤖 Generated with Claude Code

holgerroth and others added 30 commits December 16, 2025 13:13
NVIDIA#3896)

Fixes # .

### Description

Apply changes from NVIDIA#3878

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Allow only one project admin
Disable patching role by creator

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
### Description

Cherry pick fixes from main
NVIDIA#3882
NVIDIA#3904

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>
### Description

Changed provision to use http as default scheme.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
### Description

Backported Downloader rework and bug fixes from main.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com>
### Description

Cherry-pick of PR NVIDIA#3898 

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

Changed all info level log to debug in ViaDownloader

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…IDIA#3911)

Convert PSI example from JSON configs to use recipe

- Converts PSI example from JSON configs to use recipe
- Adds DhPSIRecipe

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Fixes # .

### Description

A few sentences describing the changes proposed in this pull request.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # .

### Description

Updated CIFAR-10 PyTorch examples to use the Recipe API and Client API for running federated learning jobs, including easy setup of experiment tracking and homomorphic encryption with MLflow in a real-world setting.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <hroth@nvidia.com>
Fixes # .

### Description

- Restructure CIFAR-10 PT example to separate the client training
scripts for each algorithm.
- Add additional documentation. 
- Rerun experiments with fixed model initialization seed.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <hroth@nvidia.com>
Fixes # .

### Description

Refactor CIFAR-10 TensorFlow example to use recipe-style jobs.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <hroth@nvidia.com>
…BAnalyticsReceiver (NVIDIA#3918)

Cherry pick consolidate BaseFedJob and fedavg.py and Fix import error
for TBAnalyticsReceiver.

### Description

Cherry pick consolidate BaseFedJob and fedavg.py and Fix import error
for TBAnalyticsReceiver.

Dependencies in the second one made it complicated to do separately, so
cherry picking them together.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Cherry-pick NVIDIA#3906 and NVIDIA#3915 and NVIDIA#3921

### Description
Cherry-pick NVIDIA#3906 and NVIDIA#3915 and NVIDIA#3921

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: nvkevlu <55759229+nvkevlu@users.noreply.github.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
…#3929)

Preflight check tool was hardcoded to test against the GRPC communication scheme. We have added more schemes, and our default is now HTTP, so the tool should be updated accordingly.

- Update preflight check
- Remove overseer test
- Update preflight check tests

<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Fixes # .

### Description

A few sentences describing the changes proposed in this pull request.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # .

### Description

Add missing `load_cifar10_with_retry` function.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

Apply same updates for KM example to 2.7

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

The documentation now correctly positions this as a testing/development
example that demonstrates the limitations of running server and client
together in a time-limited SLURM job, while hinting that production
deployments would require a different architecture (standalone server
with resident clients).

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # .

### Description

Add missing tensorboard requirements to hello-world examples.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Cherry pick cross site eval NVIDIA#3923 

### Description

Cherry pick cross site eval NVIDIA#3923 .

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

Improve error message in client API if `flare.init()` hasn't been
called.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
### Description

1. Removed references to wss as a scheme/protocol. It's only used internally by the aiohttp library.
2. Increased the downloader timeout in decomposers to 60s; the shorter timeout was causing download failures for large models.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Cherry pick NVIDIA#3930 NVIDIA#3945

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
These 2 examples are the same, consolidate them.

### Description

These 2 examples are the same, consolidate them.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Fixes # .

### Description

A few sentences describing the changes proposed in this pull request.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # .

### Description

- Expose launch_once option in ScriptRunner to be used by recipes.
- Adds new unit tests.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Cherry-picked from commit 7e87abc (PR NVIDIA#3942)

### Description

Simplifies the cross-site evaluation (CSE) API by removing unnecessary
parameters, adding framework auto-detection, and auto-configuring
validators. This change dramatically reduces the cognitive load for
users adding CSE to their federated learning workflows.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

Fix the central TensorBoard logging path to be consistent with federated
examples

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

Convert bionemo examples to use FedAvgRecipe

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Signed-off-by: Holger Roth <hroth@nvidia.com>
Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Fixes # .

### Description

Apply LLM updates to 2.7 branch

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes # .

### Description

GNN updates to 2.7

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…IA#3970)

Fixes # .

### Description

A few sentences describing the changes proposed in this pull request.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
chesterxgchen and others added 29 commits February 25, 2026 03:46
## Issues addressed
This PR addresses the reported recipe bug list:

- `BUG-8` (Critical): `CyclicRecipe` `model=None + initial_ckpt` created
phantom persistor reference and runtime failure.
- `BUG-1` (Critical): `CyclicRecipe` dict model config caused
`TypeError`.
- `BUG-7` (Critical): base `CyclicRecipe` silently ignored
`initial_ckpt`.
- `BUG-9` (High): `FedAvgRecipe` `per_site_config["server"]` caused invalid target handling/crash.
- `BUG-4` (High): `Recipe.execute/export` mutated job state and made
recipe reuse unsafe.
- `BUG-2` (Medium): `FedAvgRecipe` per-site `or` fallback ignored falsy
overrides.
- `BUG-3` (Medium): `SimEnv` with `num_clients=0` and client list
resolved `num_threads=0`.
- `BUG-5` (Low): base `FedAvgRecipe` accepted `initial_ckpt` +
`model=None` but produced incomplete model source.
- `BUG-6` (Low): `add_cross_site_evaluation` idempotency guard depended
only on transient `_cse_added` attribute.

## Changes
- Hardened base `CyclicRecipe` model/persistor setup:
  - fail-fast when a valid persistor cannot be configured;
  - support dict model config for supported frameworks;
  - apply/validate `initial_ckpt` for wrapper and framework-specific paths.
- Hardened base `FedAvgRecipe`:
  - validate and reject reserved `per_site_config` targets (`server`, `@ALL`);
  - preserve falsy per-site override values via explicit `is not None` fallback;
  - fail-fast when no persistor and no model params are available.
- Added shared framework persistor setup utility in recipe utils and
wired both Cyclic/FedAvg paths to it.
- Made `Recipe.execute()` and `Recipe.export()` reusable-safe by
snapshot/restore of temporary execution params.
- Fixed `SimEnv` default resolution for `num_clients`/`num_threads` when
explicit client list is provided.
- Hardened CSE idempotency with workflow-based detection in addition to
`_cse_added` fast-path.
- Updated impacted examples and exports:
  - `hello-cyclic` now passes `min_clients` explicitly;
  - `hello-numpy` uses transfer-type API;
  - recipe module exports include `add_cross_site_evaluation`.
- Expanded recipe unit tests with targeted regressions for all above
behaviors.
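The BUG-2 fix is the classic falsy-override pitfall; a minimal illustration with hypothetical config keys (not the actual recipe fields):

```python
# Hypothetical per-site override merge illustrating the BUG-2 fix.
defaults = {"train_timeout": 300, "cache_enabled": True}
site_override = {"train_timeout": 0, "cache_enabled": False}  # falsy but intentional

# Buggy merge: `or` discards 0, False, "", and None alike.
buggy = {k: site_override.get(k) or defaults[k] for k in defaults}
assert buggy["train_timeout"] == 300  # the explicit 0 is silently lost

# Fixed merge: only None falls back to the default.
fixed = {k: site_override[k] if site_override.get(k) is not None else defaults[k]
         for k in defaults}
assert fixed == {"train_timeout": 0, "cache_enabled": False}
```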

## Reason for these changes
These updates enforce clear model-source contracts and remove
silent/implicit fallback behavior that could generate invalid jobs or
runtime crashes. The goal is deterministic recipe behavior, safe object
reuse, and predictable per-site/CSE configuration semantics in 2.7.
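The snapshot/restore approach for `Recipe.execute()`/`export()` can be sketched like this (a simplified stand-in, not the actual `spec.py` code):

```python
import copy

class Recipe:
    # Simplified stand-in illustrating reuse-safe execution via snapshot/restore.
    def __init__(self, params):
        self.params = params

    def execute(self, env_overrides):
        snapshot = copy.deepcopy(self.params)  # capture state before mutation
        try:
            self.params.update(env_overrides)  # temporary execution params
            return dict(self.params)           # stand-in for building/submitting the job
        finally:
            self.params = snapshot             # restore, so the recipe can be reused
```

Calling `execute()` repeatedly with different environments now yields consistent results, because the object's state is restored after each run.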

## Affected files
Core implementation:
- `nvflare/recipe/cyclic.py`
- `nvflare/recipe/fedavg.py`
- `nvflare/recipe/spec.py`
- `nvflare/recipe/sim_env.py`
- `nvflare/recipe/utils.py`
- `nvflare/recipe/poc_env.py`
- `nvflare/recipe/__init__.py`

Framework wrappers / examples:
- `nvflare/app_opt/pt/recipes/cyclic.py`
- `nvflare/app_opt/tf/recipes/cyclic.py`
- `examples/hello-world/hello-cyclic/job.py`
- `examples/hello-world/hello-numpy/job.py`
- `examples/tutorials/job_recipe.ipynb`

Tests:
- `tests/unit_test/recipe/cyclic_recipe_test.py`
- `tests/unit_test/recipe/fedavg_recipe_test.py`
- `tests/unit_test/recipe/spec_test.py`
- `tests/unit_test/recipe/sim_env_test.py`
- `tests/unit_test/recipe/utils_test.py`

## Test strategy
- Targeted regression suites:
- `pytest -q tests/unit_test/recipe/cyclic_recipe_test.py
tests/unit_test/recipe/fedavg_recipe_test.py
tests/unit_test/recipe/spec_test.py`
- `pytest -q tests/unit_test/recipe/sim_env_test.py
tests/unit_test/recipe/utils_test.py`
- Full recipe suite:
  - `pytest -q tests/unit_test/recipe`
- Result: recipe suite passed (`183 passed, 13 skipped`).

Made with [Cursor](https://cursor.com)
…e enhancements (NVIDIA#4225)

## Summary

This PR addresses several related bugs and improvements in the Client
API subprocess path, swarm learning, and the recipe interface.
Address FLARE-2743 (bug/5921094) and (bug/5921708)

---

## Issues

### 1. LazyDownloadRef dtype error with pass-through streaming
When pass-through tensor streaming is enabled on the CJ, param
converters (e.g. `NumpyToPTParamsConverter`) were running on the CJ
*before* lazy references were materialized. This caused `"Could not
infer dtype of LazyDownloadRef"` warnings and potential data corruption
on large-model jobs using pass-through transfer.

### 2. Subprocess Client API had no converter wiring
The `ExProcessClientAPI` (subprocess path) created a
`FlareAgentWithFLModel` with no converters attached. Format conversion
was happening on the CJ side in `LauncherExecutor.execute()`, which
triggered issue #1.

### 3. Missing `learn_task_check_interval` in `CCWFJob.add_swarm()`
`SwarmClientConfig.learn_task_check_interval` was not forwarded to the
swarm controller, so any customization of that parameter was silently
ignored.

### 4. `extract_participants()` did not support the `{"targets": [...]}` dict form
The deploy map validation only handled list-form and `{"sites": [...]}`
dict-form entries. The `{"targets": [...]}` form (used by some job
generators) raised an unhandled error.

### 5. `SimpleSwarmLearningRecipe` missing parameters
`min_clients`, `launch_external_process`, and `command` could not be set
through the recipe interface, forcing users to construct jobs manually.

### 6. 3rd-party integration docs referenced the deprecated `FlareAgentWithCellPipe`
The class has been superseded by constructing a `CellPipe` and
`FlareAgent` directly, or using the higher-level Client API
(`nvflare.client`).

---

## Issues fixed from review comments

### 7. `task_name` read from absent shareable header in `FlareAgentWithFLModel.shareable_to_task_data()`
The task name was read via `shareable.get_header(FLContextKey.TASK_NAME,
"")`, but this header is never populated in the subprocess path — the
task name is carried as the pipe `Message.topic` and stored in
`self.current_task.task_name` by `FlareAgent._on_task()`. An absent
header caused `task_name` to fall back to `""`, making
`ParamsConverter.process()` silently skip conversion. Fixed to use
`self.current_task.task_name if self.current_task else ""`, consistent
with `task_result_to_shareable()`.

### 8. Double-conversion when `from_nvflare_converter_id` / `to_nvflare_converter_id` set on launcher executor
`LauncherExecutor.execute()` still applied CJ-side converters if users
explicitly set `from_nvflare_converter_id` / `to_nvflare_converter_id`,
while the subprocess was also applying its own default converters —
resulting in double conversion. Fixed by removing the converter calls
from `LauncherExecutor.execute()` entirely. Since `LauncherExecutor` is
only ever subclassed by `ClientAPILauncherExecutor` (subprocess path),
this has no effect on the in-process path.

---

## Approach

**Converter logic removed from CJ launcher executors; moved to the
subprocess agent side.**

Previously, `PTClientAPILauncherExecutor` and
`TFClientAPILauncherExecutor` set up format converters in their
`initialize()` methods, and `LauncherExecutor.execute()` applied them on
the CJ before dispatching to the subprocess. This was the root cause of
issue #1 — the CJ was touching tensor data (triggering dtype inference)
before `LazyDownloadRef` placeholders were materialized in the
subprocess.

The converter setup blocks have been **completely removed** from both
launcher executors. Converters for the subprocess path now live entirely
inside the subprocess, wired into `FlareAgentWithFLModel` and executing
after full payload materialization.

Key changes:
- `FlareAgentWithFLModel` gains optional `from_nvflare_converter` /
`to_nvflare_converter` params. A lightweight `_ConverterContext` stub
(duck-typing `FLContext.get_prop/set_prop`) allows
`ParamsConverter.process()` to work without a real `FLContext` in the
subprocess.
- `SERVER_EXPECTED_FORMAT` is added to `ConfigKey` and propagated in the
subprocess launch config so the agent knows what format the server
expects.
- A shared factory `converter_utils.create_default_params_converters()`
is introduced, replacing four separate copies of the same
format-conditional logic across PT and TF executors.
- In-process executors (`PTInProcessClientAPIExecutor`,
`TFInProcessClientAPIExecutor`) are **unchanged in behavior** — they
still apply converters on the CJ side, which is correct since there is
no separate subprocess.

---

## Reason for this approach over alternatives

An earlier iteration kept the converter setup in `initialize()` but
added a `ClientAPILauncherExecutor.execute()` override that nulled the
converters out before calling `super().execute()` and restored them
after. This was removed because it left converters alive on the object
but always bypassed — confusing dead code. The current approach is
explicit and honest: launcher executors do not set converters at all,
and the subprocess handles its own conversion autonomously.

---

## Affected files

| Area | Files |
|------|-------|
| Core bug fixes | `ccwf_job.py`, `fed_utils.py` |
| Converter framework | `converter_utils.py` (new), `config.py`,
`flare_agent_with_fl_model.py`, `ex_process/api.py`,
`client_api_launcher_executor.py` (base) |
| PT launcher executor — converter removed |
`pt/client_api_launcher_executor.py` |
| PT in-process executor — refactored to shared factory |
`pt/in_process_client_api_executor.py` |
| TF launcher executor — converter removed + `initialize()` deleted |
`tf/client_api_launcher_executor.py` |
| TF in-process executor — refactored to shared factory |
`tf/in_process_client_api_executor.py` |
| Recipe | `pt/recipes/swarm.py` |
| Docs | `3rd_party_integration.rst`, `3rd_party_trainer.py` |
| Tests | 4 new test files, 4 modified test files, 2 integration YAMLs |

---

## Trade-offs and future work

- **TF/Keras converter path is not directly unit-tested** due to
TF/PyTorch dependency conflicts in a shared environment. The
`create_default_params_converters` Keras branch is structurally
identical to the PT branch (which is fully tested). A dedicated TF-only
test environment could add coverage in a future pass.
- **`_ConverterContext` is a duck-type shim.** If more `ParamsConverter`
use cases arise in subprocess contexts, a formal lightweight interface
(separate from `FLContext`) would be worth extracting.
- **Converter plugin (`from_nvflare_converter_id`)** for the subprocess
path is currently not wired — only the auto-created default converters
are passed to the agent. Explicit user-provided converter IDs are still
only supported for the in-process path. This can be addressed in a
follow-up.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### Description

Cherry pick of PR NVIDIA#4229.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [x] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays
on Frontier) abort in
Round 0 due to a cascading startup failure chain.  The root sequence is:

1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from
relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` —
not a failure — so
   timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out,
and
   `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very
first heartbeat with
   no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not
iterable` when
`get_job_clients()` receives `None` metadata from an already-aborted job

PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in
2.7.2) address the
transport layer.  This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure
(`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from
`failed_clients`,
so timed-out clients counted as "successfully deployed" for the
`min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment
timeout"` label.
The existing `min_sites` / `required_sites` logic then correctly decides
whether to abort.

### 2 — `check_client_replies()`: Return timed-out clients instead of
raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting
the whole job even
when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list
rather than raising.
Explicit errors (non-OK return code or error body) still raise. Also
fixes the non-strict
mode to use name-keyed dict lookup instead of fragile positional
`zip()`.

New signature: `check_client_replies(...) -> List[str]` (timed-out
client names; empty = none).
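
The strict-mode behavior described above can be sketched as below. This is a simplified illustration, not the actual `admin.py` code; the reply shape (`{"rc": ...}` dicts keyed by client name) is assumed for the example.

```python
# Sketch: timeouts (reply is None) are collected into a return list
# instead of raising; explicit errors still raise. Replies are looked
# up by client name, not by fragile positional zip().
def check_client_replies(replies):
    timed_out = []
    for name, reply in replies.items():
        if reply is None:
            timed_out.append(name)  # timeout: report, don't abort
        elif reply.get("rc") != "OK":
            raise RuntimeError(f"client {name} failed: {reply.get('rc')}")
    return timed_out
```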

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation
(`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire
job with no
tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`.
If remaining
active clients >= `min_sites`, log a warning and proceed. Only abort
when below tolerance.
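
The tolerance decision can be sketched as a small helper (illustrative names, not the actual `job_runner.py` code):

```python
# Sketch: exclude timed-out stragglers and abort only when the
# remaining active clients fall below min_sites.
def decide_after_timeouts(all_clients, timed_out, min_sites):
    active = [c for c in all_clients if c not in timed_out]
    if len(active) < min_sites:
        raise RuntimeError(
            f"only {len(active)} active clients, min_sites={min_sites}"
        )
    return active  # proceed (caller logs a warning about exclusions)
```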

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to
`True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to
`False`, meaning
the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive
legacy detection
can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of
`job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as
`job_info["_reported_clients"]`,
injecting algorithm state into a data dict with no corresponding
`RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]`
on `FederatedServer`.
Stale entries are purged whenever a job is no longer in `run_processes`.

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation
(`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque
`TypeError` when
`JOB_CLIENTS` is absent or the wrong type.
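
The validation can be sketched as follows (the meta key name and error message here are illustrative, not the actual `client_run_manager.py` code):

```python
# Sketch: fail fast with a descriptive RuntimeError instead of an
# opaque TypeError from iterating None.
def get_job_clients(meta):
    clients = meta.get("job_clients") if isinstance(meta, dict) else None
    if not isinstance(clients, list):
        raise RuntimeError(
            "job meta has missing or invalid job_clients: "
            f"got {type(clients).__name__}"
        )
    return clients
```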

---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few
startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments
time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced,
straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default)
| Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging |
Disconnects stalled relay connections |

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout
as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()`
returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()`
default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta
validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error
still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within
tolerance → warn; timeout below tolerance → raise |
| `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as
failure; OK reply not failed; mixed outcomes; detail label; min_sites
with timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy
explicit-False still fires; tracking in server attr not job_info; stale
cleanup |

All 29 targeted unit tests pass.

## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout
classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true`
and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…VIDIA#4231)

## Problem Statement

Three related issues were discovered in the NVFlare client-side stack:

### Issue 1: RSS Memory Growth Across FL Rounds

After each federated learning round, the client process RSS grew
monotonically. Model parameters — both the received global model and the
locally trained model — were kept alive long after `flare.send()`
completed. The reference chain kept them live until the next `receive()`
overwrote `self.fl_model`, meaning one full round's worth of tensors was
always double-buffered in memory.

### Issue 2: Race Condition in `PipeHandler.send_to_peer()` — Premature
`delete_msg_root`

The original `send_to_peer()` had:

```python
finally:
    # the msg_id is also used as msg_root_id
    delete_msg_root(msg.msg_id)
```

This was a race condition. `CellPipe.send_request()` returns as soon as
the peer acknowledges receipt via `make_reply(ReturnCode.OK)` in
`_receive_message` — which happens **before** the peer finishes
downloading model params. The `finally` block fired
`delete_msg_root(msg.msg_id)` and destroyed the active download
transaction while the CJ was still pulling tensors from the subprocess,
causing intermittent download failures.

### Issue 3: Silent Logging Gap in Subprocess Mode

When `launch_external_process=True`, `SubprocessLauncher` spawns
`python3 -u client.py`. This new process starts with Python's
unconfigured logging system — `lastResort` only emits `WARNING+`, so
every `logger.info()` inside the subprocess is silently dropped. All
INFO-level diagnostics (NVFlare internals, user training progress) were
invisible in subprocess mode.

---

## Approach and Rationale

### Fix 1: Release Model Params Eagerly After `send()` — commit
`39eb631d`

**Files:** `nvflare/client/in_process/api.py`,
`nvflare/client/ex_process/api.py`, `nvflare/client/model_registry.py`

After `flare.send()`, both the sent model's params and the received
global model's params are dead weight — already serialized and
transmitted. We now explicitly null them immediately when
`clear_cache=True` (the default):

- **`InProcessClientAPI`**: Nulls `model.params`,
`model.optimizer_params`, `self.fl_model.params`,
`self.fl_model.optimizer_params` before clearing `self.fl_model`.
- **`ExProcessClientAPI`**: Calls `model_registry.release_params(model)`
which nulls both the sent and received model's params.

**Why not `gc.collect()`?** CPython's reference counting reclaims
objects as soon as their refcount drops to zero — `gc.collect()` only
helps with reference *cycles*. PyTorch tensors don't form cycles. Once
the params dicts are nulled, tensor memory is freed immediately.
Confirmed experimentally: RSS stays flat without `gc.collect()`.

**Why null params specifically?** Params (the large tensor dicts) are
the dominant memory consumer. Nulling them frees the tensors while
keeping the `FLModel` shell (metadata, metrics, round) alive for any
code that reads those fields after `send()`.
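
The release pattern can be sketched like this (a minimal stand-in, not the actual `model_registry.release_params()`; the `FLModelStub` class is hypothetical):

```python
# Hypothetical FLModel-like shell for illustration.
class FLModelStub:
    def __init__(self, params, optimizer_params=None, metrics=None):
        self.params = params
        self.optimizer_params = optimizer_params
        self.metrics = metrics


def release_params(model):
    # CPython refcounting frees the tensors as soon as the last
    # reference drops -- no gc.collect() needed for acyclic data.
    # The FLModel shell (metadata, metrics, round) stays alive.
    if model is not None:
        model.params = None
        model.optimizer_params = None
```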

### Fix 2: Remove Premature `delete_msg_root` from
`PipeHandler.send_to_peer()` — commit `39eb631d`

**File:** `nvflare/fuel/utils/pipe/pipe_handler.py`

Removed the `finally: delete_msg_root(msg.msg_id)` block and its import.
The `msg_id` doubles as the `msg_root_id` used by
`ViaDownloaderDecomposer` to track download transaction lifecycle.
Calling `delete_msg_root` in a `finally` block races against the peer's
download: the ack is immediate but the actual tensor download is
asynchronous. Download transaction cleanup now correctly happens in
`wf_comm_server.py` and `task_controller.py`, which call
`delete_msg_root` only after the task is fully processed server-side.

### Fix 3: Configure Subprocess Python Logging on `flare.init()` —
commit `bc6d87847`

**File:** `nvflare/client/ex_process/api.py` — added
`_configure_subprocess_logging()`

Immediately after loading `client_api_config.json`, `init()` calls
`_configure_subprocess_logging(client_config)`, which:

1. Reads `workspace_dir` from `client_api_config.json` under
`TASK_EXCHANGE.pipe.ARG.workspace_dir`
2. Loads `{workspace_dir}/local/log_config.json` (the site's standard
NVFlare log config)
3. Calls `apply_log_config(dict_config, workspace_dir)` to invoke
`logging.config.dictConfig()`

This wires up `consoleHandler -> sys.stdout` (captured by
`SubprocessLauncher`), `logFileHandler -> log.txt`, and `jsonFileHandler
-> log.json` — exactly as the main NVFlare process configures them. The
entire method is wrapped in `try/except Exception: pass` — a logging
setup failure must never crash the training script.

Called before the `try` block in `init()` so that even errors during
`flare_agent.start()` are properly logged.
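
The best-effort setup can be sketched as follows. This is a simplified illustration, assuming a plain `local/log_config.json` path and returning a bool for testability; the real method resolves the path from `client_api_config.json` and calls `apply_log_config()`.

```python
import json
import logging.config
import os


def configure_subprocess_logging(workspace_dir):
    # Best effort: a logging setup failure must never crash training.
    try:
        cfg_path = os.path.join(workspace_dir, "local", "log_config.json")
        with open(cfg_path) as f:
            logging.config.dictConfig(json.load(f))
        return True
    except Exception:
        return False
```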

### Fix 4: RSS Memory Profiling Utility — commit `bc6d87847`

**New file:** `nvflare/fuel/utils/mem_utils.py`

Zero-cost diagnostic to measure client-side RSS per round:

- Gated by `NVFLARE_CLIENT_MEMORY_PROFILE=1` (off by default). Name is
client-scoped so server/relay processes are unaffected.
- `_ENABLED` evaluated once at import time — single boolean check per
call, zero overhead when disabled.
- Uses `logger.info()`: after Fix 3, works correctly in both modes. RSS
lines land in `log.txt`/`log.json` alongside all other NVFlare
diagnostics.
- `psutil` imported lazily; entire call is `try/except`-wrapped so
missing install is a silent no-op.
- Instrumented at `receive()` and `send()` in both `InProcessClientAPI`
and `ExProcessClientAPI`.
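
The gating and lazy-import pattern can be sketched as (a simplified version of the idea, not the exact `mem_utils.py` contents):

```python
import logging
import os

logger = logging.getLogger(__name__)

# Evaluated once at import time: a single boolean check per call,
# effectively zero overhead when disabled.
_ENABLED = os.environ.get("NVFLARE_CLIENT_MEMORY_PROFILE") == "1"


def log_rss(tag):
    if not _ENABLED:
        return
    try:
        import psutil  # lazy: a missing install is a silent no-op

        rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
        logger.info(f"[RSS] {tag}: {rss_mb:.1f} MB")
    except Exception:
        pass
```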

---

## Files Changed

| File | Change |
|------|--------|
| `nvflare/client/api_spec.py` | `_maybe_cleanup_memory()` base
implementation and memory management attributes |
| `nvflare/client/in_process/api.py` | Eager param release in `send()`,
`log_rss()` instrumentation, `configure_memory_management()` |
| `nvflare/client/ex_process/api.py` | Eager param release,
`_configure_subprocess_logging()`, `log_rss()` instrumentation, memory
GC config |
| `nvflare/client/model_registry.py` | Added `release_params()` method |
| `nvflare/fuel/utils/pipe/pipe_handler.py` | Removed racy `finally:
delete_msg_root(msg.msg_id)` and its import |
| `nvflare/fuel/utils/mem_utils.py` | **New file** — `log_rss(tag)` RSS
profiling utility |
| `docs/client_api.rst` | Clarified param lifecycle and `clear_cache`
semantics |

---

## Verification

Tested with `hello-pt` (CIFAR-10, 2 clients, 2 rounds) using
`NVFLARE_CLIENT_MEMORY_PROFILE=1`:

**In-process mode:**
```
round=0 after_receive: 629.2 MB
round=0 after_send:    270.9 MB   <- params freed immediately
round=1 after_receive: 283.4 MB
round=1 after_send:    448.5 MB   <- flat, no accumulation
```

**Subprocess mode (`launch_external_process=True`):**
```
round=0 after_receive: 623.5 MB
round=0 after_send:    129.4 MB   <- params freed immediately
round=1 after_receive: 136.1 MB
round=1 after_send:    121.5 MB   <- flat, no accumulation
```

RSS stabilizes after round 0 in both modes. Subprocess logging
confirmed: `logger.info()` lines now appear in `site-N/log.txt` (written
directly by the subprocess) and captured via `SubprocessLauncher`.

## Test plan
- [x] In-process RSS profiling with `NVFLARE_CLIENT_MEMORY_PROFILE=1` —
`[RSS]` lines in `site-N/log.txt`
- [x] Subprocess RSS profiling with `launch_external_process=True` —
`[RSS]` lines in both subprocess log and `SubprocessLauncher` capture
- [x] Subprocess logging gap fixed — `logger.info()` calls now visible
(previously silently dropped)
- [x] Memory flat across rounds in both modes (no monotonic growth)
- [x] `flare.send(clear_cache=False)` path untouched — params only
released when `clear_cache=True`
- [x] `delete_msg_root` race removed — download transactions no longer
prematurely destroyed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ent, hierarchical startup stability [skip ci] (NVIDIA#4218)

## Merge Dependency

> ⚠️ **Depends on NVIDIA#4209** — The *Hierarchical FL Startup Stability*
section documents changes introduced by PR NVIDIA#4209 (currently open).
**Merge PR NVIDIA#4209 into `2.7` before merging this PR.** All other sections
cover already-merged PRs.

---

## Summary

This PR updates `docs/release_notes/flare_272.rst` to reflect all major
changes merged into the 2.7.x line after the initial 2.7.2 draft,
covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor
Copy at CJ (PR NVIDIA#4210) and client-side memory lifecycle management (PR
NVIDIA#4211)
- **F3 Streaming Reliability and Performance** — new section covering
HOL stall mitigation (PR NVIDIA#4206), stream pool starvation fix (PR
NVIDIA#4171/NVIDIA#4172), streaming download retry (PR NVIDIA#4167), RxTask self-deadlock
fix (PR NVIDIA#4204), and lock contention reduction (PR NVIDIA#4174)
- **Hierarchical FL Startup Stability** — new section covering
deployment timeout classification, startup grace period, selective
client exclusion, and metadata hardening (PR NVIDIA#4209 — pending merge),
with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly.

A source-level RST comment has been added above the Hierarchical FL
section in the file to alert future maintainers to the merge dependency.

## Merged PRs Documented

| PR | Area | Status |
|---|---|---|
| NVIDIA#4171 / NVIDIA#4172 | Stream pool starvation fix | Merged |
| NVIDIA#4174 | Lock contention reduction | Merged |
| NVIDIA#4167 | Streaming download retry | Merged |
| NVIDIA#4204 | RxTask self-deadlock fix | Merged |
| NVIDIA#4206 | HOL stall mitigation | Merged |
| NVIDIA#4210 | Zero tensor copy at CJ | Merged |
| NVIDIA#4211 | Client-side memory management | Merged |
| NVIDIA#4209 | Hierarchical FL startup stability | **Open — merge before this
PR** |

## Changes

### Memory Management (restructured)

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds
`LazyDownloadRef` placeholders instead of materializing full tensors,
eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` /
jemalloc purge / `torch.cuda.empty_cache()` injected after every
`flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.

### F3 Streaming Reliability and Performance (new section)

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress
watchdog, and stall detection/recovery. Includes recommended environment
variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a
dedicated `callback_thread_pool`, keeping stream workers free for
concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3
attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock`
released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside
`self.lock`; compare-and-store for cache write. Reduces model-download
latency under high client concurrency.

### Hierarchical FL Startup Stability (new section — pending PR NVIDIA#4209)

- **Deployment Timeout as Failure**: `reply=None` correctly counted
against `min_sites`; timed-out clients excluded before
`start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — a client
must be observed at least once before its absence triggers dead-job
notification. The require-prior-report flag now defaults to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded
rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced
with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json`
snippets for HPC (Frontier/ORNL) scale.

## Test plan

- [ ] Sphinx build (`make html`) passes without RST warnings on the
updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render
correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent
throughout the file
- [ ] Confirm PR NVIDIA#4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes # .

### Description

## FedOpt: default config_type for optimizer/scheduler, support
class_path

### Why
FedOptRecipe failed at config load when `optimizer_args` /
`lr_scheduler_args` omitted `config_type`: the component builder tried
to instantiate the optimizer (e.g. `torch.optim.SGD`) before the model
existed, so `params` were missing. Users had to remember to set
`config_type: "dict"`. Optimizer/scheduler configs also didn't support
`class_path`, unlike model_config.

### What
- **Recipe utils:** Added `ensure_config_type_dict()` — sets
`config_type: "dict"` when missing and, if only `class_path` is set,
copies it to `path` so the config layer keeps working.
- **FedOptRecipe (PT & TF):** Run optimizer_args and lr_scheduler_args
through `ensure_config_type_dict()` so `config_type` is no longer
required; accept both `path` and `class_path`.
- **PT shareable generator:** Resolve display name from `path` or
`class_path`.
- **Examples/docs:** Use `class_path` for optimizer/scheduler; remove
explicit `config_type` from FedOpt examples; fix cifar10_fedopt README
to use FedOptRecipe and the correct API.
- **Tests:** Cover both `path` and `class_path` for PT and TF; skip PT
tests when PyTorch is missing and TF tests when TensorFlow is missing or
fails to load, so either venv can run the suite without failures.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [x] In-line docstrings updated.
- [x] Documentation updated.
…NVIDIA#4239)

Fixes # .

### Description

Cherry-pick NVIDIA#4237

Enables consistent use of class_path for all objects configured through
the job API.

ComponentBuilder.get_class_path(): use 'path' or 'class_path' (path
takes precedence)
ComponentBuilder.is_class_config(): treat 'class_path' like 'path'
wfconf get_class_path(): same behavior for workflow/config parsing
Add tests: class_path-only config, path overrides class_path when both
present

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Fixes # .

### Description

Additional enhancements from NVIDIA#4240

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [x] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [x] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…ders (NVIDIA#4243)

Fixes # .

### Description

Cherry-pick NVIDIA#4241

## Summary
- replace the line-based license grep in runtest.sh with
ci/check_license_header.py to validate the full Apache 2.0 header
(year-flexible)
- keep the intended exception for public-domain files and keep existing
exclusions for protos and modeling_roberta.py
- normalize inconsistent license headers in affected files (missing full
block and minor text typos)

## Testing
- python3 ci/check_license_header.py nvflare examples tests integration
research

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…HROUGH, download gating, MSG_ROOT_TTL, LazyRef local-aggr, logging, min_clients, diagnostics (NVIDIA#4247)

## Summary

Comprehensive fix series for large-model federated training with
subprocess clients. Addresses memory accumulation root causes,
correctness crashes, timeout misconfiguration, and operational
diagnostics across 18 numbered fixes.

**Data-flow context** (subprocess-mode per round):
```
Forward path:  FL Server / Aggregator → [scatter] → Trainer CJ → [task delivery] → Subprocess
Reverse path:  Subprocess → [result]  → Trainer CJ → [relay]   → FL Server
```

---

## Problems & Approaches

### Fix 1 (Critical) — Serialize once, cache result on `Message` object
**Problem**: `CellPipe.send()` re-serialized the payload on every retry,
creating a new `ArrayDownloadable` download transaction per attempt.
With unbounded retries on a 5 GiB model, each failed send accumulated a
fresh ~5 GiB transaction, leading to OOM.

**Approach**: Cache the serialized `CellMessage` on the `Message` object
(e.g. `msg._cached_cell_msg`) on the first send. Retries reuse the same
bytes and the same download transaction — no re-serialization, no
transaction accumulation.

**Files**: `nvflare/fuel/utils/pipe/cell_pipe.py`
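
The serialize-once pattern can be sketched as below. The `encode`/`transmit` callables and the `release_send_cache` helper are illustrative stand-ins for the serialization and cell-send machinery, not the actual `cell_pipe.py` API.

```python
# Sketch: serialize exactly once and cache the result on the message;
# retries reuse the cached bytes instead of creating a fresh download
# transaction per attempt.
def send_with_cache(msg, encode, transmit, max_attempts=3):
    for _ in range(max_attempts):
        cached = getattr(msg, "_cached_cell_msg", None)
        if cached is None:
            cached = encode(msg)
            msg._cached_cell_msg = cached
        if transmit(cached):
            return True
    return False


def release_send_cache(msg):
    # Called exactly once after the retry loop exits (see Fix 2).
    if hasattr(msg, "_cached_cell_msg"):
        del msg._cached_cell_msg
```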

### Fix 2 — Explicit cache teardown after retry loop
**Problem**: The cached `CellMessage` held references to encoded bytes
and the FOBS download transaction long after the retry loop exited,
delaying garbage collection.

**Approach**: `PipeHandler._send_to_pipe()` calls
`pipe.release_send_cache(msg)` unconditionally in a `try/finally` after
the loop. The transaction is not cancelled on each retry (the peer may
be mid-download); it is released exactly once after the loop terminates.

**Files**: `nvflare/fuel/utils/pipe/pipe_handler.py`

### Fix 3 — `submit_result_timeout` configurable, sized for large-model
transfer
**Problem**: `FlareAgent.submit_result_timeout` defaulted to 60 s — far
too short for a 5 GiB transfer. There was no way to set it from the job
config.

**Approach**: Add `ConfigKey.SUBMIT_RESULT_TIMEOUT` to `ClientConfig`;
wire through `ClientAPILauncherExecutor` → `prepare_config_for_launch()`
→ `FlareAgentWithFLModel` → `FlareAgent`. Operator sets it per job via
`recipe.add_client_config({"submit_result_timeout": 1800})`.

**Files**: `nvflare/client/config.py`,
`nvflare/app_common/executors/client_api_launcher_executor.py`,
`nvflare/client/ex_process/api.py`

### Fix 4 — Release `base_obj` in `clear_cache()` after download
completes
**Problem**: `CacheableObject.transaction_done()` cleared `self.cache`
but not `self.base_obj`, keeping the numpy source array alive after all
receivers had finished downloading from it.

**Approach**: Set `self.base_obj = None` in `transaction_done()` so the
reference is dropped deterministically at transaction completion, not
deferred to GC.

**Files**: `nvflare/fuel/f3/streaming/cacheable.py`

### Fix 5 — `peer_read_timeout` / `submit_result_timeout` alignment
**Problem**: `peer_read_timeout` (CJ waiting for subprocess to read the
task) was 1800 s, but `submit_result_timeout` (subprocess waiting for CJ
to ACK result) defaulted to 60 s — structurally inconsistent. A
large-model send could exhaust `submit_result_timeout` multiple times
before CJ finished downloading.

**Approach**: With Fix 3 making `submit_result_timeout` configurable,
both timeouts are now set consistently at job-config level. Fix 13 adds
a startup warning when they are mismatched.

### Fix 6 — CJ ACKs task receipt before downloading
**Problem**: `CellPipe._receive_message()` returned `ReturnCode.OK` only
after FOBS deserialization completed inline — meaning the subprocess
waited for CJ to finish downloading before receiving the ACK. On slow
networks or large models, the subprocess timed out waiting for ACK while
CJ was still downloading.

**Approach**: Return the pipe-level ACK immediately on receipt, then
decode the payload asynchronously. This decouples the acknowledgment
latency from the download transfer time.

**Files**: `nvflare/fuel/utils/pipe/cell_pipe.py`
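
The decoupling can be sketched with a background decode (illustrative only; the real code acks at the cell-message layer rather than handing results to a queue):

```python
import threading


def receive_message(raw, decode, inbox):
    # ACK immediately; decode off-thread so acknowledgment latency is
    # decoupled from download/deserialization time.
    t = threading.Thread(target=lambda: inbox.put(decode(raw)), daemon=True)
    t.start()
    return "OK"  # pipe-level ack is returned before decode completes
```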

### Fix 7 — Gate PASS_THROUGH on `CellPipe` only
**Problem**: `ClientAPILauncherExecutor.initialize()` enabled
PASS_THROUGH regardless of pipe type. For `FilePipe` subprocesses
(third-party integrations), the subprocess cannot resolve cell FQCNs to
pull from the server's `DownloadService` directly — enabling
PASS_THROUGH in that mode silently breaks deserialization.

**Approach**: Check `isinstance(self.pipe, CellPipe)` before enabling
PASS_THROUGH. `FilePipe` mode keeps PASS_THROUGH disabled and CJ
continues to download and re-serialize into the file.

**Files**:
`nvflare/app_common/executors/client_api_launcher_executor.py`

### Fix 8 — Enable reverse PASS_THROUGH (subprocess → CJ → server)
**Problem**: On the result path, CJ decoded the subprocess's result
(downloading ~1 GB of tensors), then re-encoded and re-sent to the
server (another ~1 GB upload). CJ materialized the full model twice with
no benefit — it was a pure relay.

**Approach**: Enable `PASS_THROUGH=True` on the subprocess↔CJ pipe cell.
CJ creates `LazyDownloadRef` objects instead of downloading tensors.
When CJ re-encodes for the server, `LazyDownloadRefDecomposer` re-emits
the subprocess's original `fqcn + ref_id`. The server pulls directly
from the subprocess's `DownloadService` — CJ never materializes the
result tensors. `MSG_ROOT_TTL` is stamped on the outgoing cell message
to keep the subprocess transaction alive for the full download.

**Files**:
`nvflare/app_common/executors/client_api_launcher_executor.py`,
`nvflare/fuel/utils/pipe/cell_pipe.py`

### Fix 9 — `_MIN_DOWNLOAD_TIMEOUT` configurable via job config
**Problem**: `_MIN_DOWNLOAD_TIMEOUT` was a hardcoded 60 s constant. For
70 B models the inter-chunk gap can exceed 60 s on any realistic
network, causing `_monitor_tx` to kill an active transaction
mid-transfer.

**Approach**: Add `ConfigVarName.MIN_DOWNLOAD_TIMEOUT`; read via
`acu.get_positive_float_var()` in `_create_downloader()`, following the
same pattern as `STREAMING_PER_REQUEST_TIMEOUT`. Operator sets
`np_min_download_timeout` or `tensor_min_download_timeout` in job
config. 60 s remains the fallback default.

**Files**: `nvflare/apis/fl_constant.py`,
`nvflare/fuel/utils/fobs/decomposers/via_downloader.py`
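
The positive-float-with-fallback pattern can be sketched as below, illustrated with an environment variable for self-containment; the real code reads the job config via `acu.get_positive_float_var()`.

```python
import os


def get_positive_float_var(name, default):
    # Fall back to the default on a missing, non-numeric, or
    # non-positive value -- never let a bad config disable timeouts.
    try:
        v = float(os.environ.get(name, ""))
        return v if v > 0 else default
    except (TypeError, ValueError):
        return default
```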

### Fix 10 — Bound `max_resends`, wire via job config
**Problem**: `FlareAgent` accepted `max_resends` but it defaulted to
`None` (unlimited) and was not exposed through the job config system.
Unlimited retries with Fix 1's caching are safe, but still produce an
indefinite wait on persistent failures.

**Approach**: Add `ConfigKey.MAX_RESENDS` to `ClientConfig`; wire
through `ClientAPILauncherExecutor` → `prepare_config_for_launch()` →
`FlareAgentWithFLModel` → `FlareAgent` → `PipeHandler`. Default: 3.

**Files**: `nvflare/client/config.py`,
`nvflare/app_common/executors/client_api_launcher_executor.py`,
`nvflare/client/ex_process/api.py`

### Fix 12 — Client-job-side RSS logging and memory GC
**Problem**: Subprocess had `log_rss()` + `cleanup_memory()` via
`APISpec._maybe_cleanup_memory()`, but the client-job (CJ) process had
neither. Per-round RSS growth in CJ was invisible.

**Approach**: Override `check_output_shareable()` in
`ClientAPILauncherExecutor` to call `log_rss()` (gated on
`NVFLARE_CLIENT_MEMORY_PROFILE=1`) and `_maybe_cleanup_cj_memory()`
after each result relay. Reuses the existing `memory_gc_rounds` and
`cuda_empty_cache` config params.

**Files**:
`nvflare/app_common/executors/client_api_launcher_executor.py`
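
The gating described above can be sketched as below; `maybe_log_rss` and `maybe_cleanup_cj_memory` are simplified stand-ins for the real `log_rss()` / `_maybe_cleanup_cj_memory()` hooks:

```python
# Sketch of Fix 12's CJ-side hooks: RSS logging gated on an env var,
# GC every `memory_gc_rounds` rounds. Simplified stand-ins only; the
# real log_rss() reports the process resident-set size.
import gc
import os

def maybe_log_rss(tag: str) -> None:
    # Gated so the per-round log line costs nothing in production runs.
    if os.environ.get("NVFLARE_CLIENT_MEMORY_PROFILE") == "1":
        print(f"RSS {tag}")  # the real hook logs RSS here

def maybe_cleanup_cj_memory(round_num: int, memory_gc_rounds: int) -> bool:
    """Run gc.collect() every memory_gc_rounds rounds; True if GC ran."""
    if memory_gc_rounds <= 0 or round_num % memory_gc_rounds != 0:
        return False
    gc.collect()
    return True

assert maybe_cleanup_cj_memory(4, 2) is True   # round 4, GC every 2 rounds
assert maybe_cleanup_cj_memory(3, 2) is False  # off-round: skipped
assert maybe_cleanup_cj_memory(5, 0) is False  # feature disabled
```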

### Fix 12a — Harden subprocess logging config loading, prevent duplicate file writes
**Problem**: `_configure_subprocess_logging()` hardcoded
`local/log_config.json`, missing `.default`/`.conf`/`.yml` variants
common in provisioned deployments. Subprocess file handlers duplicated
every log line with the parent's rotating log files.

**Approach**: Use
`ConfigFactory.load_config(WorkspaceConstants.LOGGING_CONFIG,
search_dirs=[local_dir])` so all variants are resolved. Strip all
non-console handlers before applying the config — subprocess logs flow
exclusively through `SubprocessLauncher`'s stdout pipe to the parent.

**Files**: `nvflare/client/ex_process/api.py`
**Tests**: `tests/unit_test/client/test_logging_and_rss_tags.py` (6
tests)

### Fix 12b — Compact RSS role tags
**Problem**: RSS markers (`after_receive`, `after_send`, `client_job`)
were verbose and did not identify which process emitted them.

**Approach**: Standardize to `"CA s={site} r={round} recv/send"`
(ClientAgent/subprocess) and `"CJ s={site} t={task} r={round} relay"`
(ClientJob). Short, greppable, and role-attributable.

**Files**: `nvflare/client/ex_process/api.py`,
`nvflare/client/in_process/api.py`,
`nvflare/app_common/executors/client_api_launcher_executor.py`
**Tests**: `tests/unit_test/client/test_logging_and_rss_tags.py` (10
tests)

### Fix 13 — Validate timeout consistency at job start
**Problem**: Timeout mismatches (e.g. `min_download_timeout <
streaming_per_request_timeout`) were silent until a download failed
mid-transfer; the operator had no actionable warning.

**Approach**: `_validate_timeout_config()` called at end of
`initialize()` logs warnings for: `min_download_timeout <
streaming_per_request_timeout`, `submit_result_timeout >
min_download_timeout`, and `max_resends is None`. Uses `log_warning`
(not `raise`) so the job still runs.

**Files**:
`nvflare/app_common/executors/client_api_launcher_executor.py`
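
A minimal sketch of the warning-only checks, with hypothetical parameter names mirroring the list above (the real `_validate_timeout_config()` reads these from config and logs via `log_warning`):

```python
# Sketch of Fix 13's consistency checks: warnings only, never raise, so a
# misconfigured job still runs but the operator gets actionable output.
from typing import Optional

def validate_timeout_config(min_download_timeout: float,
                            per_request_timeout: float,
                            submit_result_timeout: float,
                            max_resends: Optional[int]) -> list:
    warnings = []
    if min_download_timeout < per_request_timeout:
        warnings.append("min_download_timeout < streaming_per_request_timeout")
    if submit_result_timeout > min_download_timeout:
        warnings.append("submit_result_timeout > min_download_timeout")
    if max_resends is None:
        warnings.append("max_resends is None (unbounded retries)")
    return warnings  # caller logs each warning; the job proceeds regardless

assert validate_timeout_config(60, 120, 30, None) == [
    "min_download_timeout < streaming_per_request_timeout",
    "max_resends is None (unbounded retries)",
]
assert validate_timeout_config(120, 60, 30, 3) == []  # consistent config
```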

### Fix 14 (Critical) — Reverse PASS_THROUGH: per-message header, per-call FOBS decode context
**Problem**: PASS_THROUGH was enabled globally on the engine cell's FOBS
context. The engine cell handles all CJ messages — not just task-result
messages from the subprocess, but also Swarm P2P aggregation results
from other trainers. Swarm P2P messages decoded with `PASS_THROUGH=True`
produced `LazyDownloadRef` objects passed to aggregation math
(`weights[k] += v`), crashing with a dtype error.

**Approach**:
- `CellPipe.pass_through_on_send=True` (set in
`ExProcessClientAPI.init()`) stamps `MessageHeaderKey.PASS_THROUGH=True`
on outgoing task-result messages from the subprocess.
- `Adapter.call()` reads the header **per-message** and builds a
per-call FOBS decode context. Only messages explicitly stamped with
`PASS_THROUGH=True` are decoded as `LazyDownloadRef`.
- Swarm P2P messages arrive without the header → `PASS_THROUGH=False` in
decode ctx → tensors decoded normally, no crash.

**Reason**: Per-message header is the only safe granularity; a shared
cell FOBS context cannot distinguish result-relay from P2P aggregation
traffic.

**Files**: `nvflare/fuel/f3/cellnet/cell.py`,
`nvflare/fuel/f3/cellnet/defs.py`,
`nvflare/fuel/utils/pipe/cell_pipe.py`,
`nvflare/client/ex_process/api.py`
**Tests**: `tests/unit_test/fuel/f3/cellnet/test_pass_through_header.py`
(11 tests)
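
The per-message gating can be sketched as follows. The header key mirrors `MessageHeaderKey.PASS_THROUGH`; the message and context shapes are simplified stand-ins for the real FOBS decode context:

```python
# Sketch of Fix 14: the decode context is built from each message's own
# header instead of shared cell-wide state.
PASS_THROUGH = "PASS_THROUGH"

def build_decode_ctx(headers: dict) -> dict:
    # Only messages explicitly stamped PASS_THROUGH=True get lazy decode;
    # anything unstamped (e.g. Swarm P2P aggregation traffic) decodes normally.
    return {"pass_through": bool(headers.get(PASS_THROUGH, False))}

def decode_payload(payload, ctx: dict):
    if ctx["pass_through"]:
        return ("LazyDownloadRef", payload)  # unresolved ref: CJ is a relay
    return payload                           # materialized value: safe for math

task_result = decode_payload("ref-42", build_decode_ctx({PASS_THROUGH: True}))
p2p_result = decode_payload([1.0, 2.0], build_decode_ctx({}))
assert task_result == ("LazyDownloadRef", "ref-42")
assert p2p_result == [1.0, 2.0]  # no crash in weights[k] += v
```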

### Fix 15 — MSG_ROOT_TTL on Swarm P2P learn-task messages
**Problem**: `broadcast_and_wait()` unconditionally overwrote any
pre-set `MSG_ROOT_TTL` with the ACK timeout (~seconds). Swarm P2P
scatter messages carried this short TTL instead of the full
`learn_task_timeout` (hours), causing `_monitor_tx` to kill large-model
download transactions mid-transfer.

**Approach**:
- `SwarmClientController._scatter()` stamps
`ReservedHeaderKey.MSG_ROOT_TTL = float(learn_task_timeout)` on task
data **before** `deepcopy` so the TTL reaches the cell router.
- `task_controller.broadcast_and_wait()` uses an explicit `is not None`
guard to preserve a pre-set TTL instead of overwriting it.

**Files**: `nvflare/apis/impl/task_controller.py`,
`nvflare/app_common/ccwf/swarm_client_ctl.py`
**Tests**: `tests/unit_test/app_common/ccwf/test_msg_root_ttl.py` (6
tests)

### Fix 16 (Critical) — `DOWNLOAD_COMPLETE_CB` gates subprocess exit on server download
**Problem**: The subprocess exited as soon as `send_to_peer()` received
the CJ ACK (a sub-second event after Fix 6). Exiting tore down the
subprocess's `DownloadService` while the server was still mid-download
from it — corrupting or truncating the transfer.

**Approach**:
- `FlareAgent._do_submit_result()` registers a `threading.Event` as
`FOBSContextKey.DOWNLOAD_COMPLETE_CB` in the subprocess cell's FOBS
context before serialization, then waits up to
`download_complete_timeout` seconds for the event to fire.
- `via_downloader._create_downloader()` wires `DOWNLOAD_COMPLETE_CB` as
`transaction_done_cb` on `ObjectDownloader`; the event is set when the
server's final chunk ACKs.

**Files**: `nvflare/client/flare_agent.py`,
`nvflare/fuel/utils/fobs/decomposers/via_downloader.py`,
`nvflare/client/config.py`,
`nvflare/app_common/executors/client_api_launcher_executor.py`,
`nvflare/client/ex_process/api.py`
**Tests**: `tests/unit_test/client/test_download_complete_gating.py` (12
tests)

### Fix 17 — Resolve LazyDownloadRefs before local swarm aggregation
**Problem**: When the swarm aggregator and a trainer ran on the same CJ
site, `_local_submit()` bypassed the network path. The result tensors
remained as `LazyDownloadRef` objects and reached
`shareable_to_learnable()` unevaluated. The aggregation function inside
`shareable_to_learnable()` expected real numpy arrays and failed with a
runtime error when it encountered the opaque ref objects.

**Approach**: Force-resolve all `LazyDownloadRef` objects in the local
result before passing it to `_local_submit()`. This mirrors the implicit
resolution that occurs via cell deserialization on the remote path.

**Files**: `nvflare/app_common/ccwf/swarm_client_ctl.py`
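
The force-resolution can be sketched as a recursive walk; `LazyDownloadRef` and `resolve()` here are simplified stand-ins for the real ref type:

```python
# Sketch of Fix 17: walk the local result and replace any LazyDownloadRef
# with its materialized value before handing it to aggregation.
class LazyDownloadRef:
    def __init__(self, value):
        self._value = value
    def resolve(self):
        return self._value  # the real ref pulls from DownloadService

def resolve_lazy_refs(obj):
    """Recursively materialize refs in dicts/lists; other values pass through."""
    if isinstance(obj, LazyDownloadRef):
        return obj.resolve()
    if isinstance(obj, dict):
        return {k: resolve_lazy_refs(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [resolve_lazy_refs(v) for v in obj]
    return obj

result = {"weights": {"layer1": LazyDownloadRef([0.1, 0.2])}, "meta": 1}
assert resolve_lazy_refs(result) == {"weights": {"layer1": [0.1, 0.2]}, "meta": 1}
```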

### Fix 18 — Forward-path PASS_THROUGH for swarm scatter (feature-flagged, off by default)
**Problem**: When the aggregator scattered the global model to trainer
CJs, each CJ downloaded the full model (copy #1), deep-copied it for the
local learn thread (copy #2), then streamed it to the subprocess. Peak
CJ RSS ≈ 2× model size during scatter with no algorithmic necessity.

**Approach**: Feature-flag that enables `PASS_THROUGH=True` on the
forward scatter path. With the flag on, the CJ acts as a pure relay — it
never materializes the model tensors, forwarding streaming ref-IDs
directly to the subprocess, which pulls the model directly from the
aggregator's `DownloadService`. Eliminates both copies from CJ memory.

**Reason**: Feature-flagged to allow gradual rollout; correctness on all
Swarm topologies requires validation before enabling by default.

**Files**: `nvflare/app_common/ccwf/swarm_client_ctl.py`,
`nvflare/fuel/f3/cellnet/cell.py`

---

## Infrastructure improvements (committed alongside Fix 18)

| # | Change | Files |
|---|--------|-------|
| A | `min_clients: int = 0` on `ServerSideController` — workflow tolerates partial dropout; default 0 = all clients required (backward compatible) | `nvflare/app_common/ccwf/server_ctl.py` |
| B | Subprocess stdout passed through directly via `print()` — no re-logging wrapper, no added timestamp or prefix (subprocess lines already carry their own log formatting) | `nvflare/app_common/launchers/subprocess_launcher.py` |
| C | Split download-gate timing into two log lines: CJ ACK latency + server download duration | `nvflare/client/flare_agent.py` |
| D | `conn_manager.py` shutdown race: set `stopped=True` before executor shutdown; wrap `executor.submit()` in `try/except RuntimeError` | `nvflare/fuel/f3/sfm/conn_manager.py` |
| E | `_Transaction` tracks `start_time` + `total_bytes`; `transaction_done()` logs elapsed and MB transferred | `nvflare/fuel/f3/streaming/download_service.py` |
| F | `logging.captureWarnings(True)` in `apply_log_config()` so `ResourceWarning` / `DeprecationWarning` reach file handlers | `nvflare/fuel/utils/log_utils.py` |
| G | `try/finally: sess.close()` in `internal_submit_job()` — fix gRPC session resource leak | `nvflare/tool/job/job_cli.py` |

---

## Post-Review Fixes

The following issues were identified during local PR review and resolved
in follow-up commits.

### R1 — `min_clients` fault-tolerance: simplified to warning-only (`server_ctl.py`)
**Original approach**: After the configure phase, failed clients were
removed from `participating_clients`, `result_clients`, and
`client_statuses`, a membership-update broadcast was sent, and
`starting_client` was reselected if the original starter had failed (see
R6).

**Rolled back**: The pruning logic was reverted because removing clients
from `participating_clients` mid-startup introduced round-deadlock
scenarios (Gatherer's expected-response set diverged from the live
client set across rounds) and the `starting_client` reselection added
complexity that masked the root fault. The `progress_timeout` sentinel
bug (R5) was also a direct consequence of the pruning code.

**Current behavior**: If `configured_count >= min_clients`, the workflow
proceeds. Failed clients that did not configure are logged as a warning
— they remain in `participating_clients` and may rejoin in a later
round. If `configured_count < min_clients`, the workflow panics as
before.

**Files**: `nvflare/app_common/ccwf/server_ctl.py`

### R2 (Must Fix) — Guard `_validate_timeout_config` imports against `ImportError` (`client_api_launcher_executor.py`)
**Problem**: Both lazy imports inside `_validate_timeout_config` could
raise `ImportError` in constrained environments, crashing `initialize()`
and aborting the job start.

**Fix**: Imports wrapped in `try/except ImportError` with a
`log_warning` fallback.

### R3 (Should Fix) — Gate defensive `LazyDownloadRef` guard behind `forward_pass_through` (`swarm_client_ctl.py`)
**Problem**: The guard in `_end_gather()` called
`_from_shareable(aggr_result)` unconditionally on every aggregation
round, deserializing the full aggregated DXO (with all weight data) just
to inspect types — wasted work in the common non-PASS_THROUGH case.

**Fix**: Guard is now gated behind `if self.forward_pass_through`. Also
removed the redundant local `ReservedHeaderKey` import (already at
module level).

### R4 (Nit) — Remove dead `finalize()` pass-through; rename `_round_count` → `_cj_round_count` (`client_api_launcher_executor.py`)
`finalize()` only called `super().finalize()` and was deleted.
`_round_count` renamed to `_cj_round_count` to avoid a naming collision
with `APISpec._round_count`.

### R5 — Rolled back (was: `progress_timeout` never fires when `min_clients > 0`)
This issue was a direct consequence of the R1 pruning logic
(`overall_last_progress_time = now` made the sentinel equal to current
time). Resolved by rolling back the pruning in R1 — the
`progress_timeout` sentinel is unaffected by the current warning-only
approach.

### R6 — Rolled back (was: Pruned `starting_client` still used)
This issue existed only when failed clients were pruned from
`participating_clients`. Since pruning was rolled back in R1, the
`starting_client` is always one of the configured participants and no
reselection is needed.

### R7 (CI failure) — `test_swarm_memory_gc` stub missing
`forward_pass_through` attribute
**Problem**: The R3 fix gated the `LazyDownloadRef` guard in
`_end_gather()` behind `if self.forward_pass_through`, but
`_make_controller()` in `test_swarm_memory_gc.py` used `__new__` and
never set the attribute — 7 tests failed with `AttributeError`.

**Fix**: Added `ctrl.forward_pass_through = False` to the stub.

### R8 — `min_clients=None` default in `SwarmServerController` crashes job init (`swarm_server_ctl.py`)
**Problem**: `SwarmServerController.__init__` defaulted
`min_clients=None` and passed it directly to
`ServerSideController.__init__()`, which performs `if min_clients < 0:`.
In Python 3, `None < 0` raises `TypeError` — crashing job initialization
whenever `SwarmServerController` is instantiated without an explicit
`min_clients` (e.g. from a JSON config that omits the field).

**Fix**: Changed the default to `min_clients: int = 0` to match
`ServerSideController`'s own default.

**Files**: `nvflare/app_common/ccwf/swarm_server_ctl.py`
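
The crash is easy to reproduce in isolation; the function below is a hypothetical stand-in for `ServerSideController.__init__()`'s validation:

```python
# Minimal reproduction of R8: in Python 3, comparing None with an int
# raises TypeError, so min_clients=None crashed the `if min_clients < 0:`
# validation at job init. The fixed default of 0 compares cleanly.
def validate_min_clients(min_clients: int = 0) -> int:
    if min_clients < 0:  # raises TypeError when min_clients is None
        raise ValueError("min_clients must be >= 0")
    return min_clients

assert validate_min_clients() == 0  # fixed default: fine
try:
    validate_min_clients(None)      # old default: crashes job init
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError for None")
```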

## Test Strategy

| Fix | Test file | Tests |
|-----|-----------|-------|
| Fix 14 (reverse PASS_THROUGH header) | `tests/unit_test/fuel/f3/cellnet/test_pass_through_header.py` | 11 |
| Fix 15 (MSG_ROOT_TTL) | `tests/unit_test/app_common/ccwf/test_msg_root_ttl.py` | 6 |
| Fix 16 (download gating) | `tests/unit_test/client/test_download_complete_gating.py` | 12 |
| Fix 12a/b (logging + RSS tags) | `tests/unit_test/client/test_logging_and_rss_tags.py` | 16 |
| Infra B (subprocess log passthrough) | `tests/unit_test/app_common/launchers/subprocess_launcher_test.py` | 2 |
| Infra A (min_clients fault tolerance) | `tests/unit_test/app_common/ccwf/test_server_ctl_min_clients.py` | 30 |

All existing unit tests pass; no regressions.



### Clarification: PASS_THROUGH Scope in Swarm P2P

1. Reverse path mirrors forward path: `subprocess -> CJ -> aggregator`.
2. The subprocess sets PASS_THROUGH on its outgoing pipe message, so CJ
receives proxy/lazy refs first.
3. CJ then forwards/submits toward the aggregator in Swarm.
4. In Swarm, aggregator may be remote CJ, same node CJ, or same-process
CJ logic path, but subprocess is always a different process from CJ.
5. Aggregation must operate on resolved arrays (not unresolved lazy
refs).


---

## Latest Commit: Review Feedback (cb85eba)

### Addressed
- **Style**: Rename `pt` → `passthrough` variable in `cell.py`
(YuanTingHsieh review)
- **Style**: Remove `_` prefix on `from_shareable` import in
`swarm_client_ctl.py` (nvidianz review)
- **Lint**: Fix all 19 flake8 F401 unused-import warnings across
integration and unit test files
- **Naming**: Rename `test_via_downloader_fix9.py` →
`test_via_downloader_download_timeout.py`

### Open review items (tracked in `docs/fix_list.md`)
- **YT1**: `LazyDownloadRef` guard in `_end_gather()` — change from
`log_error` + recover to `system_panic` + return (agreed, pending
implementation)
- **YT2**: `MSG_ROOT_TTL > task.timeout` leaves DownloadService
transaction alive after send failure — bounded resource leak, tracked as
follow-up
- **YT**: `min_clients` naming collision (job-scheduler vs
fault-tolerance quorum) — needs design decision before H4/H5 wiring
- **YT**: `progress_timeout` interaction with `min_clients` dropout
tolerance — follow-up item
- **nvidianz**: `_decomposer_prefix()` in PT executor — confirmed used
(base class override for `_validate_timeout_config`), keeping
- **nvidianz**: `transaction_done()` info log verbosity — one log per
completed model download; discussing gating strategy
- **nvidianz**: Consolidate timeout params — review which can be
combined or derived from each other

**31 non-test production files** changed in this PR:

**Core APIs (4)**
- `nvflare/apis/fl_constant.py` — `ConfigVarName.MIN_DOWNLOAD_TIMEOUT`
- `nvflare/apis/impl/task_controller.py` — pre-set `MSG_ROOT_TTL`
preservation
- `nvflare/apis/shareable.py` — `ReservedHeaderKey.PASS_THROUGH`
- `nvflare/fuel/f3/cellnet/defs.py` — `MessageHeaderKey.PASS_THROUGH`

**Swarm / CCWF (5)**
- `nvflare/app_common/ccwf/ccwf_job.py` — `min_clients` wiring
- `nvflare/app_common/ccwf/common.py` — membership update topic
- `nvflare/app_common/ccwf/server_ctl.py` — `min_clients` fault
tolerance (warning-only on partial dropout; pruning rolled back)
- `nvflare/app_common/ccwf/swarm_client_ctl.py` — Fix 15/17/18,
`_resolve_lazy_refs`, forward pass-through
- `nvflare/app_common/ccwf/swarm_server_ctl.py` — `min_clients`
forwarding; default fixed to `int = 0`

**Executors / Launchers (3)**
- `nvflare/app_common/executors/client_api_launcher_executor.py` — Fix
3/7/8/10/12/13/16
- `nvflare/app_common/launchers/subprocess_launcher.py` — direct stdout
passthrough (no re-logging)
- `nvflare/app_opt/pt/client_api_launcher_executor.py` — H2 param
forwarding

**Recipe (1)**
- `nvflare/app_opt/pt/recipes/swarm.py` — recipe params

**Client (5)**
- `nvflare/client/config.py` — `submit_result_timeout`, `max_resends`,
`download_complete_timeout`
- `nvflare/client/constants.py` — config key constants
- `nvflare/client/ex_process/api.py` — Fix 12a/12b/14/16
- `nvflare/client/flare_agent.py` — Fix 16 download gating
- `nvflare/client/in_process/api.py` — Fix 12b RSS tags

**Fuel / Streaming (5)**
- `nvflare/fuel/f3/streaming/cacheable.py` — Fix 4 `base_obj` release
- `nvflare/fuel/f3/streaming/download_service.py` — byte accounting,
transaction logging
- `nvflare/fuel/utils/fobs/__init__.py` —
`FOBSContextKey.DOWNLOAD_COMPLETE_CB`
- `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` — Fix 9/16
- `nvflare/fuel/utils/memory_utils.py` — GC logging

**Fuel / Pipe (3)**
- `nvflare/fuel/utils/pipe/cell_pipe.py` — Fix 1/6/8/14
- `nvflare/fuel/utils/pipe/pipe.py` — `release_send_cache` interface
- `nvflare/fuel/utils/pipe/pipe_handler.py` — Fix 2 cache teardown

**Fuel / Cell & Infra (3)**
- `nvflare/fuel/f3/cellnet/cell.py` — Fix 14 per-message PASS_THROUGH
- `nvflare/fuel/f3/sfm/conn_manager.py` — shutdown race fix
- `nvflare/fuel/utils/log_utils.py` — `captureWarnings`

**Private / Other (2)**
- `nvflare/private/aux_runner.py` — PASS_THROUGH header propagation
- `nvflare/tool/job/job_cli.py` — session leak fix

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add missing submit_model executor.

### Description

Add missing submit_model executor to fix FLARE-2758. Added a test and
ensured it passes so this path is not missed again in the future.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes FLARE-2757.

### Description

Cyclic error with serialization; the root cause was not using the proper
controller.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…IA#4250)

### Fix SimEnv to be compatible with strict simulator_run API

The simulator_run API originally rejected any call that provided both
named clients and n_clients, since they are redundant — if you've named
your clients, the count is already known.

This PR fixes SimEnv properly instead:

- sim_env.py: Fix SimEnv.deploy() to pass n_clients only when
self.clients is None (i.e., for ALL_SITES expansion). When explicit
client names are given, n_clients is omitted.
- fed_job_test.py: Update unit tests to reflect the restored strict
behavior.

The root cause was that SimEnv always passed both n_clients and clients
to simulator_run. The correct fix is in SimEnv, not in relaxing the API
contract.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…4252)

Cherry-pick of NVIDIA#4224 from `main` into `2.7`.

Includes the metric filtering, warning de-duplication, and related unit
tests.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Fixes NVIDIA#4244.

### Description

## Problem

When using a dict model config with `SimpleSwarmLearningRecipe`, the
exported `config_fed_client.json` contained an empty `args: {}` for the
persistor's model, losing all constructor arguments (e.g.
`model_name_or_path` for HuggingFace models).

Root cause: introduced in NVIDIA#4130, `_instantiate_model_from_dict` eagerly
instantiated the model from the dict and passed the live `nn.Module` to
`PTFileModelPersistor`. The job serializer (`_get_args`) then
introspected the live object and could not recover the original
constructor args (especially for HuggingFace models where args are
buried in internal config, not stored as plain instance attributes).

## Fix

Remove `_instantiate_model_from_dict` and pass the normalized dict
`{"path": "...", "args": {...}}` directly to `PTFileModelPersistor`. The
persistor already supports dict config and resolves it to an `nn.Module`
at runtime via `instantiate_class()` — no eager instantiation needed at
recipe construction time.

Side benefit: large models (e.g. LLaMA-8B) are no longer loaded into
memory just to export a job config.

## Changes

- `nvflare/app_opt/pt/recipes/swarm.py`: remove
`_instantiate_model_from_dict`, unused `importlib` import, and the
`model_instance` indirection — pass `model` dict directly to
`PTFileModelPersistor`

## Test


`TestSimpleSwarmLearningRecipeExport.test_export_preserves_dict_model_args_in_client_config`
— exports the job and asserts the persistor's model args are present in
the JSON config. This test would have failed before this fix.


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
Fixes FLARE-2760, bug/5949101, bug/5948918.

### Description

Add ProdEnv info to edge example readme (already have it in
docs/user_guide/data_scientist_guide/job_recipe.rst)

Add model_state check to buff_model_manager

Add timeout arg to buff_recipe

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…NVIDIA#4253)

## Summary
- add a FlowerRecipe version check that enforces `flwr>=1.16,<1.26` with
clear runtime errors
- add unit tests for missing/incompatible/compatible flwr versions
- update Flower docs and hello-flower dependency files to document and
install a compatible flwr range
- align Flower executable error hint with the same requirement

## Testing
- PYTHONDONTWRITEBYTECODE=1 pytest tests/unit_test/recipe/spec_test.py
tests/unit_test/recipe/flower_recipe_test.py -q

## Issue
Fixes Flower (flwr) CLI version incompatibility on `2.7`
(`--federation-config` is not available in newer Flower CLI versions).
…ls [docs] (NVIDIA#4264)

## Summary
- clarify TensorFlow recipe docs that class-instance model input should
be a user-defined subclassed Keras model (or dict config)
- add matching caveats for initial_ckpt guidance when a model is
provided
- fix examples/hello-world/hello-tf/README.md model snippet (it
incorrectly showed PyTorch code)

## Scope
- docs-only changes
- no runtime code changes

## Testing
- not run (documentation updates only)
…idation (NVIDIA#4258)

## Summary
- add prominent TenSEAL provisioning guidance to `FedAvgRecipeWithHE`
docstring, including a concrete code example to generate
`server_context.tenseal` and `client_context.tenseal`
- add a SimEnv preflight validation in
`FedAvgRecipeWithHE.process_env()` that checks required context files
and raises a descriptive error instead of failing later with an opaque
missing-file error
- include a direct docs link in the error message:
https://nvflare.readthedocs.io/en/2.7/programming_guide/provisioning_system.html
- add unit tests covering missing-context failure and successful
validation when both files exist

## Testing
- `TMPDIR=$PWD/.tmp_test PYTHONDONTWRITEBYTECODE=1 pytest -q -p
no:cacheprovider
tests/unit_test/recipe/fedavg_he_context_provisioning_test.py` (passes
with skips when `tenseal` is not installed in the environment)

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Fix: Graceful handling when `executorch` is not installed (ET FedBuff)

## Problem

`ETFedBuffRecipe` and `ETTaskProcessor` imported `executorch` at module
level. Any environment without `executorch` installed would get a raw
`ModuleNotFoundError` deep in the import chain — with no indication of
what to install or why.

## Changes

**`nvflare/edge/models/model.py`**
- Replace top-level `from executorch.exir import to_edge` with
`optional_import(...)` — defers the error to call time so importing
`DeviceModel` no longer requires `executorch`

**`nvflare/edge/simulation/et_task_processor.py`**
- Same pattern for `_load_for_executorch_for_training_from_buffer` and
`get_sgd_optimizer` from `executorch.extension.training`

**`nvflare/edge/tools/et_fed_buff_recipe.py`**
- Add explicit guard in `ETFedBuffRecipe.__init__()`: checks
`importlib.util.find_spec("executorch")` and raises
`ImportError("ETFedBuffRecipe requires executorch. See installation
instructions:
https://pytorch.org/executorch/stable/getting-started-setup.html")` —
fires early with actionable guidance instead of a cryptic traceback

**`tests/unit_test/recipe/edge_recipe_test.py`**
- Add `TestETFedBuffRecipeSimBasic` — basic initialization test,
`@pytest.mark.skipif` when `executorch` is absent
- Add
`TestETFedBuffRecipeWithoutExecutorch.test_raises_import_error_without_executorch`
— mocks `find_spec` to return `None` so the negative test always runs
regardless of environment; asserts `ImportError` with `"executorch"` in
the message

## Notes

- HE is intentionally not supported in `SimEnv` — `PocEnv(use_he=True)`
is the correct local environment for HE workflows (it has full
provisioning that generates TenSEAL context files)


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
update example readme

### Description
update example readme; remove non-existent links

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… validate hang, round_timeout (NVIDIA#4270)

## Summary

Fixes six bugs in NVFlare 2.7.2 RC12 affecting Swarm Learning with
`launch_external_process=True`. Implementation reference:
@GeorgeWang-nv's original PR NVIDIA#4263.

- Bug 1 — different implementation from PR NVIDIA#4263
- Bug 2 — accepted the implementation approach from PR NVIDIA#4263 with
small changes
- Bug 3 — different approach
- Bug 4 — added a timeout to the swarm API, consolidating the timeouts
from 4 to 2 and exposing them via the Swarm recipes
- Bug 5 — restored `self._max_resends` (private convention); subprocess
logging fixed (see below)
- Bug 6 — guarded ConnManager against `RuntimeError` after shutdown; 4
regression tests added

**Additional (review feedback):**
- Rename `SimpleSwarmLearningRecipe` → `SwarmLearningRecipe` everywhere;
backward-compat alias kept
- Fix docs: add missing required `min_clients` arg to all
`SwarmLearningRecipe` examples (TypeError for users copying doc
examples)
- CSE inventory scan: take LAST "best" key, not first — prevents initial
checkpoint named "best_*" from shadowing trained best model
- Add `pipe_type` / `pipe_root_path` to `SwarmLearningRecipe` so users
can select `FilePipe` without dropping to low-level API



### Bug 1 — Download ref deleted before tensor receivers finish
(`via_downloader.py`)

**Root cause**: `_create_downloader()` subscribed to msg_root deletion
to clean up download transactions. The msg_root is deleted as soon as
blob envelopes are delivered, but `blob_cb` fires asynchronously —
secondary tensor downloads were still in flight when
`delete_transaction()` removed the ref from `_ref_table`, causing `"no
ref found" FATAL_SYSTEM_ERROR` on large models (≥1B parameters).

**Fix**: Remove `subscribe_to_msg_root` from `_create_downloader()`
entirely. Transaction lifecycle is now managed solely by `_monitor_tx()`
in `download_service.py`, which polls `is_finished()` every 5s and
cleans up only after all chunk downloads complete.

### Bug 2 — CSE fails to load local model after ext-process training
(`cse_client_ctl.py`)

**Root cause**: `_prepare_local_model()` called
`submit_model_executor.execute()` which, for
`ClientAPILauncherExecutor`, launches a fresh subprocess with no trained
model state. The training subprocess had already exited; the model was
already saved to disk by `PTFileModelPersistor`.

**Fix**: Try the persistor first. Inventory key scan prefers the LAST
key containing `"best"` when `model_name` contains `"best"` — taking the
last "best" key avoids mistaking an initial checkpoint named `"best_*"`
(added first by `PTFileModelPersistor`) for the trained best model.
`isinstance` guard replaces `assert isinstance` (safe under `python
-O`). Falls back to executor for in-process mode compatibility.
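
The last-"best"-key scan can be sketched as follows; `pick_model_key` and the key names are illustrative stand-ins for the persistor inventory scan:

```python
# Sketch of the Bug 2 inventory scan: prefer the LAST key containing
# "best" so an initial checkpoint registered first cannot shadow the
# trained best model saved later.
def pick_model_key(inventory_keys, model_name):
    if "best" not in model_name:
        return model_name if model_name in inventory_keys else None
    match = None
    for key in inventory_keys:  # insertion order = registration order
        if "best" in key:
            match = key         # keep overwriting → last "best" wins
    return match

keys = ["best_init_model", "final_model", "best_trained_model"]
assert pick_model_key(keys, "best_model") == "best_trained_model"  # not the initial ckpt
assert pick_model_key(keys, "final_model") == "final_model"
```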

### Bug 3 — Validate results hang 1800s on CSE round 2+
(`flare_agent.py` + `via_downloader.py`)

**Root cause**: `FlareAgent._do_submit_result()` unconditionally waited
1800s for `DOWNLOAD_COMPLETE_CB` after every result send when
`pass_through_on_send=True`. For validate results (metrics only, no
tensors), `_finalize_download_tx()` creates no download transaction and
the callback never fires — subprocess blocks indefinitely.

**Fix**: Thread-local `was_download_initiated()` flag, set by
`_finalize_download_tx()` only when downloadable objects exist. Agent
returns immediately if `False` (validate result). Thread-local is
required because task pipe and metric pipe share the same `CoreCell`
(same `site_name + token + mode` → same FQCN → same `_CellInfo` cache
entry → same `fobs_ctx`); a plain flag would be clobbered by concurrent
metric serialization from a different thread.
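
The thread-local flag pattern can be sketched as below; `finalize_download_tx` / `was_download_initiated` are simplified stand-ins for the real hooks:

```python
# Sketch of Bug 3's fix: the task pipe and metric pipe share one fobs_ctx,
# so a plain boolean could be clobbered by concurrent metric serialization
# on another thread. threading.local keeps each thread's answer separate.
import threading

_tls = threading.local()

def finalize_download_tx(has_downloadable: bool) -> None:
    _tls.download_initiated = has_downloadable

def was_download_initiated() -> bool:
    return getattr(_tls, "download_initiated", False)

results = {}
def worker(name, has_downloadable):
    finalize_download_tx(has_downloadable)
    results[name] = was_download_initiated()

t1 = threading.Thread(target=worker, args=("task", True))      # tensors present
t2 = threading.Thread(target=worker, args=("metrics", False))  # metrics only
t1.start(); t2.start(); t1.join(); t2.join()
assert results == {"task": True, "metrics": False}  # no cross-thread clobbering
```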

### Bug 4 — P2P model ACK timeout too short for large models (`swarm.py`)

**Root cause**: `SwarmClientConfig` hardcodes
`learn_task_ack_timeout=10s` and `final_result_ack_timeout=10s`. For
large models (≥2 GB), P2P tensor streaming takes minutes — the ACK times
out before download completes.

**Fix**: Add `round_timeout: float = 3600` to `SwarmLearningRecipe`.
Wires both `learn_task_ack_timeout` and `final_result_ack_timeout`;
`learn_task_timeout` is intentionally left `None` (unbounded) to avoid
capping per-round training time on slow hardware or 70B+ models.

### Bug 5 — `_max_resends` attribute shadowing (`client_api_launcher_executor.py`)

**Root cause**: `ClientAPILauncherExecutor.__init__()` stored
`max_resends` as `self._max_resends` but `TaskExchanger` (the parent)
uses `self.max_resends`. `PipeHandler` reads `self.max_resends` and saw
`None` (the parent default) instead of the configured value.

**Fix**: Restore `self._max_resends` as the private attribute throughout
`ClientAPILauncherExecutor` (consistent with private-member convention);
the executor builds its own config dict explicitly so it does not rely
on the inherited attribute.
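A minimal reproduction of the shadowing pattern, with heavily simplified stand-in classes (the real classes have many more parameters):

```python
# Simplified reproduction of the attribute-shadowing bug.
class TaskExchanger:
    def __init__(self, max_resends=None):
        self.max_resends = max_resends  # parent reads this public name

    def build_pipe_handler_config(self):
        # PipeHandler-style consumer: reads the parent's public attribute.
        return {"max_resends": self.max_resends}

class BuggyExecutor(TaskExchanger):
    def __init__(self, max_resends=5):
        super().__init__()               # parent default stays None
        self._max_resends = max_resends  # shadowed: parent never sees it

class FixedExecutor(TaskExchanger):
    def __init__(self, max_resends=5):
        super().__init__()
        self._max_resends = max_resends

    def build_pipe_handler_config(self):
        # Fix: build the config explicitly from the private attribute
        # instead of relying on the inherited public one.
        return {"max_resends": self._max_resends}
```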

### Subprocess logging fix (`subprocess_launcher.py` + `ex_process/api.py`)

**Problem**: The subprocess had no logging configuration —
`logger.info()` calls were silently dropped. Wrapping all stdout via
`logger.info()` in the parent caused double-prefixed entries in
`log.txt` for NVFlare-formatted log lines.

**Fix**: `_configure_subprocess_logging()` in `ex_process/api.py` loads
the site's `log_config.json` unchanged, giving the subprocess identical
loggers to the parent (both consoleHandler + file handlers). The
parent's `log_subprocess_output()` now calls `_route_subprocess_line()`
which strips ANSI codes and checks for a `YYYY-MM-DD HH:MM:SS`
timestamp:
- Formatted NVFlare log line → `print()` to terminal only (file handler
already wrote it to `log.txt`)
- Raw `print()` from user training script → `logger.info()` so it
reaches `log.txt`
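The routing decision can be sketched as below. The regexes are assumptions for illustration; the real `_route_subprocess_line()` in `subprocess_launcher.py` may differ in detail.

```python
import re

# Assumed patterns: strip ANSI colour codes, then detect an NVFlare
# "YYYY-MM-DD HH:MM:SS" log-line timestamp at the start of the line.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def route_subprocess_line(line, logger_info, print_fn):
    line = ANSI_RE.sub("", line)
    if TIMESTAMP_RE.match(line):
        # Formatted NVFlare log line: the subprocess file handler already
        # wrote it to log.txt, so only echo it to the terminal.
        print_fn(line)
    else:
        # Raw print() from the user training script: log it so it
        # reaches log.txt.
        logger_info(line)
```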

Regression tests:
`test_log_subprocess_output_formatted_lines_not_double_logged` + 5
`TestRouteSubprocessLine` tests.

### Bug 6 — ConnManager crashes on frame arrival after shutdown (`conn_manager.py`)

**Root cause**: `process_frame_task()` and `start_connector()` could
raise unhandled `RuntimeError` when called after the executor was shut
down, causing noisy tracebacks during job teardown.

**Fix**: Guard `conn_mgr_executor.submit()` with try/except
`RuntimeError` (log debug, skip). Add `if self.stopped: return` early
exit in `process_frame_task()`.
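The guard pattern looks roughly like this (a standalone sketch; the function name is hypothetical and the real code lives inside `ConnManager` methods):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("conn_manager_sketch")

def safe_submit(executor, fn, *args):
    """Submit work, tolerating an already-shut-down executor."""
    try:
        return executor.submit(fn, *args)
    except RuntimeError as ex:
        # A frame arrived after shutdown during job teardown: log at
        # debug level and skip, instead of letting a traceback propagate.
        logger.debug("skipping frame after shutdown: %s", ex)
        return None
```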

**Regression tests**: 4 new tests in
`test_conn_manager_shutdown_race.py`.

### `pipe_type` / `pipe_root_path` for `SwarmLearningRecipe`

`SwarmLearningRecipe` always created a `CellPipe`; users needing
`FilePipe` (restricted networks, third-party integrations) had to drop
to the low-level `CCWFJob` + `SwarmClientConfig` API.

**Change**: Add `pipe_type: str = "cell_pipe"` and `pipe_root_path:
Optional[str] = None`:
- `"cell_pipe"` (default): unchanged — zero-copy PASS_THROUGH
forwarding, ~1 GB RAM overhead
- `"file_pipe"`: `FilePipe` created with `root_path` defaulting to
`{WORKSPACE}/pipe` (resolved at runtime); custom absolute path validated
to exist at recipe construction time
- Warnings for `pipe_root_path` ignored with `cell_pipe`, and
`file_pipe` with `launch_external_process=False`
- `ScriptRunner.__init__` gains `task_pipe` parameter forwarded to
`BaseScriptRunner`
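The recipe-side validation described above can be sketched as a standalone function; the function name, return convention, and exact messages are assumptions, not the actual `SwarmLearningRecipe` code:

```python
import os
import warnings

def validate_pipe_args(pipe_type, pipe_root_path=None, launch_external_process=True):
    """Sketch of the pipe_type / pipe_root_path validation rules."""
    if pipe_type not in ("cell_pipe", "file_pipe"):
        raise ValueError(f"unsupported pipe_type: {pipe_type}")
    if pipe_type == "cell_pipe":
        if pipe_root_path is not None:
            warnings.warn("pipe_root_path is ignored with cell_pipe")
        return None
    # file_pipe path handling
    if not launch_external_process:
        warnings.warn("file_pipe with launch_external_process=False")
    if pipe_root_path is None:
        # Template resolved at runtime relative to the workspace.
        return "{WORKSPACE}/pipe"
    if not os.path.isabs(pipe_root_path) or not os.path.isdir(pipe_root_path):
        raise ValueError(f"pipe_root_path must be an existing absolute path: {pipe_root_path}")
    return pipe_root_path
```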

9 new tests in `TestSwarmLearningRecipePipeType`.

## Files Changed

| File | Bug | Change |
|------|-----|--------|
| `nvflare/fuel/utils/fobs/decomposers/via_downloader.py` | 1 + 3 | Remove `subscribe_to_msg_root`; add thread-local `_tls`, `was_download_initiated()`, `clear_download_initiated()` |
| `nvflare/client/flare_agent.py` | 3 | Check `was_download_initiated()` after `send_to_peer()`; return immediately for validate results |
| `nvflare/app_common/ccwf/cse_client_ctl.py` | 2 | Persistor-first `_prepare_local_model()` with last-"best"-key inventory scan and isinstance guard |
| `nvflare/app_opt/pt/recipes/swarm.py` | 4 + rename + pipe | Add `round_timeout`, `pipe_type`, `pipe_root_path`; rename to `SwarmLearningRecipe`; backward-compat alias kept |
| `nvflare/job_config/script_runner.py` | pipe | Add `task_pipe` param to `ScriptRunner`, forwarded to `BaseScriptRunner` |
| `nvflare/app_common/ccwf/recipes/swarm.py` | rename | Export `SwarmLearningRecipe` alongside `SimpleSwarmLearningRecipe` |
| `nvflare/app_opt/pt/recipes/__init__.py` | rename | Add `SwarmLearningRecipe` to lazy imports and `__all__` |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | 5 | Restore `self._max_resends` (private convention) |
| `nvflare/app_common/launchers/subprocess_launcher.py` | logging | `_route_subprocess_line()`: formatted lines → `print()`, raw lines → `logger.info()` |
| `nvflare/client/ex_process/api.py` | logging | `_configure_subprocess_logging()`: apply site log config unchanged (same handlers as parent) |
| `nvflare/fuel/f3/sfm/conn_manager.py` | 6 | Guard executor submit + frame processing against post-shutdown RuntimeError |
| `docs/programming_guide/memory_management.rst` | docs | Add missing `min_clients` to example |
| `docs/user_guide/data_scientist_guide/available_recipes.rst` | docs | Add missing `min_clients`; rename to `SwarmLearningRecipe` |
| `docs/programming_guide/controllers/client_controlled_workflows.rst` | docs | Add missing `min_clients` (x2) |
| `docs/programming_guide/timeouts.rst` | docs | Rename to `SwarmLearningRecipe` |
| `examples/advanced/swarm_learning/swarm_pt/README.md` | 4 + rename | Add `round_timeout` to config table; rename to `SwarmLearningRecipe` |
| `examples/advanced/swarm_learning/swarm_pt/job.py` | rename | Update import to `SwarmLearningRecipe` |
| `tests/unit_test/fuel/utils/fobs/test_via_downloader_msg_root.py` | 1 | 4 new tests |
| `tests/unit_test/client/test_download_initiated_gating.py` | 3 | 7 new tests |
| `tests/unit_test/app_common/ccwf/test_cse_persistor_fallback.py` | 2 | 10 new tests (incl. initial-ckpt-shadowing regression) |
| `tests/unit_test/fuel/f3/sfm/test_conn_manager_shutdown_race.py` | 6 | 4 new regression tests |
| `tests/unit_test/app_common/launchers/subprocess_launcher_test.py` | logging | 6 new tests for `_route_subprocess_line` and double-logging prevention |
| `tests/unit_test/recipe/swarm_recipe_test.py` | rename + pipe | Updated to `SwarmLearningRecipe`; 9 new `TestSwarmLearningRecipePipeType` tests |
| `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | 5 | Updated: `_max_resends` assertions |
| `tests/unit_test/recipe/component_config_verification_test.py` | rename | Updated to `SwarmLearningRecipe` |

## Test Plan

- [x] 250+ unit tests pass across all affected modules
- [x] Bug 1: `test_via_downloader_msg_root.py` — verifies
`subscribe_to_msg_root` never called, method removed,
`DOWNLOAD_COMPLETE_CB` unaffected
- [x] Bug 2: `test_cse_persistor_fallback.py` — verifies persistor-first
path, last-"best"-key preference, initial-ckpt-shadowing regression,
isinstance guard, all fallback paths
- [x] Bug 3: `test_download_initiated_gating.py` — verifies thread
isolation, validate returns immediately (<1s), train waits,
clear-before-send
- [x] Bug 4: covered by existing recipe and ccwf tests
- [x] Bug 5: `client_api_launcher_executor_test.py` updated
(`_max_resends` assertions)
- [x] Bug 6: `test_conn_manager_shutdown_race.py` — 4 tests:
executor-shutdown no-raise, stop() no-raise, stopped early-return,
Prefix.from_bytes not called
- [x] Logging: `subprocess_launcher_test.py` — double-logging
prevention, ANSI stripping, plain-line capture, partial-timestamp
detection
- [x] Pipe type: `swarm_recipe_test.py` — 9 tests: default cell_pipe,
file_pipe instance, workspace template, custom path, invalid type,
warnings, path validation
- [x] Rename: backward-compat `SimpleSwarmLearningRecipe` import test
passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes 5948918.

### Description

Jobs created via an ET subclass need the ability to set and pass this
parameter.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
### Description

Make integration tests' install_requirements more robust

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>
Fixes # .

### Description

Summary of changes for the ScaffoldRecipe + Client API scaffold_c_diff
KeyError:

1. Defensive check in scaffold_aggregate_fn
File: nvflare/app_common/workflows/scaffold.py

Before using `_result.meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF]`, the
code now checks that the key exists. If it is missing, it raises a
`ValueError` that:

- Names the client (if present in meta)
- States that `scaffold_c_diff` is required in `FLModel.meta`
- Explains that the client must use `PTScaffoldHelper`: `init(model)`, `model_update()` during training, `terms_update()` after training, and send `get_delta_controls()` in meta
- Points to `nvflare.app_opt.pt.scaffold.PTScaffoldHelper`

So instead of an opaque `KeyError: 'scaffold_c_diff'`, users get a clear
message and next steps.

2. Stronger docstring for ScaffoldRecipe
File: nvflare/app_opt/pt/recipes/scaffold.py

The Client script requirement section now explicitly states that:

- Scaffold is not like `FedAvgRecipe`: the client must use `PTScaffoldHelper`
- The client must set `meta[AlgorithmConstants.SCAFFOLD_CTRL_DIFF] = scaffold_helper.get_delta_controls()` in the returned `FLModel`
- A plain `flare.receive`/`send` loop without `PTScaffoldHelper` will cause aggregation to fail

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
## Summary

### Original documentation updates
- **Release notes**: large-model subprocess reliability section, 7 bug fix entries, edge recipe improvements
- **available_recipes**: remove `SimpleSwarmLearningRecipe` alias, add `round_timeout`, subprocess tuning notes
- **timeout_troubleshooting**: subprocess result timeout and Swarm P2P transfer scenarios, new quick-reference rows
- **timeouts.rst**: `max_resends`, `np/tensor_min_download_timeout`, `round_timeout` cross-refs, updated Swarm example
- **memory_management.rst**: client training process memory cleanup section, `NVFLARE_CLIENT_MEMORY_PROFILE` env var
- **client_controlled_workflows.rst**: `round_timeout` in Swarm examples, Client Dropout Tolerance section
- **industry_use_cases.rst**: FAITE, JP Morgan/BNY/RBC, ORNL OLCF, AV, Holoscan, Data Federation Mesh; FLARE Day talk links added

### Broken link and consistency fixes
- **examples/README.md**: remove Section 2 step-by-step examples (notebooks no longer exist), fix CIFAR-10 paths (`cifar10/` → `cifar10/pt/`), fix `lr-newton-raphson` → `hello-world/hello-lr`, remove dead `code-pre-install` entry, renumber sections
- **examples/advanced/README.md**: fix CIFAR-10 paths, remove dead `random_forest`, `brats18`, `prostate`, `fl_hub` entries
- **examples/hello-world/README.md**: remove broken `hello-ccwf` link
- **docs/examples/medical_image_analysis.rst**: fix `brats18`/`prostate` paths (`examples/advanced/` → `research/`)
- **timeout_troubleshooting.rst**: fix LLM example `submit_task_result_timeout` (1200 → 1800) to be ≥ `submit_result_timeout`; clarify note on setting both timeouts consistently

### Web tutorial catalog updates (web/src/components/)
- **gettingStarted.astro**: fix two broken `getting_started/` links — Step 3 now points to `job_api/pt/src/cifar10_fl.py`, Step 4 to readthedocs quickstart
- **series.astro**: fix `getting_started` → `hello-world`, `hello-fedavg` → `hello-pt`, CIFAR-10 paths, `brats18`/`prostate` → `research/`, `xgboost_secure` path
- **tutorials.astro**: add 17 missing examples — Job Recipe, Logging Tutorial, Hello PyTorch Lightning, Hello PyTorch Lightning Eval, Hello Differential Privacy, Hello Flower, Federated Logistic Regression with Newton-Raphson, Split Learning, PSI, Tensor Streaming for LLMs, AMPLIFY Protein Model, Confidential Computing Provision, Cross-Edge FL, BraTS18, FedHCA2, FedOBD, Federated Prostate Segmentation

## Test plan
- [ ] Verify RST renders correctly
- [ ] Check all external links resolve
- [ ] Verify web tutorial catalog links are accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Fix: subprocess cell torn down before server tensor download (large-model deadlock)

### Problem

`big_model_4g` CI test reliably failed with:

```
RuntimeError: failed to download from site-X_active
```

after 1–3 hours of retry loops (300 s SubmitUpdate timeout × 3 download
retries).

**Root cause 1 — ordering bug in `_finalize_external_execution()`:**

The subprocess client API works by keeping the subprocess's CellNet
connection alive
after it sends its result, so the server can pull large tensors directly
from the
subprocess's `DownloadService` (the "reverse PASS_THROUGH" path). To
prevent the
subprocess from exiting before the download completes,
`_do_submit_result()` blocks
on a `download_done.wait(1800 s)` event that fires only after the server
has
finished downloading all tensors.

However, `_finalize_external_execution()` called `stop_task()`
**synchronously**,
which sends `SIGTERM` to the subprocess immediately after `execute()`
receives the
result — before `ClientRunner` had even sent `SubmitUpdate` to the
server, let alone
before the server had started downloading. This tore down the subprocess
cell,
causing every server download attempt to hit `"no connection to
site-X_active"` /
`"cannot forward req: no path"`.

The subprocess-side wait was therefore unreachable: the process was
killed
externally before it could block.

**Root cause 2 — `launch_once=True` subprocess exits after round 1:**

`_do_submit_result()` called `os._exit(0)` unconditionally at the end of
the
`CellPipe + pass_through_on_send` path. For `launch_once=False` (one
subprocess per
round) this is correct — the process should exit immediately after the
download so
the deferred-stop poller unblocks. But for `launch_once=True` (one
subprocess for
all rounds, e.g. `np-loop-cell-pipe`, `pt_client_api_launch_once`), the
subprocess
was killed after round 1, leaving rounds 2–N unhandled and the job
hanging.

---

### Fix

**1. Deferred `stop_task()` (`launcher_executor.py`)**

Instead of calling `stop_task()` synchronously,
`_finalize_external_execution()`
now starts a background thread (when `_stop_task_wait_timeout > 0`) that
polls
`check_run_status()` until the subprocess exits naturally (i.e. after
`download_done` fires and the subprocess unblocks), then calls
`stop_task()`.
`execute()` returns immediately, so `ClientRunner` can send
`SubmitUpdate` and the
server can connect to the still-alive subprocess cell to download
tensors.

**2. Round-boundary coordination (`_deferred_stop_event`)**

Without additional synchronization, the deferred thread from round N
could fire
*after* round N+1's `launch_task()` call. Since `SubprocessLauncher`
guards
`_start_external_process()` with `if self._process is None`, seeing a
not-yet-cleared
reference to the exited round-N process causes it to skip starting a new
subprocess.
Round N+1 then sees `COMPLETE_SUCCESS` immediately and fails with
`"External process has not called flare.init and run status becomes
success"`.

A `threading.Event` (`_deferred_stop_event`, initially set) coordinates
the two:

- **Cleared** in `_finalize_external_execution()` just before the
deferred thread starts.
- **Set** unconditionally in a `finally` block when the deferred thread
completes.
- `_initialize_external_execution()` **waits** on this event before
calling
  `launch_task()`, with a forced `stop_task()` fallback if it times out.

In practice the wait is 0–1 s (subprocess exits naturally after
download, well
before the server completes global aggregation and dispatches the next
round's task),
so there is no meaningful latency impact.
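The deferred-stop thread and the round-boundary event can be condensed into one sketch. Class and method names here are illustrative stand-ins; the real logic is spread across `launcher_executor.py` methods.

```python
import threading
import time

class DeferredStopper:
    """Sketch of deferred stop_task() with round-boundary coordination."""

    def __init__(self, check_run_status, stop_task, poll_interval=0.05):
        self._check_run_status = check_run_status
        self._stop_task = stop_task
        self._poll = poll_interval
        self._deferred_stop_event = threading.Event()
        self._deferred_stop_event.set()  # initially set: no stop in flight

    def finalize(self):
        # Cleared just before the deferred thread starts.
        self._deferred_stop_event.clear()
        threading.Thread(target=self._deferred_stop, daemon=True).start()

    def _deferred_stop(self):
        try:
            # Poll until the subprocess exits naturally (after the
            # server's download completes), then stop the task.
            while self._check_run_status() == "running":
                time.sleep(self._poll)
            self._stop_task()
        finally:
            self._deferred_stop_event.set()  # set unconditionally

    def wait_before_next_launch(self, timeout):
        # The next round's launch waits here, so round N's deferred
        # thread cannot fire after round N+1 starts.
        if not self._deferred_stop_event.wait(timeout):
            self._stop_task()  # forced fallback if the wait times out
```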

**3. `ClientAPILauncherExecutor` opt-in (`client_api_launcher_executor.py`)**

`_stop_task_wait_timeout` is set to `download_complete_timeout` (default
1800 s) in
`ClientAPILauncherExecutor.__init__()`, enabling the deferred path only
for the
subprocess client API where large-tensor downloads are expected. The
base
`LauncherExecutor` defaults to `0.0` (original synchronous behaviour, no
change).

**4. `launch_once`-aware subprocess exit (`flare_agent.py`, `config.py`, `ex_process/api.py`)**

`prepare_config_for_launch()` now writes `launch_once` (derived from
`launcher.needs_deferred_stop()`) into the subprocess config file. The
subprocess
reads it via `ClientConfig.get_launch_once()` and passes it to
`FlareAgent`.
`_do_submit_result()` branches on `_launch_once`:

| `launch_once` | Behaviour after download gate |
|---|---|
| `False` (one subprocess per round, e.g. `pt-client-api`) | `os._exit(0)` called directly — deferred-stop poller unblocks immediately (original behaviour preserved) |
| `True` (one subprocess for all rounds, e.g. `np-loop-cell-pipe`) | `atexit.register(os._exit, 0)` registered once — subprocess continues to next round; `os._exit` fires only when `main()` finally returns, bypassing non-daemon CoreCell thread cleanup |

`_resolve_launch_once()` safely fetches the launcher even when
`self.launcher` is
still `None` at `initialize()` time (resolved directly from the engine
component
registry).

**5. Pipe handler identity guard and safe close (`task_exchanger.py`)**

Pipe handler creation is refactored into `_create_pipe_handler()`, which
binds a
per-handler status callback that checks `self.pipe_handler is _h` before
acting.
This prevents a late `PEER_GONE` from round N's (now-stale) handler from
stopping
round N+1's handler. `stop(close_pipe=False)` is used because
`CellPipe.close()`
is irreversible — closing it in the status callback would prevent the
next round
from communicating. An explicit `self.pipe.close()` is added in
`END_RUN` instead.
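The identity guard boils down to a closure that captures the handler it was created for. The stub classes below are hypothetical simplifications of `TaskExchanger` and `PipeHandler`:

```python
class PipeHandlerStub:
    """Stand-in for PipeHandler: records whether it was stopped."""

    def __init__(self):
        self.stopped = False
        self.status_cb = None

    def stop(self, close_pipe=False):
        self.stopped = True

class Exchanger:
    """Stand-in for TaskExchanger's _create_pipe_handler() pattern."""

    def __init__(self):
        self.pipe_handler = None

    def _create_pipe_handler(self):
        _h = PipeHandlerStub()

        def on_status(msg):
            # Act only if _h is still the current handler: a late
            # PEER_GONE from round N's stale handler must not stop
            # round N+1's handler.
            if self.pipe_handler is _h:
                _h.stop(close_pipe=False)  # CellPipe.close() is irreversible

        _h.status_cb = on_status
        self.pipe_handler = _h
        return _h
```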

**6. Defensive logging and heartbeat error handling (`pipe_handler.py`)**

`_send_to_pipe` now logs `asked_to_stop` and `abort_triggered` status
when a send
is suppressed. Peer-gone detection logs the elapsed time since the last
heartbeat.
Heartbeat sends are wrapped in `try/except` so a broken pipe sets
`asked_to_stop`
and breaks the heartbeat loop cleanly instead of propagating an
unhandled exception.
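The heartbeat hardening reduces to a try/except around the send, sketched below with the loop state passed in explicitly (the real loop in `pipe_handler.py` carries more state and runs in its own thread):

```python
import time

def heartbeat_loop(send_heartbeat, state, interval=0.01):
    """Sketch: a broken pipe ends the loop cleanly instead of raising."""
    while not state["asked_to_stop"]:
        try:
            send_heartbeat()
        except Exception:
            # A broken pipe sets asked_to_stop and breaks the loop
            # instead of propagating an unhandled exception.
            state["asked_to_stop"] = True
            break
        time.sleep(interval)
```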

**7. New unit tests (`test_download_complete_gating.py`, `test_download_initiated_gating.py`)**

Two new test files covering the subprocess download-gating behaviour
(the
`download_done` wait, the validate-path fast return, and `launch_once`).
Both use `FlareAgent.__new__()` to construct minimal agent
stubs. To prevent `os._exit(0)` from killing the pytest-xdist worker:

- An `autouse` `_no_os_exit` fixture patches
`nvflare.client.flare_agent.os._exit`
  to a no-op for every test in the file.
- `_make_agent()` sets `agent._launch_once = False` (the per-round path
that calls
  `os._exit` directly, making the fixture's patch the active guard).

---

### Files changed

| File | Change |
|------|--------|
| `nvflare/app_common/executors/launcher_executor.py` | Deferred `stop_task()` background thread + `_deferred_stop_event` round-boundary coordination |
| `nvflare/app_common/executors/client_api_launcher_executor.py` | Set `_stop_task_wait_timeout = download_complete_timeout`; add `_resolve_launch_once()`; write `LAUNCH_ONCE` to subprocess config |
| `nvflare/app_common/abstract/launcher.py` | `needs_deferred_stop()` abstract method + idempotency/thread-safety note on `stop_task()` |
| `nvflare/app_common/launchers/subprocess_launcher.py` | Implement `needs_deferred_stop()`; add info logging for process start/stop |
| `nvflare/app_common/executors/task_exchanger.py` | Refactor pipe handler creation into `_create_pipe_handler()` with identity-checking callback; use `close_pipe=False` to prevent irreversible `CellPipe.close()` |
| `nvflare/app_common/widgets/metric_relay.py` | Include `msg.data` in pipe status log message |
| `nvflare/fuel/utils/pipe/pipe_handler.py` | Enhanced logging (send failures, peer-gone elapsed time); heartbeat send error handling |
| `nvflare/client/config.py` | Add `LAUNCH_ONCE` config key + `get_launch_once()` |
| `nvflare/client/flare_agent.py` | `launch_once`-aware `_do_submit_result()`: direct `os._exit` vs `atexit` |
| `nvflare/client/ex_process/api.py` | Pass `launch_once` to `FlareAgentWithFLModel` |
| `tests/unit_test/client/test_download_complete_gating.py` | **New** — tests for DOWNLOAD_COMPLETE_CB registration, download-done wait, timeout, status logging, and cleanup |
| `tests/unit_test/client/test_download_initiated_gating.py` | **New** — tests for thread-local download-initiation detection (validate-path fast return, no spurious 1800 s wait) |
| `tests/unit_test/app_common/executors/client_api_launcher_executor_test.py` | New tests for deferred stop, `_deferred_stop_event`, `_stop_task_wait_timeout` |



### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
…68367)

The FOBS Packer.unpack() path validated the decomposer name against
BUILTIN_DECOMPOSERS but never validated the attacker-controlled
fobs_type field before passing it to load_class(). An authenticated
FL participant could set fobs_dc to a trusted builtin decomposer
(e.g. DataClassDecomposer) and fobs_type to an arbitrary Python class
path, causing importlib.import_module() to execute import-time code on
the server (full RCE, CVSS 8.8).

Fix:
- Add BUILTIN_TYPES allowlist in builtin_decomposers.py containing all
  NVFlare data-classes and enum types that are legitimately deserialized
  via generic decomposers (DataClassDecomposer / EnumTypeDecomposer).
- Introduce _type_name_whitelist in fobs.py, pre-seeded from
  BUILTIN_TYPES, that gates load_class(type_name) in Packer.unpack().
- register_data_classes() and register_enum_types() automatically add
  to the whitelist so app-registered types continue to work.
- Add public add_type_name_whitelist() API for custom job types that
  need to be explicitly allowed before deserialization.
- reset() intentionally does not clear the whitelist so that types
  auto-registered during pack() in the same process remain accessible
  after a reset (preserves existing unit-test round-trip patterns).
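The gate can be condensed into a few lines. The whitelist entries and helper name below are illustrative; the real check lives in `Packer.unpack()` and `load_class()` in the FOBS module.

```python
# Condensed sketch of the whitelist gate described above.
BUILTIN_TYPES = {
    # Illustrative entries; the real allowlist covers all types used
    # across the NVFlare codebase and examples.
    "nvflare.apis.shareable.Shareable",
    "nvflare.apis.dxo.DXO",
}
_type_name_whitelist = set(BUILTIN_TYPES)

def add_type_name_whitelist(*type_names):
    """Public API: extend the whitelist with custom types at runtime."""
    _type_name_whitelist.update(type_names)

def checked_load_class(type_name, registered_decomposers=()):
    if type_name not in _type_name_whitelist and type_name not in registered_decomposers:
        # Reject before importlib.import_module() can run any
        # attacker-chosen import-time code.
        raise ValueError(f"type name not in whitelist: {type_name}")
    # load_class(type_name) would be called here.
    return type_name
```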

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nvidianz
Collaborator Author

Still messed up, closing
