Cherry-pick [2.7] Fix recipe API bug list and harden recipe behavior (#4228)#4293
Cherry-pick [2.7] Fix recipe API bug list and harden recipe behavior (#4228)#4293YuanTingHsieh wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR is a comprehensive bug-fix and hardening pass for the NVFlare 2.7 Recipe API, addressing nine reported bugs ranging from phantom persistor references to silent per-site config override failures. The changes enforce deterministic model-source contracts, make Key changes and observations:
Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Recipe
participant Job as FedJob
participant Env as ExecEnv
User->>Recipe: execute(env, server_exec_params, client_exec_params)
activate Recipe
Recipe->>Recipe: _temporary_exec_params() [snapshot additional_params]
Recipe->>Job: to_server(server_exec_params)
Recipe->>Job: _add_to_client_apps(client_exec_params)
Recipe->>Recipe: process_env(env)
Recipe->>Env: deploy(job) → job_id
Recipe->>Recipe: [finally] _restore_additional_params(snapshot)
Recipe-->>User: Run(env, job_id)
deactivate Recipe
User->>Recipe: export(job_dir, server_exec_params, client_exec_params)
activate Recipe
Recipe->>Recipe: _temporary_exec_params() [snapshot additional_params]
Recipe->>Job: to_server(server_exec_params)
Recipe->>Job: _add_to_client_apps(client_exec_params)
Recipe->>Job: export_job(job_dir)
Recipe->>Recipe: [finally] _restore_additional_params(snapshot)
deactivate Recipe
Last reviewed commit: 6f78c56 |
There was a problem hiding this comment.
Pull request overview
This PR cherry-picks a set of fixes into the 2.7 branch to harden recipe behavior around model/persistor configuration, initial_ckpt handling, per-site config semantics, and recipe reuse (execute/export), with corresponding regression tests and example updates.
Changes:
- Hardened
CyclicRecipe/FedAvgRecipemodel-source setup (persistor wiring,initial_ckptresolution, per-site config validation). - Made
Recipe.execute()/Recipe.export()reusable-safe via snapshot/restore of temporary execution params. - Added/updated utilities and expanded unit tests + refreshed examples/notebook content.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
nvflare/recipe/utils.py |
Adds helpers for persistor id extraction, initial checkpoint resolution, and stronger CSE idempotency checks; renames script collection helper to public API. |
nvflare/recipe/spec.py |
Adds temporary exec-param isolation to keep recipes reusable across execute() / export(). |
nvflare/recipe/sim_env.py |
Fixes num_clients/num_threads resolution when explicit clients list is provided. |
nvflare/recipe/prod_env.py |
Updates to use renamed collect_non_local_scripts helper. |
nvflare/recipe/poc_env.py |
Tightens validation for num_clients being None and updates script validation helper usage. |
nvflare/recipe/fedavg.py |
Hardens per-site config semantics, fails fast when no model source, and centralizes custom persistor setup. |
nvflare/recipe/cyclic.py |
Stores validated attributes (e.g., min_clients), enforces persistor availability, and hardens wrapper/ckpt behavior. |
nvflare/recipe/__init__.py |
Exposes add_cross_site_evaluation from top-level nvflare.recipe. |
nvflare/app_opt/pt/recipes/fedavg.py |
Refactors persistor setup to support custom persistors and consistent id extraction. |
nvflare/app_opt/pt/recipes/cyclic.py |
Refactors persistor setup and checkpoint resolution; adds explicit PT ckpt-without-model rejection. |
nvflare/app_opt/tf/recipes/fedavg.py |
Refactors persistor setup with shared helpers and checkpoint resolution. |
nvflare/app_opt/tf/recipes/cyclic.py |
Refactors persistor setup and checkpoint resolution; uses shared id extraction. |
tests/unit_test/recipe/utils_test.py |
Adds tests for new persistor helpers, package exports, and CSE idempotency behavior. |
tests/unit_test/recipe/spec_test.py |
Updates for renamed script collector and adds execute/export param-isolation regressions. |
tests/unit_test/recipe/sim_env_test.py |
Adds BUG-3 regression for client/thread derivation from explicit client list. |
tests/unit_test/recipe/poc_env_test.py |
Adds regression ensuring num_clients=None fails with ValueError rather than TypeError. |
tests/unit_test/recipe/fedavg_recipe_test.py |
Adds regressions for reserved per-site targets, falsy override preservation, and persistor/locator wiring. |
tests/unit_test/recipe/cyclic_recipe_test.py |
Expands base cyclic coverage for wrapper/ckpt behavior and persistor-id extraction. |
examples/hello-world/hello-numpy/job.py |
Switches to transfer-type API for update type selection. |
examples/hello-world/hello-cyclic/job.py |
Passes min_clients explicitly. |
examples/tutorials/job_recipe.ipynb |
Updates documentation to reflect relative/absolute checkpoint semantics and environment argument descriptions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
## Issues addressed This PR addresses the reported recipe bug list: - `BUG-8` (Critical): `CyclicRecipe` `model=None + initial_ckpt` created phantom persistor reference and runtime failure. - `BUG-1` (Critical): `CyclicRecipe` dict model config caused `TypeError`. - `BUG-7` (Critical): base `CyclicRecipe` silently ignored `initial_ckpt`. - `BUG-9` (High): `FedAvgRecipe` `per_site_config[\"server\"]` caused invalid target handling/crash. - `BUG-4` (High): `Recipe.execute/export` mutated job state and made recipe reuse unsafe. - `BUG-2` (Medium): `FedAvgRecipe` per-site `or` fallback ignored falsy overrides. - `BUG-3` (Medium): `SimEnv` with `num_clients=0` and client list resolved `num_threads=0`. - `BUG-5` (Low): base `FedAvgRecipe` accepted `initial_ckpt` + `model=None` but produced incomplete model source. - `BUG-6` (Low): `add_cross_site_evaluation` idempotency guard depended only on transient `_cse_added` attribute. ## Changes - Hardened base `CyclicRecipe` model/persistor setup: - fail-fast when a valid persistor cannot be configured; - support dict model config for supported frameworks; - apply/validate `initial_ckpt` for wrapper and framework-specific paths. - Hardened base `FedAvgRecipe`: - validate and reject reserved `per_site_config` targets (`server`, `@ALL`); - preserve falsy per-site override values via explicit `is not None` fallback; - fail-fast when no persistor and no model params are available. - Added shared framework persistor setup utility in recipe utils and wired both Cyclic/FedAvg paths to it. - Made `Recipe.execute()` and `Recipe.export()` reusable-safe by snapshot/restore of temporary execution params. - Fixed `SimEnv` default resolution for `num_clients`/`num_threads` when explicit client list is provided. - Hardened CSE idempotency with workflow-based detection in addition to `_cse_added` fast-path. - Updated impacted examples and exports: - `hello-cyclic` now passes `min_clients` explicitly; - `hello-numpy` uses transfer-type API; - recipe module exports include `add_cross_site_evaluation`. - Expanded recipe unit tests with targeted regressions for all above behaviors. ## Reason for these changes These updates enforce clear model-source contracts and remove silent/implicit fallback behavior that could generate invalid jobs or runtime crashes. The goal is deterministic recipe behavior, safe object reuse, and predictable per-site/CSE configuration semantics in 2.7. ## Affected files Core implementation: - `nvflare/recipe/cyclic.py` - `nvflare/recipe/fedavg.py` - `nvflare/recipe/spec.py` - `nvflare/recipe/sim_env.py` - `nvflare/recipe/utils.py` - `nvflare/recipe/poc_env.py` - `nvflare/recipe/__init__.py` Framework wrappers / examples: - `nvflare/app_opt/pt/recipes/cyclic.py` - `nvflare/app_opt/tf/recipes/cyclic.py` - `examples/hello-world/hello-cyclic/job.py` - `examples/hello-world/hello-numpy/job.py` - `examples/tutorials/job_recipe.ipynb` Tests: - `tests/unit_test/recipe/cyclic_recipe_test.py` - `tests/unit_test/recipe/fedavg_recipe_test.py` - `tests/unit_test/recipe/spec_test.py` - `tests/unit_test/recipe/sim_env_test.py` - `tests/unit_test/recipe/utils_test.py` ## Test strategy - Targeted regression suites: - `pytest -q tests/unit_test/recipe/cyclic_recipe_test.py tests/unit_test/recipe/fedavg_recipe_test.py tests/unit_test/recipe/spec_test.py` - `pytest -q tests/unit_test/recipe/sim_env_test.py tests/unit_test/recipe/utils_test.py` - Full recipe suite: - `pytest -q tests/unit_test/recipe` - Result: recipe suite passed (`183 passed, 13 skipped`). Made with [Cursor](https://cursor.com)
21a8790 to
fffa580
Compare
|
this is a cherry-pick additional comments will be addressed in another PR |
| f"Conflicting checkpoint values: model wrapper has initial_ckpt={existing_ckpt}, " | ||
| f"but recipe initial_ckpt={ckpt_path}." | ||
| ) | ||
| setattr(self.model, "initial_ckpt", ckpt_path) |
There was a problem hiding this comment.
In-place mutation of a caller-provided model wrapper
setattr(self.model, "initial_ckpt", ckpt_path) modifies the wrapper object that the caller passed in. If the same PTModel / TFModel instance is passed to two different recipes (or to execute + export in succession), the second call will observe the initial_ckpt that was injected by the first, which may not be what the caller intended.
Consider either copying the wrapper before mutating it, or documenting clearly that the recipe takes ownership of the wrapper and callers must not reuse the same instance:
import copy
model_wrapper = copy.copy(self.model)
setattr(model_wrapper, "initial_ckpt", ckpt_path)
result = job.to_server(model_wrapper, id="persistor")
return extract_persistor_id(result)A shallow copy is sufficient because only the initial_ckpt attribute is changed.
| if self._pt_model_locator is not None: | ||
| locator_id = job.to_server(self._pt_model_locator, id="locator") | ||
| if isinstance(locator_id, str) and locator_id: | ||
| job.comp_ids["locator_id"] = locator_id |
There was a problem hiding this comment.
locator_id silently dropped when job.to_server returns a dict
job.to_server(self._pt_model_locator, id="locator") can return a dict when the locator implements add_to_fed_job (e.g., a future custom PTFileModelLocator subclass or any locator wrapper). The guard isinstance(locator_id, str) and locator_id would then be False, so comp_ids["locator_id"] is never populated.
add_cross_site_evaluation later reads recipe.job.comp_ids.get("locator_id", "") (via locator_config), and a missing key leads to the locator not being wired up for cross-site evaluation without any error being raised — a silent failure.
Applying extract_persistor_id (or an equivalent extract_component_id helper) here would make the handling consistent with how persistor_id is resolved elsewhere in the same method:
if self._pt_model_locator is not None:
raw = job.to_server(self._pt_model_locator, id="locator")
locator_id = raw if isinstance(raw, str) and raw else (
raw.get("locator_id", "") if isinstance(raw, dict) else ""
)
if locator_id:
job.comp_ids["locator_id"] = locator_id
Issues addressed
This PR addresses the reported recipe bug list:
BUG-8(Critical):CyclicRecipemodel=None + initial_ckptcreated phantom persistor reference and runtime failure.BUG-1(Critical):CyclicRecipedict model config causedTypeError.BUG-7(Critical): baseCyclicRecipesilently ignoredinitial_ckpt.BUG-9(High):FedAvgRecipeper_site_config[\"server\"]caused invalid target handling/crash.BUG-4(High):Recipe.execute/exportmutated job state and made recipe reuse unsafe.BUG-2(Medium):FedAvgRecipeper-siteorfallback ignored falsy overrides.BUG-3(Medium):SimEnvwithnum_clients=0and client list resolvednum_threads=0.BUG-5(Low): baseFedAvgRecipeacceptedinitial_ckpt+model=Nonebut produced incomplete model source.BUG-6(Low):add_cross_site_evaluationidempotency guard depended only on transient_cse_addedattribute.Changes
CyclicRecipemodel/persistor setup:initial_ckptfor wrapper and framework-specific paths.FedAvgRecipe:per_site_configtargets (server,@ALL);is not Nonefallback;Recipe.execute()andRecipe.export()reusable-safe by snapshot/restore of temporary execution params.SimEnvdefault resolution fornum_clients/num_threadswhen explicit client list is provided._cse_addedfast-path.hello-cyclicnow passesmin_clientsexplicitly;hello-numpyuses transfer-type API;add_cross_site_evaluation.Reason for these changes
These updates enforce clear model-source contracts and remove silent/implicit fallback behavior that could generate invalid jobs or runtime crashes. The goal is deterministic recipe behavior, safe object reuse, and predictable per-site/CSE configuration semantics in 2.7.
Affected files
Core implementation:
nvflare/recipe/cyclic.pynvflare/recipe/fedavg.pynvflare/recipe/spec.pynvflare/recipe/sim_env.pynvflare/recipe/utils.pynvflare/recipe/poc_env.pynvflare/recipe/__init__.pyFramework wrappers / examples:
nvflare/app_opt/pt/recipes/cyclic.pynvflare/app_opt/tf/recipes/cyclic.pyexamples/hello-world/hello-cyclic/job.pyexamples/hello-world/hello-numpy/job.pyexamples/tutorials/job_recipe.ipynbTests:
tests/unit_test/recipe/cyclic_recipe_test.pytests/unit_test/recipe/fedavg_recipe_test.pytests/unit_test/recipe/spec_test.pytests/unit_test/recipe/sim_env_test.pytests/unit_test/recipe/utils_test.pyTest strategy
pytest -q tests/unit_test/recipe/cyclic_recipe_test.py tests/unit_test/recipe/fedavg_recipe_test.py tests/unit_test/recipe/spec_test.pypytest -q tests/unit_test/recipe/sim_env_test.py tests/unit_test/recipe/utils_test.pypytest -q tests/unit_test/recipe183 passed, 13 skipped).Made with Cursor
Fixes # .
Description
A few sentences describing the changes proposed in this pull request.
Types of changes
./runtest.sh.