Pull request overview
This PR adds a 3-mode ignore_result_error policy (dynamic/strict/resilient) to key FL controllers via shared utilities, and introduces an ExecEnv.stop() hook to ensure execution environments (notably POC) can be cleaned up after a run.
Changes:
- Added shared utilities for deciding whether to ignore client result errors and for generating consistent log/panic messages.
- Updated BaseModelController and Scatter-and-Gather workflows to support ignore_result_error=None (dynamic) with per-task failure tracking and special handling for unknown/late tasks.
- Added ExecEnv.stop() and wired Run.get_result() to invoke environment cleanup; updated POC env stop logic and tests.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/unit_test/recipe/poc_env_test.py | Updates POC stop test to reflect idempotent running-state checks. |
| tests/unit_test/app_common/utils/error_handling_utils_test.py | New unit tests covering the 3-mode error handling utilities and controller integration points. |
| nvflare/recipe/spec.py | Adds ExecEnv.stop() default no-op cleanup hook. |
| nvflare/recipe/run.py | Calls exec_env.stop() from Run.get_result() (cleanup integration). |
| nvflare/recipe/poc_env.py | Makes POC stop() more explicitly idempotent and optionally removes workspace when already stopped. |
| nvflare/app_common/workflows/scatter_and_gather_scaffold.py | Changes default ignore_result_error to None and documents 3-mode behavior. |
| nvflare/app_common/workflows/scatter_and_gather.py | Implements 3-mode result error handling with dynamic tracking and unknown-task behavior. |
| nvflare/app_common/workflows/scaffold.py | Updates docstring to describe new ignore_result_error semantics. |
| nvflare/app_common/workflows/base_model_controller.py | Implements dynamic error policy tracking/reset per task and propagates accept/reject from _accept_train_result(). |
| nvflare/app_common/utils/error_handling_utils.py | New shared utility functions for ignore decision + message formatting. |
| nvflare/app_common/utils/__init__.py | Exposes the new utility functions via package exports. |
    # Store task context for dynamic ignore_result_error mode
    num_targets = len(targets) if targets else len(self.engine.get_clients())
    self._current_min_responses = min_responses if min_responses > 0 else num_targets
In broadcast_model(), num_targets = len(targets) if targets else len(self.engine.get_clients()) treats an empty targets=[] the same as targets=None (all clients). That makes the dynamic ignore_result_error=None tolerance math incorrect, and it also conflicts with later logic that passes targets through to broadcast_and_wait. Use an explicit targets is not None check (and consider validating that targets is non-empty if empty is not meaningful).
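A minimal sketch of the suggested check, written as standalone helpers so it can run without NVFlare. The names `count_targets` and `effective_min_responses` are hypothetical and not part of the PR; rejecting an empty list with `ValueError` is one possible choice, assuming an empty `targets` is not meaningful.

```python
from typing import List, Optional


def count_targets(targets: Optional[List[str]], all_clients: List[str]) -> int:
    """Return how many clients a broadcast addresses.

    Hypothetical helper: None means "all clients"; an empty list is rejected
    instead of silently falling back to all clients.
    """
    if targets is not None:
        if not targets:
            raise ValueError("targets must be non-empty when provided")
        return len(targets)
    return len(all_clients)


def effective_min_responses(min_responses: int, num_targets: int) -> int:
    """Same rule as the reviewed snippet: a non-positive value means all targets."""
    return min_responses if min_responses > 0 else num_targets
```

With this shape, `targets=None` and `targets=[]` can no longer be confused, so the dynamic tolerance math sees the same target count that is later passed on to the broadcast call.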
    # Now try to convert result to FLModel
    try:
        result_model = FLModelUtils.from_shareable(result)
        result_model.meta["props"] = client_task.task.props[AppConstants.META_DATA]
        result_model.meta["client_name"] = client_name
_process_result() converts the raw Shareable to FLModel before calling _accept_train_result(), and then converts it again inside the new try/except block. This duplicates work and (more importantly) the first conversion is unguarded and can raise for errored/invalid results that should have been rejected by _accept_train_result() first. Convert only once, after _accept_train_result() returns True, and keep the conversion inside the try/except.
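The ordering requested here can be sketched with plain callables standing in for the real pieces. `process_result`, `accept`, `convert`, and `on_error` are hypothetical names for illustration; in the PR the roles would be played by `_accept_train_result()` and `FLModelUtils.from_shareable()`, whose exact signatures are not assumed here.

```python
from typing import Any, Callable, Optional


def process_result(
    raw: Any,
    accept: Callable[[Any], bool],
    convert: Callable[[Any], Any],
    on_error: Callable[[str], None],
) -> Optional[Any]:
    """Accept first, then convert exactly once inside a guarded block."""
    # Policy check runs on the raw result; nothing is converted yet.
    if not accept(raw):
        return None
    try:
        # Single conversion, only for results that passed the policy check.
        return convert(raw)
    except Exception as e:
        on_error(f"failed to convert result: {e}")
        return None
```

A rejected result never reaches `convert`, and a conversion failure is reported through `on_error` rather than propagating out of the callback.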
      train_task_name=AppConstants.TASK_TRAIN,
      train_timeout: int = 0,
    - ignore_result_error: bool = False,
    + ignore_result_error: bool = None,
The parameter is typed as bool but now defaults to None. Update the annotation to Optional[bool] (or bool | None) to match the new 3-mode behavior and avoid type-checker/IDE inconsistencies.
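For illustration, the corrected annotation might look like the following. `ControllerSketch` is a hypothetical stand-in, not the NVFlare controller, and the runtime type check is an optional extra that the review comment does not ask for.

```python
from typing import Optional


class ControllerSketch:
    """Hypothetical stand-in for a controller constructor; not the NVFlare class."""

    def __init__(self, train_timeout: int = 0, ignore_result_error: Optional[bool] = None):
        # None = dynamic, False = strict, True = resilient.
        if ignore_result_error is not None and not isinstance(ignore_result_error, bool):
            raise TypeError("ignore_result_error must be True, False, or None")
        self.train_timeout = train_timeout
        self.ignore_result_error = ignore_result_error
```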
Greptile Summary
This PR cherry-picks two changes from branch 2.7 to main.
Key issues found:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Controller as ScatterAndGather/<br/>BaseModelController
participant EHU as error_handling_utils
participant Client as FL Client
Controller->>Client: broadcast_and_wait(task)
Note over Controller: Reset _current_failed_clients = set()<br/>Reset _current_num_targets = N
Client-->>Controller: result_received_cb (ClientTask)
Controller->>Controller: _accept_train_result()
alt rc == OK
Controller->>Controller: set TRAINING_RESULT in fl_ctx
Controller-->>Controller: return True (accepted)
else rc != OK
Controller->>EHU: should_ignore_result_error(mode, client, failed_set, N, min)
alt mode == True (Resilient)
EHU-->>Controller: True (ignore)
else mode == False (Strict)
EHU-->>Controller: False (panic)
else mode == None (Dynamic)
EHU->>EHU: failed_clients.add(client_name)
EHU->>EHU: remaining = N - len(failed_clients)
alt remaining >= min_responses
EHU-->>Controller: True (ignore)
else remaining < min_responses
EHU-->>Controller: False (panic)
end
end
Controller->>EHU: get_error_handling_message(mode, ...)
EHU-->>Controller: message string
alt should_ignore == True
Controller->>Controller: log warning(msg)
Controller-->>Controller: return False (rejected, error ignored)
else should_ignore == False
Controller->>Controller: panic(msg) / system_panic(msg)
Controller-->>Controller: return False (rejected, panic triggered)
end
end
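The decision branch in the diagram can be sketched as a small function. This is reconstructed from the diagram only and is not the code in `error_handling_utils.py`; the name `should_ignore_sketch` and the parameter order are assumptions.

```python
from typing import Optional, Set


def should_ignore_sketch(
    mode: Optional[bool],
    client_name: str,
    failed_clients: Set[str],
    num_targets: int,
    min_responses: int,
) -> bool:
    """Return True to ignore a client error, False to panic."""
    if mode is True:  # resilient: always ignore
        return True
    if mode is False:  # strict: always panic
        return False
    # Dynamic: record the failure, then check whether enough clients remain.
    failed_clients.add(client_name)
    remaining = num_targets - len(failed_clients)
    return remaining >= min_responses
```

With 5 targets and `min_responses=3`, the first two failures are ignored (4 and then 3 clients remain) and the third triggers a panic (only 2 remain). Because `failed_clients` is a set, a repeated failure from the same client is counted once.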
…up (NVIDIA#4084)

This PR introduces a flexible 3-mode error handling policy for FL controllers and adds proper cleanup for POC environment after job execution.

- **Dynamic `ignore_result_error` mode**: Controllers now support three modes for handling client result errors:
  - `None` (default): **Dynamic mode** - ignore errors if `min_responses`/`min_clients` can still be reached, panic otherwise
  - `False`: **Strict mode** - always panic on any client error
  - `True`: **Resilient mode** - always ignore client errors and continue
- **Shared utility functions**: Created reusable error handling logic to avoid code duplication across controllers
- Added `stop()` method to `ExecEnv` base class (default no-op)
- `Run.get_result()` now calls `exec_env.stop()` in a `finally` block
- Ensures POC services are always stopped after job execution, making each run independent

| File | Changes |
|------|---------|
| `nvflare/app_common/utils/error_handling_utils.py` | **New** - Shared utility functions: `should_ignore_result_error()`, `get_error_handling_message()` |
| `nvflare/app_common/workflows/base_model_controller.py` | Updated `ignore_result_error` default to `None`, added 3-mode logic in `_accept_train_result()`, added tracking for `_current_failed_clients`, `_current_num_targets`, `_current_min_responses` |
| `nvflare/app_common/workflows/scatter_and_gather.py` | Updated `ignore_result_error` default to `None`, added 3-mode logic in `_accept_train_result()`, added tracking variables, updated docstring |
| `nvflare/app_common/workflows/scatter_and_gather_scaffold.py` | Updated `ignore_result_error` signature and docstring |
| `nvflare/app_common/workflows/scaffold.py` | Updated `ignore_result_error` docstring |
| `nvflare/recipe/spec.py` | Added `stop()` method to `ExecEnv` base class |
| `nvflare/recipe/run.py` | Updated `get_result()` to call `exec_env.stop()` in finally block |
| `tests/unit_test/app_common/utils/error_handling_utils_test.py` | **New** - 17 unit tests for error handling logic |

Controllers that inherit from `BaseModelController` automatically get the new behavior:
- `FedAvg`, `BaseFedAvg`, `Scaffold`, `Cyclic`, `PTFedAvg`, `FedOpt`, etc.

Controllers with direct changes:
- `ScatterAndGather`, `ScatterAndGatherScaffold`

- [x] Added 17 unit tests for `ignore_result_error` logic
- [x] All tests pass
- [x] Code formatted with black, isort, flake8
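The cleanup hook described in this commit message can be illustrated roughly as follows. All class names are hypothetical stand-ins (the real `ExecEnv` and `Run` have more to them), and `fetch` is an assumed callable that represents whatever retrieves the job result.

```python
class ExecEnvSketch:
    """Stand-in base class: stop() is a default no-op cleanup hook."""

    def stop(self) -> None:
        pass


class CountingEnv(ExecEnvSketch):
    """Stand-in for an environment with real cleanup; it only counts stop() calls."""

    def __init__(self) -> None:
        self.stop_calls = 0

    def stop(self) -> None:
        self.stop_calls += 1


class RunSketch:
    def __init__(self, exec_env: ExecEnvSketch, fetch) -> None:
        self.exec_env = exec_env
        self._fetch = fetch  # hypothetical callable returning the job result

    def get_result(self):
        try:
            return self._fetch()
        finally:
            # Runs on success and on failure, so the environment is always stopped.
            self.exec_env.stop()
```

Putting the call in `finally` is what makes each run independent: the environment is stopped even when fetching the result raises.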
## Summary

This PR fixes an issue with the `Run` class where calling `get_status()` after `get_result()` would fail because the POC environment was already stopped. It also ensures proper cleanup of the POC workspace.

### Changes

**`nvflare/recipe/run.py`**
- Added state caching: `_stopped`, `_cached_status`, `_cached_result` attributes
- Added proper logging using `get_obj_logger(self)`
- `get_result()` now caches status before stopping POC for later retrieval
- `get_status()` returns cached status after POC is stopped
- `abort()` is a no-op after POC is stopped

**`nvflare/recipe/spec.py`**
- Updated `ExecEnv.stop()` signature to accept `clean_up: bool = False` parameter

**`nvflare/recipe/poc_env.py`**
- Renamed parameter `clean_poc` to `clean_up` for consistency
- Workspace removal now only happens when `clean_up=True` (was unconditional before)

**`tests/unit_test/recipe/run_test.py`**
- Updated 13 unit tests to verify new behavior

## Test Plan

- [x] 13 unit tests pass for `Run` class
- [x] Integration test verified with hello-numpy example using PocEnv
  - Job executes and completes successfully
  - Status is cached correctly (`FINISHED:COMPLETED`)
  - Result is cached for subsequent calls
  - POC workspace is cleaned up after `get_result()`

---------

Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>
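A rough sketch of the caching behavior listed above, under stated assumptions: `FakeEnv`, `CachingRunSketch`, and the two `live_*` callables are hypothetical, and calling `stop(clean_up=True)` from `get_result()` is inferred from the test plan rather than confirmed by the source.

```python
from typing import Any, Callable, Optional


class FakeEnv:
    """Hypothetical environment; stop() takes clean_up as in the described signature."""

    def __init__(self) -> None:
        self.running = True
        self.cleaned = False

    def stop(self, clean_up: bool = False) -> None:
        self.running = False
        self.cleaned = self.cleaned or clean_up


class CachingRunSketch:
    def __init__(self, env: FakeEnv, live_status: Callable[[], str], live_result: Callable[[], Any]) -> None:
        self.env = env
        self._live_status = live_status
        self._live_result = live_result
        self._stopped = False
        self._cached_status: Optional[str] = None
        self._cached_result: Any = None
        self.abort_requests = 0

    def get_status(self) -> Optional[str]:
        # After stop, the environment is gone, so answer from the cache.
        return self._cached_status if self._stopped else self._live_status()

    def get_result(self) -> Any:
        if self._stopped:
            return self._cached_result
        try:
            self._cached_result = self._live_result()
            # Cache status while the environment is still running.
            self._cached_status = self._live_status()
        finally:
            self.env.stop(clean_up=True)
            self._stopped = True
        return self._cached_result

    def abort(self) -> None:
        if self._stopped:
            return  # no-op once the environment is stopped
        self.abort_requests += 1
```

Here `get_status()` never touches the environment once `_stopped` is set, which is the failure mode the summary describes.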
86b57ee to 3c63fa2
Description
Cherry-pick 2.7 changes to main #4084 and #4132
Types of changes
./runtest.sh.