Cherry-pick [2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (#4209) #4288
Conversation
Greptile Summary: This PR fixes a cascading startup failure chain in large-scale hierarchical FL jobs (e.g. 144 clients, 6 relays on Frontier) by addressing five distinct job lifecycle bugs.
Confidence Score: 3/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant JR as JobRunner
    participant Eng as Engine
    participant CR as check_client_replies
    participant FS as FederatedServer
    participant CRM as ClientRunManager
    Note over JR: _deploy_job()
    JR->>Eng: send_requests_and_get_reply_dict(deploy)
    Eng-->>JR: token_to_reply (None = timeout)
    JR->>JR: reply=None adds to failed_clients (NEW)
    JR->>JR: min_sites / required_sites abort check
    Note over JR: _start_run()
    JR->>JR: job.meta[JOB_CLIENTS] = ALL clients
    JR->>Eng: start_app_on_server(job)
    JR->>Eng: start_client_job sends JOB_CLIENTS=ALL to each client
    Eng-->>JR: replies list
    JR->>CR: check_client_replies(replies, strict=flag)
    alt strict=True
        CR-->>JR: timed_out list
        JR->>JR: required_sites check then min_sites check
        JR->>JR: active_client_sites = all minus timed_out
    else strict=False
        CR-->>JR: empty list
        JR->>JR: rebuild active_client_sites from non-None replies
    end
    JR->>JR: job.meta[JOB_CLIENTS] = ACTIVE only
    Note over JR: clients already received JOB_CLIENTS=ALL
    Note over FS: _sync_client_jobs() on heartbeat
    FS->>FS: server_jobs = run_processes.keys()
    FS->>FS: acquire _job_reported_clients_lock
    FS->>FS: cleanup stale entries
    FS->>FS: record positive observations
    alt require_previous_report=True (NEW default)
        FS->>FS: skip dead-job if token not in reported_clients
    else require_previous_report=False (legacy)
        FS->>Eng: notify_dead_job immediately
    end
    FS->>FS: release lock
    Note over CRM: get_job_clients() on client
    CRM->>CRM: validate job_meta is dict (NEW)
    CRM->>CRM: validate JOB_CLIENTS exists and is list (NEW)
    CRM->>CRM: all_clients may still include timed-out participants
```
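The `alt` branch in the diagram above (the require-prior-report gate) can be sketched as follows. All names here are assumptions drawn from the PR description (`_sync_client_jobs`, `_job_reported_clients`, `run_processes`); this is an illustrative model of the debounce logic, not the actual NVFlare implementation:

```python
# Hypothetical model of the dead-job debounce: a client's absence only
# counts as a disappearance if that client previously reported the job.

def sync_client_jobs(server, client_jobs, client_token, require_previous_report=True):
    """Decide, per running job, whether a missing client triggers dead-job handling.

    server._job_reported_clients maps job_id -> set of client tokens that have
    positively reported the job at least once. client_jobs is the set of job ids
    this client reported in its heartbeat. Returns job ids to notify as dead.
    """
    dead = []
    for job_id in list(server.run_processes):
        reported = server._job_reported_clients.setdefault(job_id, set())
        if job_id in client_jobs:
            reported.add(client_token)   # record a positive observation
            continue
        if require_previous_report and client_token not in reported:
            continue                     # debounce: client never reported this job yet
        dead.append(job_id)              # legacy path, or a confirmed disappearance
    # purge tracking for jobs no longer in run_processes (stale-entry cleanup)
    for job_id in list(server._job_reported_clients):
        if job_id not in server.run_processes:
            del server._job_reported_clients[job_id]
    return dead
```

A slow-starting client is therefore ignored on its first heartbeats and only flagged once it has been seen running the job and then stops reporting it.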
Last reviewed commit: 21e4122
Pull request overview
This PR cherry-picks a set of fixes to harden hierarchical FL job startup and early lifecycle behavior by properly classifying deployment/start timeouts, allowing selective continuation under min_sites, and debouncing dead-job detection until a client has positively reported the job.
Changes:
- Treat deployment `reply=None` as a deployment failure and record a timeout-specific deploy detail label.
- Extend `check_client_replies()` to (optionally) return timed-out clients in strict mode and use name-keyed lookup to avoid order sensitivity.
- Debounce `_sync_client_jobs()` dead-job reporting via a new default-on "require prior report" mechanism, and add client-side job meta validation.
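Based on the description above, the revised `check_client_replies()` contract might look like the following sketch. The reply shape (`status` key) and the error message are placeholders, not the actual NVFlare reply types:

```python
# Hypothetical sketch: strict mode collects timeouts into a return value
# instead of raising, so the caller can re-evaluate min_sites.
from typing import Dict, List, Optional


def check_client_replies(replies: Dict[str, Optional[dict]], strict: bool) -> List[str]:
    """Return names of clients whose reply timed out (reply is None).

    Explicit errors (non-OK status) still raise in both modes.
    """
    timed_out = []
    for name, reply in replies.items():  # name-keyed lookup, not positional zip()
        if reply is None:
            if strict:
                timed_out.append(name)   # surface the straggler to the caller
            continue                     # non-strict: ignore the timeout, as before
        if reply.get("status") != "OK":
            raise RuntimeError(f"client {name} returned error: {reply.get('status')}")
    return timed_out
```

Keying by client name also makes the result insensitive to reply ordering, which the positional `zip()` lookup was not.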
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| nvflare/private/fed/server/job_runner.py | Classifies deployment timeouts as failures; adds strict start-job reply handling and timeout-aware min_sites logic. |
| nvflare/private/fed/server/admin.py | Updates check_client_replies() API to return timed-out clients (strict) and fixes non-strict lookup robustness. |
| nvflare/private/fed/server/fed_server.py | Changes dead-job detection to require prior positive report by default; moves tracking to server attribute with cleanup. |
| nvflare/private/fed/client/client_run_manager.py | Validates JOB_CLIENTS job meta presence/type with clearer errors. |
| nvflare/apis/fl_constant.py | Adds config var names for the new server-side safety flags. |
| docs/user_guide/timeout_troubleshooting.rst | Documents the new startup/dead-job safety flags and guidance. |
| docs/programming_guide/timeouts.rst | Adds a dedicated section documenting the new flags and recommended usage. |
| tests/unit_test/private/fed/server/admin_test.py | Adds unit coverage for new check_client_replies() strict/non-strict behaviors. |
| tests/unit_test/private/fed/server/job_runner_test.py | Adds unit coverage for strict flag wiring and start-job timeout tolerance/exclusion behavior. |
| tests/unit_test/private/fed/server/job_runner_deploy_test.py | Adds focused unit coverage for deployment-timeout classification and min_sites/required_sites abort logic. |
| tests/unit_test/private/fed/server/fed_server_test.py | Adds unit coverage for prior-report gating and tracking cleanup behavior in _sync_client_jobs(). |
| tests/unit_test/private/fed/client/client_run_manager_test.py | Adds unit coverage for new meta validation behavior in get_job_clients(). |
| tests/unit_test/private/fed/client/__init__.py | Adds package marker for new client unit tests. |
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is:

1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job

PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort.

### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise.
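A minimal sketch of Fix 1's timeout classification and the `min_sites` / `required_sites` decision. The helper names here are hypothetical; the real logic sits inside `JobRunner._deploy_job()` in `job_runner.py`:

```python
# Illustrative model: a None reply (timeout) is now classified as a failure
# with a "deployment timeout" label, instead of being logged as "unknown".
from typing import Dict, List, Optional, Tuple


def classify_deploy_replies(
    token_to_reply: Dict[str, Optional[dict]],
) -> Tuple[List[str], Dict[str, str]]:
    """Split deploy replies into deployed clients and failed clients with reasons."""
    deployed, failed = [], {}
    for client, reply in token_to_reply.items():
        if reply is None:
            failed[client] = "deployment timeout"  # NEW: timeout counts as failure
        elif reply.get("status") == "OK":
            deployed.append(client)
        else:
            failed[client] = reply.get("status", "error")
    return deployed, failed


def should_abort(deployed: List[str], min_sites: int, required_sites: List[str]) -> bool:
    """Existing min_sites / required_sites decision, applied to the corrected set."""
    if any(site not in deployed for site in required_sites):
        return True
    return len(deployed) < min_sites
```

Because timed-out clients now land in the failed set, the unchanged abort logic sees the true deployed count rather than an inflated one.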
Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`.

New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none).

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance.

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`.

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type.
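Fix 6's validation might look like this sketch. The constant value and error messages are placeholders; the real code is `ClientRunManager.get_job_clients()` in `client_run_manager.py`:

```python
# Hypothetical sketch: fail loudly with RuntimeError on bad job meta,
# instead of crashing later with "'NoneType' object is not iterable".
JOB_CLIENTS = "job_clients"  # placeholder for the fl_constant key


def get_job_clients(job_meta):
    """Return the client list from job meta, validating presence and type."""
    if not isinstance(job_meta, dict):
        raise RuntimeError(f"job meta must be a dict but got {type(job_meta)}")
    clients = job_meta.get(JOB_CLIENTS)
    if not isinstance(clients, list):
        raise RuntimeError(f"job meta is missing a valid '{JOB_CLIENTS}' list: {clients!r}")
    return clients
```

This turns the late-starting-CJ crash from step 6 of the problem sequence into a descriptive error that names the missing key.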
---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging | Disconnects stalled relay connections |

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.
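For reference, the server-side settings recommended above might be supplied as config vars along these lines. This is an illustrative fragment only; the exact key placement depends on the deployment's configuration files, which this PR's docs changes cover:

```json
{
  "runner_sync_timeout": 120,
  "strict_start_job_reply_check": true,
  "sync_client_jobs_require_previous_report": true
}
```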
## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>