
Cherry-pick [2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (#4209)#4288

Open
YuanTingHsieh wants to merge 2 commits into NVIDIA:main from YuanTingHsieh:cherry-pick-4209

Conversation

@YuanTingHsieh
Collaborator

Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in
Round 0 due to a cascading startup failure chain. The root sequence is:

  1. F3 streaming HOL stall (PR #4206, "[2.7] Mitigate F3 streaming Head-of-Line (HOL) stalls and add guardrails") delays deployment ACKs from relay-connected clients
  2. _deploy_job() treats reply=None (timeout) as "unknown" — not a failure — so
    timed-out clients silently appear to have been deployed
  3. _start_run() tries to start those clients; they again time out, and
    check_client_replies() ignores the None reply
  4. _sync_client_jobs() fires dead-job notification on the very first heartbeat with
    no startup grace period
  5. FedAvg requires 144/144 — one or two missing clients → abort
  6. A late-starting CJ crashes with TypeError: 'NoneType' object is not iterable when
    get_job_clients() receives None metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the
transport layer. This PR addresses the remaining job lifecycle layer.


Fixes Included

1 — _deploy_job(): Treat deployment timeout as failure (job_runner.py)

Root bug: reply=None was logged as "unknown" and excluded from failed_clients,
so timed-out clients counted as "successfully deployed" for the min_sites check.

Fix: Add timed-out clients to failed_clients with a "deployment timeout" label.
The existing min_sites / required_sites logic then correctly decides whether to abort.
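A minimal, self-contained sketch of the new classification rule (this is illustrative, not the actual `_deploy_job()` code; the reply shape and the helper name are assumptions — only the `reply=None` → `failed_clients` behaviour and the `"deployment timeout"` label come from this PR):

```python
from typing import Dict, List, Optional, Tuple

def classify_deploy_replies(replies: Dict[str, Optional[dict]]) -> Tuple[List[str], List[str]]:
    """Split per-client deploy replies into (deployed, failed).

    reply=None means the client timed out. Previously that was logged as
    "unknown" and dropped; now it counts as a failure with an explicit label.
    """
    deployed: List[str] = []
    failed: List[str] = []
    for client, reply in replies.items():
        if reply is None:
            # NEW: a timed-out client is a deployment failure, not "unknown"
            failed.append(f"{client} (deployment timeout)")
        elif reply.get("status") == "OK":
            deployed.append(client)
        else:
            failed.append(f"{client} ({reply.get('status', 'error')})")
    return deployed, failed
```

With timed-out clients in `failed`, the unchanged `min_sites` / `required_sites` check sees the true deployed count and can abort (or proceed) correctly.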

2 — check_client_replies(): Return timed-out clients instead of raising (admin.py)

Root bug: In strict mode, any timeout raised immediately, aborting the whole job even
when the remaining active clients satisfied min_sites.

Fix: In strict mode, collect timed-out clients into a return list rather than raising.
Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict
mode to use name-keyed dict lookup instead of fragile positional zip().

New signature: check_client_replies(...) -> List[str] (timed-out client names; empty = none).
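A simplified sketch of the new contract (illustrative only — `ERROR_MSG_PREFIX` is a stand-in for the real constant, and the reply type is an assumption; the name-keyed iteration, collect-instead-of-raise behaviour, and raise-on-explicit-error rule are from this PR):

```python
from typing import Dict, List, Optional

ERROR_MSG_PREFIX = "Error:"  # stand-in for the real error-body prefix constant

def check_client_replies(replies: Dict[str, Optional[str]]) -> List[str]:
    """Return names of timed-out clients instead of raising on timeout.

    Keyed by client name, so reply ordering no longer matters (the old
    positional zip() was order-sensitive). Explicit errors still raise.
    """
    timed_out: List[str] = []
    for name, body in replies.items():
        if body is None:
            timed_out.append(name)  # collect the straggler; caller decides
        elif body.startswith(ERROR_MSG_PREFIX):
            # explicit errors (error body) raise unconditionally, as before
            raise RuntimeError(f"client {name} reported error: {body}")
    return timed_out
```

The caller (`_start_run()`) can then re-evaluate `min_sites` against the returned list rather than losing the whole job to one straggler.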

3 — _start_run(): Selective exclusion with min_sites re-evaluation (job_runner.py)

Root bug: A start-job timeout under strict mode aborted the entire job with no
tolerance for stragglers within min_sites bounds.

Fix: Use the returned timed-out list from check_client_replies(). If remaining
active clients >= min_sites, log a warning and proceed. Only abort when below tolerance.
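The tolerance decision can be sketched as a small pure function (hypothetical helper name and signature; the required-sites-first, then-min-sites ordering mirrors the PR description):

```python
from typing import List, Optional, Sequence

def evaluate_start_timeouts(
    all_clients: Sequence[str],
    timed_out: Sequence[str],
    min_sites: int,
    required_sites: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the active client subset, or raise if tolerance is breached."""
    timed_out_set = set(timed_out)
    active = [c for c in all_clients if c not in timed_out_set]
    missing_required = [c for c in (required_sites or []) if c in timed_out_set]
    if missing_required:
        raise RuntimeError(f"required sites timed out at start: {missing_required}")
    if len(active) < min_sites:
        raise RuntimeError(f"only {len(active)} active clients, below min_sites={min_sites}")
    # within tolerance: caller logs a warning and proceeds with this subset
    return active
```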

4 — _sync_client_jobs(): Require-prior-report default changed to True (fed_server.py)

Root bug: SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT defaulted to False, meaning
the bug fix was opt-in and the unsafe behaviour remained the default.

Fix: Default changed to True. Operators who want the aggressive legacy detection
can set it to False explicitly.
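The debounce gate amounts to one condition; this standalone sketch (function name and parameters are illustrative, not the real `_sync_client_jobs()` code) shows the effect of the flipped default:

```python
def should_notify_dead_job(
    client_token: str,
    reported_clients: set,
    require_previous_report: bool = True,  # NEW default: True
) -> bool:
    """Only declare a client's job dead if that client has positively
    reported the job at least once (startup grace period)."""
    if require_previous_report and client_token not in reported_clients:
        return False  # never seen alive yet: startup delay, not death
    return True
```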

5 — _sync_client_jobs(): Move _reported_clients out of job_info dict (fed_server.py)

Root bug: Positive-observation tracking was stored as job_info["_reported_clients"],
injecting algorithm state into a data dict with no corresponding RunProcessKey constant.

Fix: Tracking moved to self._job_reported_clients: Dict[str, set] on FederatedServer.
Stale entries are purged whenever a job is no longer in run_processes.
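A minimal sketch of the relocated tracking state (illustrative class, not the real `FederatedServer`; the attribute name, lock, and purge-on-absent-from-`run_processes` behaviour follow the PR text):

```python
import threading
from typing import Dict, Set

class ServerTrackingSketch:
    """Positive-observation tracking as a dedicated server attribute,
    instead of a pseudo-key injected into the job_info data dict."""

    def __init__(self):
        self._job_reported_clients: Dict[str, Set[str]] = {}
        self._job_reported_clients_lock = threading.Lock()
        self.run_processes: Dict[str, dict] = {}  # job_id -> job_info

    def record_report(self, job_id: str, client_token: str) -> None:
        with self._job_reported_clients_lock:
            self._job_reported_clients.setdefault(job_id, set()).add(client_token)

    def purge_stale(self) -> None:
        """Drop tracking for jobs no longer in run_processes."""
        with self._job_reported_clients_lock:
            for job_id in list(self._job_reported_clients):
                if job_id not in self.run_processes:
                    del self._job_reported_clients[job_id]
```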

6 — ClientRunManager.get_job_clients(): Explicit meta validation (client_run_manager.py)

Raises RuntimeError with a descriptive message instead of an opaque TypeError when
JOB_CLIENTS is absent or the wrong type.
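The validation can be sketched as follows (standalone illustration; the literal key `"JOB_CLIENTS"` stands in for the real constant, and the messages are paraphrased):

```python
from typing import List

def get_job_clients_checked(job_meta) -> List[str]:
    """Fail loudly with a descriptive RuntimeError instead of letting a
    None or mistyped meta surface as an opaque TypeError downstream."""
    if not isinstance(job_meta, dict):
        raise RuntimeError(f"job meta should be a dict but got {type(job_meta)}")
    clients = job_meta.get("JOB_CLIENTS")
    if not isinstance(clients, list):
        raise RuntimeError(f"JOB_CLIENTS should be a list but got {type(clients)}")
    return clients
```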


Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced; straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections |
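For the first row, the 96-98% guidance translates to a one-liner (illustrative helper; the Frontier numbers from this PR make the arithmetic concrete):

```python
def tolerant_min_clients(num_clients: int, fraction: float = 0.97) -> int:
    """min_clients at ~97% of the fleet, so a handful of startup
    stragglers no longer aborts Round 0."""
    return max(1, int(num_clients * fraction))

# 144 clients at 97% -> min_clients=139, tolerating up to 5 stragglers
```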

Files Changed

  • nvflare/private/fed/server/job_runner.py — _deploy_job() timeout as failure; _start_run() selective exclusion
  • nvflare/private/fed/server/admin.py — check_client_replies() returns timed-out list; dict-keyed non-strict path
  • nvflare/private/fed/server/fed_server.py — _sync_client_jobs() default True; _job_reported_clients attr; stale cleanup
  • nvflare/private/fed/client/client_run_manager.py — explicit meta validation in get_job_clients()

Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| admin_test.py | 8 | Timeout returned, not raised; dict lookup; error still raises; reorder OK |
| job_runner_test.py | 4 | Strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| job_runner_deploy_test.py | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| fed_server_test.py | 5 | Default requires prior report; legacy explicit-False still fires; tracking in server attr, not job_info; stale cleanup |

All 29 targeted unit tests pass.

Test Plan

  • Unit tests for each changed function (positive + negative)
  • New job_runner_deploy_test.py covering deployment timeout classification end-to-end
  • All 29 targeted unit tests pass
  • Hierarchical staging run with all flags at default
  • Hierarchical staging run with strict_start_job_reply_check=true and reduced min_clients
  • Verify no regression on standard (non-hierarchical) FL jobs

Fixes # .


Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 11, 2026 01:18
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR fixes a cascading startup failure chain in large-scale hierarchical FL jobs (e.g. 144 clients, 6 relays on Frontier) by addressing five distinct job lifecycle bugs. The changes span job_runner.py (deployment timeout → failure, selective client exclusion), admin.py (check_client_replies returning timed-out clients instead of raising), fed_server.py (_sync_client_jobs dead-detection debounce and tracking refactor), and client_run_manager.py (explicit metadata validation). 29 targeted unit tests are added or updated across four test files.

Key changes and observations:

  • _deploy_job timeout fix (job_runner.py): reply=None is now correctly added to failed_clients, so the min_sites/required_sites abort logic works as intended rather than silently counting timed-out clients as successful.
  • check_client_replies refactor (admin.py): The positional zip() approach is replaced with a dict-keyed lookup; strict mode returns timed-out client names to the caller instead of raising, enabling selective exclusion. Explicit errors (non-OK return code or ERROR_MSG_PREFIX body) still raise unconditionally.
  • _start_run selective exclusion (job_runner.py): Timed-out clients are excluded from JOB_CLIENTS when active count still meets min_sites. However, start_client_job is called with JOB_CLIENTS=all before timeouts are resolved; the server-side trimming at line 320 happens after the start messages have already been delivered, so active clients' local JOB_CLIENTS contains timed-out participants.
  • Dead-detection debounce default (fed_server.py): SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT defaulted to False (aggressive detection); changed to True (require prior positive heartbeat). This is a default-behaviour change that may affect operators who relied on the old opt-in default. Positive-observation tracking is correctly moved off the job_info dict and onto a dedicated _job_reported_clients server attribute with a threading.Lock.
  • Explicit meta validation (client_run_manager.py): Converts opaque TypeError crashes into descriptive RuntimeError messages.

Confidence Score: 3/5

  • Safe to merge for the targeted bug scenarios; one design trade-off (client-side JOB_CLIENTS includes timed-out participants) warrants discussion before merging to avoid regressions in peer-aware FL algorithms.
  • The five individual fixes are well-reasoned and well-tested (29 targeted unit tests). The strict-mode path in _start_run correctly gates on min_sites and required_sites. However, the start_client_job call sends JOB_CLIENTS=all to active clients before the server-side trimming at line 320 takes effect; active clients initialise their local peer list with timed-out participants, which can cause issues in peer-aware algorithms. Additionally, the default change for SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT from False to True is a silent behavioural change for any deployment not explicitly setting that config variable. Staging validation (end-to-end hierarchical run) listed in the test plan is still pending.
  • Pay closest attention to nvflare/private/fed/server/job_runner.py (lines 264–320) — specifically the ordering of the initial and final JOB_CLIENTS writes relative to start_client_job.

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/private/fed/server/job_runner.py | Two key fixes: deployment timeout classified as failure in _deploy_job, and _start_run updated with selective client exclusion. The JOB_CLIENTS metadata is written before start_client_job (so clients get the full list) but then trimmed to active-only afterward — active clients' local JOB_CLIENTS will include timed-out participants. |
| nvflare/private/fed/server/admin.py | Refactored check_client_replies to use dict-keyed lookup (fixing positional zip bug), return timed-out clients in strict mode instead of raising, and still raise on explicit errors. The early count-guard at line 104 is now slightly redundant in strict mode (covered by the missing_clients check), but causes no correctness issue. |
| nvflare/private/fed/server/fed_server.py | Positive-observation tracking moved from job_info dict to a dedicated _job_reported_clients server attribute with a lock; require_previous_report default changed to True to prevent premature dead-job notifications. Minor: server_jobs snapshot is taken outside the lock but used inside it, which is benign in practice. |
| nvflare/private/fed/client/client_run_manager.py | Added explicit type and presence validation for JOB_META and JOB_CLIENTS in get_job_clients(), converting silent TypeError / AttributeError crashes into descriptive RuntimeError messages. Clean, well-tested fix. |
| nvflare/apis/fl_constant.py | Added two new ConfigVarName constants: STRICT_START_JOB_REPLY_CHECK and SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT. Straightforward addition with no issues. |
| tests/unit_test/private/fed/server/job_runner_deploy_test.py | New file with 9 tests covering _deploy_job timeout classification, min_sites/required_sites abort logic, and integration sequence. Thorough coverage of the deployment timeout path. |
| tests/unit_test/private/fed/server/admin_test.py | 8 new tests cover strict/non-strict mode timeouts, dict-keyed lookup, error prefix detection, and reply reordering. Comprehensive coverage for the check_client_replies refactor. |
| tests/unit_test/private/fed/server/fed_server_test.py | 5 new tests cover the require_previous_report default, legacy fire-immediately path, tracking in server attribute vs. job_info, and stale cleanup. Good coverage of the new _sync_client_jobs behavior. |
| tests/unit_test/private/fed/server/job_runner_test.py | 4 new tests validate the strict flag wiring, tolerance-within-min-sites warning path, breach-min-sites raise, and active JOB_CLIENTS metadata update after timeout exclusion. |

Sequence Diagram

sequenceDiagram
    participant JR as JobRunner
    participant Eng as Engine
    participant CR as check_client_replies
    participant FS as FederatedServer
    participant CRM as ClientRunManager

    Note over JR: _deploy_job()
    JR->>Eng: send_requests_and_get_reply_dict(deploy)
    Eng-->>JR: token_to_reply (None = timeout)
    JR->>JR: reply=None adds to failed_clients (NEW)
    JR->>JR: min_sites / required_sites abort check

    Note over JR: _start_run()
    JR->>JR: job.meta[JOB_CLIENTS] = ALL clients
    JR->>Eng: start_app_on_server(job)
    JR->>Eng: start_client_job sends JOB_CLIENTS=ALL to each client
    Eng-->>JR: replies list
    JR->>CR: check_client_replies(replies, strict=flag)
    alt strict=True
        CR-->>JR: timed_out list
        JR->>JR: required_sites check then min_sites check
        JR->>JR: active_client_sites = all minus timed_out
    else strict=False
        CR-->>JR: empty list
        JR->>JR: rebuild active_client_sites from non-None replies
    end
    JR->>JR: job.meta[JOB_CLIENTS] = ACTIVE only
    Note over JR: clients already received JOB_CLIENTS=ALL

    Note over FS: _sync_client_jobs() on heartbeat
    FS->>FS: server_jobs = run_processes.keys()
    FS->>FS: acquire _job_reported_clients_lock
    FS->>FS: cleanup stale entries
    FS->>FS: record positive observations
    alt require_previous_report=True (NEW default)
        FS->>FS: skip dead-job if token not in reported_clients
    else require_previous_report=False (legacy)
        FS->>Eng: notify_dead_job immediately
    end
    FS->>FS: release lock

    Note over CRM: get_job_clients() on client
    CRM->>CRM: validate job_meta is dict (NEW)
    CRM->>CRM: validate JOB_CLIENTS exists and is list (NEW)
    CRM->>CRM: all_clients may still include timed-out participants

Last reviewed commit: 21e4122

Contributor

Copilot AI left a comment

Pull request overview

This PR cherry-picks a set of fixes to harden hierarchical FL job startup and early lifecycle behavior by properly classifying deployment/start timeouts, allowing selective continuation under min_sites, and debouncing dead-job detection until a client has positively reported the job.

Changes:

  • Treat deployment reply=None as a deployment failure and record a timeout-specific deploy detail label.
  • Extend check_client_replies() to (optionally) return timed-out clients in strict mode and use name-keyed lookup to avoid order sensitivity.
  • Debounce _sync_client_jobs() dead-job reporting via a new default-on “require prior report” mechanism, and add client-side job meta validation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| nvflare/private/fed/server/job_runner.py | Classifies deployment timeouts as failures; adds strict start-job reply handling and timeout-aware min_sites logic. |
| nvflare/private/fed/server/admin.py | Updates check_client_replies() API to return timed-out clients (strict) and fixes non-strict lookup robustness. |
| nvflare/private/fed/server/fed_server.py | Changes dead-job detection to require prior positive report by default; moves tracking to server attribute with cleanup. |
| nvflare/private/fed/client/client_run_manager.py | Validates JOB_CLIENTS job meta presence/type with clearer errors. |
| nvflare/apis/fl_constant.py | Adds config var names for the new server-side safety flags. |
| docs/user_guide/timeout_troubleshooting.rst | Documents the new startup/dead-job safety flags and guidance. |
| docs/programming_guide/timeouts.rst | Adds a dedicated section documenting the new flags and recommended usage. |
| tests/unit_test/private/fed/server/admin_test.py | Adds unit coverage for new check_client_replies() strict/non-strict behaviors. |
| tests/unit_test/private/fed/server/job_runner_test.py | Adds unit coverage for strict flag wiring and start-job timeout tolerance/exclusion behavior. |
| tests/unit_test/private/fed/server/job_runner_deploy_test.py | Adds focused unit coverage for deployment-timeout classification and min_sites/required_sites abort logic. |
| tests/unit_test/private/fed/server/fed_server_test.py | Adds unit coverage for prior-report gating and tracking cleanup behavior in _sync_client_jobs(). |
| tests/unit_test/private/fed/client/client_run_manager_test.py | Adds unit coverage for new meta validation behavior in get_job_clients(). |
| tests/unit_test/private/fed/client/__init__.py | Adds package marker for new client unit tests. |


Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (NVIDIA#4209)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@pcnudde
Collaborator

pcnudde commented Mar 11, 2026

/build
