
Design document for Job Launcher and Job Handle (TA: NVFlare developers) #4282

Open
IsaacYangSLA wants to merge 1 commit into NVIDIA:main from
IsaacYangSLA:job_launcher_handle_design

Conversation

@IsaacYangSLA
Collaborator

Description

This PR adds the abstraction layers for JobLauncher/JobHandle to support three environments (process, Docker, and K8s), and describes how the upper layers (client and server) call the concrete classes to launch a job.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

Copilot AI review requested due to automatic review settings March 9, 2026 22:03
Contributor

Copilot AI left a comment


Pull request overview

Adds a new design document describing NVFlare’s JobLauncherSpec / JobHandleSpec abstractions and the Process/Docker/K8s launcher implementations used by the server and client runtime.

Changes:

  • Introduces a comprehensive design doc for job launching/handling interfaces and lifecycle flow.
  • Documents launcher selection via BEFORE_JOB_LAUNCH event and how return codes are resolved.
  • Describes Process/Docker/K8s launcher implementations and example configurations.


Comment on lines +176 to +180
**Abort path** (`_terminate_job`):

1. Wait up to 10 seconds for the child to exit gracefully (polling `job_handle.poll()`).
2. Call `job_handle.terminate()`.


Copilot AI Mar 9, 2026


The client abort path is described as polling job_handle.poll() for up to 10 seconds before terminating, but _terminate_job() currently waits (by checking run_processes removal) and then unconditionally calls job_handle.terminate()—it does not poll the handle. Please update this section to reflect the current behavior, or adjust the implementation if polling is the intended design.

@greptile-apps
Contributor

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR introduces the JobLauncher/JobHandle abstraction layer enabling NVFlare to launch federated jobs as subprocesses, Docker containers, or Kubernetes pods through a unified event-driven interface, accompanied by a design document and supporting bare-environment stubs.

Key concerns:

  • K8sJobHandle.enter_states() input validation is silently broken — wrapping isinstance(js, JobState) in extra square brackets means all(...) always returns True, so invalid types are never caught
  • K8sJobHandle._make_manifest() can embed None into container["args"] both from DEFAULT_CONTAINER_ARGS_MODULE_ARGS_DICT values and from job_config.get("command") falling back to None; the Kubernetes API rejects pod specs with null args entries
  • BEResourceManager.allocate_resources returns True (bool) instead of the required dict, violating the ResourceManagerSpec contract and breaking any caller that forwards the result to free_resources
  • Several previously-raised issues (poll() returning JobState instead of JobReturnCode, launch_job returning a handle on ApiException, data_pvc_dict None crash, hardcoded 600 s cell-creation timeout) remain unaddressed

Confidence Score: 1/5

  • Not safe to merge — multiple critical runtime bugs in the K8s launcher and resource manager remain unresolved.
  • The K8s launcher has several compounding bugs: type validation in enter_states is a no-op, None values are silently injected into pod args causing immediate API rejection, poll() returns a JobState instead of JobReturnCode meaning return-code comparisons always fail, and launch_job returns a live handle on ApiException rather than None. BEResourceManager.allocate_resources also returns bool instead of dict. Taken together these bugs would cause K8s job launches to fail silently or behave incorrectly end-to-end.
  • nvflare/app_opt/job_launcher/k8s_launcher.py and nvflare/app_common/resource_managers/BE_resource_manager.py require the most attention.

Important Files Changed

Filename Overview
nvflare/app_opt/job_launcher/k8s_launcher.py New K8s job launcher implementation with multiple critical issues: enter_states validation is a no-op due to extra brackets, poll() returns JobState instead of JobReturnCode, launch_job returns a handle on ApiException, None values in container args when defaults are used, and a data_pvc_dict NoneType crash at construction time.
nvflare/apis/job_launcher_spec.py New spec file defining JobHandleSpec, JobLauncherSpec, JobReturnCode, and JobProcessArgs; structure is sound but thread-safety requirements are not documented on JobHandleSpec.
nvflare/app_common/resource_managers/BE_resource_manager.py New bare-environment resource manager stub; allocate_resources returns True (bool) instead of dict, violating the ResourceManagerSpec contract and breaking any caller that passes the return value to free_resources.
nvflare/app_common/resource_consumers/BE_resource_consumer.py New bare-environment resource consumer stub; correctly implements the no-op consume() method.
nvflare/app_common/resource_managers/gpu_resource_manager.py Adds ignore_host parameter to GPUResourceManager, allowing it to skip host GPU validation; logic is sound with proper guard rails around the new flag.
nvflare/private/fed/client/communicator.py Cell-creation timeout increased from 15s to 600s with sleep raised from 0.5s to 10s; values are hardcoded with no configurable override, meaning all environments (including local process mode) now wait up to 10 minutes before surfacing cell init failures.
docs/design/JobLauncher_and_JobHandle.md New design document covering JobLauncher/JobHandle architecture; has several doc-level inconsistencies including job_meta undefined in handle_event snippet, Docker exited→SUCCESS mapping ignoring exit codes, and missing thread-safety requirements for concurrent terminate()/wait() calls.

Sequence Diagram

sequenceDiagram
    participant UL as Upper Layer (ServerEngine / ClientExecutor)
    participant GJL as get_job_launcher()
    participant EV as Event System (BEFORE_JOB_LAUNCH)
    participant JL as JobLauncherSpec (concrete)
    participant JH as JobHandleSpec (concrete)
    participant BG as Background Thread

    UL->>UL: build JOB_PROCESS_ARGS in FLContext
    UL->>GJL: get_job_launcher(job_meta, fl_ctx)
    GJL->>EV: fire BEFORE_JOB_LAUNCH
    EV-->>JL: handle_event → add_launcher(self, fl_ctx)
    GJL-->>UL: return first registered launcher

    UL->>JL: launch_job(job_meta, fl_ctx)
    JL->>JH: create handle (Process / Docker / K8s)
    JL->>JH: enter_states([RUNNING], timeout)
    JL-->>UL: return job_handle (or None on failure)

    UL->>BG: spawn thread → job_handle.wait()
    BG->>JH: wait() — blocks until SUCCEEDED / TERMINATED

    alt job aborted
        UL->>JH: terminate()
        JH-->>UL: None
    end

    BG->>JH: poll() → JobReturnCode
    BG-->>UL: notify completion with return code
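The selection step at the top of this diagram can be sketched as follows. This is a minimal illustration, not the real NVFlare code: `FLContext`, the event constant, and `get_job_launcher` are simplified stand-ins for the actual classes and helper.

```python
BEFORE_JOB_LAUNCH = "BEFORE_JOB_LAUNCH"


class FLContext(dict):
    """Stand-in for NVFlare's FLContext, used here as a simple property bag."""


class ProcessJobLauncher:
    def handle_event(self, event_type, fl_ctx):
        # On BEFORE_JOB_LAUNCH, a launcher that can serve this job
        # registers itself into the context.
        if event_type == BEFORE_JOB_LAUNCH:
            fl_ctx.setdefault("job_launchers", []).append(self)


def get_job_launcher(components, fl_ctx):
    # Fire BEFORE_JOB_LAUNCH at every component; the first launcher
    # that registered itself is returned to the upper layer.
    for component in components:
        component.handle_event(BEFORE_JOB_LAUNCH, fl_ctx)
    launchers = fl_ctx.get("job_launchers", [])
    return launchers[0] if launchers else None
```

With this event-driven shape, swapping the process launcher for a Docker or K8s launcher is a configuration change rather than a code change in the upper layer.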

Comments Outside Diff (1)

  1. nvflare/app_opt/job_launcher/k8s_launcher.py, line 171-172 (link)

    enter_states type-check is always True — validation is a no-op

    The expression all([isinstance(js, JobState)] for js in job_states_to_enter) wraps the boolean in a single-element list before yielding it to all(). Because a non-empty list [False] is truthy in Python, all(...) will always return True regardless of the actual type of js. The ValueError can never be raised, meaning invalid types (e.g., a plain string) passed into enter_states will silently pass the check and only surface as a confusing failure later inside the while loop.

    The fix is to remove the extra square brackets so that the generator yields plain booleans.
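A runnable sketch of the corrected check follows. The `JobState` values here are an illustrative subset, not the exact enum from `nvflare/apis/job_launcher_spec.py`:

```python
from enum import Enum


class JobState(Enum):
    # Illustrative subset of the states used by the job handles
    STARTING = "starting"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    TERMINATED = "terminated"


def validate_job_states(job_states_to_enter):
    # Buggy form: all([isinstance(js, JobState)] for js in job_states_to_enter)
    # yields one-element lists, which are always truthy, so all() is always True.
    # Corrected form: the generator yields plain booleans.
    if not all(isinstance(js, JobState) for js in job_states_to_enter):
        raise ValueError(f"expected JobState values, got {job_states_to_enter!r}")
```

With the brackets removed, passing a plain string such as `"running"` raises `ValueError` immediately instead of failing later inside the polling loop.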

Last reviewed commit: 80cb031

| Method | Signature | Semantics |
|--------|-----------|-----------|
| `terminate()` | `() -> None` | Stop the job immediately. |
| `poll()` | `() -> JobReturnCode` | Non-blocking query for the job's current return code. Returns `UNKNOWN` while still running. |
Collaborator


Why does the running instance return "UNKNOWN" instead of "RUNNING"?

Collaborator Author


This is the 'exit code' of a handle. In the process case, there is no exit code before the process exits, so it is UNKNOWN. If the process exits normally, the exit code is 0 and the high-level status code is SUCCESS. In K8s there is no clear exit code, so once we detect that the pod is in the SUCCEEDED or TERMINATED state, we know it has exited and we set the high-level status code (SUCCESS or TERMINATED).
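For the process case, this behavior can be sketched roughly as follows. The `JobReturnCode` values and the `poll_handle` helper are illustrative, not the exact NVFlare API:

```python
import subprocess
import sys
from enum import Enum


class JobReturnCode(Enum):
    SUCCESS = 0
    EXECUTION_ERROR = 1
    UNKNOWN = 100


def poll_handle(process: subprocess.Popen) -> JobReturnCode:
    """Map a child process's exit code to a high-level return code."""
    rc = process.poll()
    if rc is None:
        # Still running: there is no exit code yet, hence UNKNOWN.
        return JobReturnCode.UNKNOWN
    return JobReturnCode.SUCCESS if rc == 0 else JobReturnCode.EXECUTION_ERROR


p = subprocess.Popen([sys.executable, "-c", "pass"])
p.wait()
print(poll_handle(p).name)  # SUCCESS after a clean exit
```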

@IsaacYangSLA force-pushed the job_launcher_handle_design branch from 00c9d80 to c9657bb (March 10, 2026 16:56)
Comment on lines +33 to +34
def allocate_resources(self, resource_requirement: dict, token: str, fl_ctx: FLContext) -> dict:
return True
Contributor


allocate_resources returns bool instead of dict

ResourceManagerSpec.allocate_resources is declared -> dict (it should return the allocated resources for later de-allocation). Returning True here violates the type contract; callers that try to iterate or look up keys in the returned value will raise a TypeError or AttributeError.

For a bare-environment no-op implementation the correct stub value is an empty dict:

Suggested change
    def allocate_resources(self, resource_requirement: dict, token: str, fl_ctx: FLContext) -> dict:
-       return True
+       return {}


| Docker Status | JobReturnCode |
|---------------|---------------|
| `created` | `UNKNOWN` |
Collaborator


Why can't the status reflect the real status?

Pod phase mapping:

| Pod Phase | JobState | JobReturnCode |
|-----------|----------|---------------|
Collaborator


Same here: why can't the status be starting/running instead of unknown?

Collaborator Author


There are two concepts. One is the state the job is in; the other is the exit code (status code) it returns once the job exits. The pod only has states, so we have to map the state onto both concepts.
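That two-concept mapping can be written down as a table. The enum values and the mapping below are an illustrative sketch based on this discussion, not the exact NVFlare definitions; the pod phase names are the standard Kubernetes ones:

```python
from enum import Enum


class JobState(Enum):
    STARTING = "starting"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    TERMINATED = "terminated"
    UNKNOWN = "unknown"


class JobReturnCode(Enum):
    SUCCESS = 0
    ABORTED = 9
    UNKNOWN = 100


# Each pod phase maps to both concepts: the current JobState, and the
# JobReturnCode that applies only once the pod has exited. Non-terminal
# phases carry UNKNOWN because no exit code exists yet.
POD_PHASE_MAP = {
    "Pending": (JobState.STARTING, JobReturnCode.UNKNOWN),
    "Running": (JobState.RUNNING, JobReturnCode.UNKNOWN),
    "Succeeded": (JobState.SUCCEEDED, JobReturnCode.SUCCESS),
    "Failed": (JobState.TERMINATED, JobReturnCode.ABORTED),
    "Unknown": (JobState.UNKNOWN, JobReturnCode.UNKNOWN),
}
```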

Comment on lines 273 to 277
start = time.time()
while not self.cell:
    self.logger.info("Waiting for the client cell to be created.")
-   if time.time() - start > 15.0:
+   if time.time() - start > 600:
        raise RuntimeError("Client cell could not be created. Failed to login the client.")
Contributor


Cell-creation timeout increased from 15 s → 600 s, sleep from 0.5 s → 10 s

The new values change the failure-detection latency from ~15 seconds to up to 10 minutes. While longer startup times are expected in K8s environments, this change affects all environments (process and Docker launchers included). A client whose cell genuinely fails to initialize (network misconfiguration, port conflict, etc.) will now block silently for 10 minutes before surfacing an error, which significantly degrades the developer/operator experience.

Consider either:

  • making the timeout configurable via a constructor parameter so that each deployment can tune it, or
  • only applying the longer timeout on the K8s code path (e.g., gated by a flag or a subclass override).

Collaborator


These are the needed defaults for large models; Docker and K8s environments can override them with a shorter timeout if needed. @greptileai

Contributor


Understood that large models need longer initialization, but the current implementation doesn't support overriding this timeout—it's hardcoded at 600s in client_registration.

The Communicator.__init__ already has a timeout parameter (used for message timeouts), but the cell-creation timeout is a separate hardcoded value. If the design intent is to allow environment-specific overrides, consider:

Suggested change (make the cell-creation timeout a constructor parameter instead of a hardcoded value):

    def __init__(
        self,
        client_config=None,
        secure_train=False,
        client_state_processors: Optional[List[Filter]] = None,
        compression=None,
        cell: Cell = None,
        client_register_interval=2,
        timeout=5.0,
        maint_msg_timeout=5.0,
        cell_creation_timeout=600.0,
    ):

Then use self.cell_creation_timeout in the registration loop. This would actually enable the configurability you're describing.
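A minimal sketch of that wiring follows. Names are simplified: the real Communicator has many more parameters, and `wait_for_cell` is a hypothetical reduction of the registration loop.

```python
import time


class Communicator:
    def __init__(self, cell=None, cell_creation_timeout=600.0):
        self.cell = cell
        # Configurable instead of the hardcoded 600 s in the loop.
        self.cell_creation_timeout = cell_creation_timeout

    def wait_for_cell(self, poll_interval=0.05):
        start = time.time()
        while not self.cell:
            if time.time() - start > self.cell_creation_timeout:
                raise RuntimeError("Client cell could not be created. Failed to login the client.")
            time.sleep(poll_interval)


# A K8s deployment keeps the long default, while a local process
# launcher can opt into fast failure:
local = Communicator(cell=None, cell_creation_timeout=0.2)
```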

Comment on lines +140 to 147
if job_config.get("module_args") is None:
    self.container_args_module_args_dict = DEFAULT_CONTAINER_ARGS_MODULE_ARGS_DICT
else:
    self.container_args_module_args_dict = job_config.get("module_args")
self.container_args_module_args_dict_as_list = list()
for k, v in self.container_args_module_args_dict.items():
    self.container_args_module_args_dict_as_list.append(k)
    self.container_args_module_args_dict_as_list.append(v)
Contributor


None values in pod container args when default module args are used

When job_config.get("module_args") is None, DEFAULT_CONTAINER_ARGS_MODULE_ARGS_DICT is used. Its values are all None:

DEFAULT_CONTAINER_ARGS_MODULE_ARGS_DICT = {"-m": None, "-w": None, ...}

The loop on lines 144–147 appends both keys and their None values to container_args_module_args_dict_as_list:

self.container_args_module_args_dict_as_list.append(k)   # "-m"
self.container_args_module_args_dict_as_list.append(v)   # None

This produces a list like ["-m", None, "-w", None, ...] that is later placed into container["args"]. The Kubernetes API will reject a pod spec whose args array contains null entries, causing pod creation to fail immediately. The default dict should either not be used (raise an error if module_args is missing) or its None values should be filtered out before constructing the args list:

for k, v in self.container_args_module_args_dict.items():
    if v is not None:
        self.container_args_module_args_dict_as_list.append(k)
        self.container_args_module_args_dict_as_list.append(v)

@IsaacYangSLA force-pushed the job_launcher_handle_design branch from 0b592ec to 246d3a9 (March 10, 2026 21:27)
Comment on lines 285 to +291
try:
-   self.core_v1.create_namespaced_pod(body=job_handle.get_manifest(), namespace=self.namespace)
-   if job_handle.enter_states([JobState.RUNNING], timeout=self.timeout):
-       return job_handle
-   else:
-       job_handle.terminate()
-       return None
-except ApiException:
+   self.core_v1.create_namespaced_pod(body=pod_manifest, namespace=self.namespace)
+   job_handle.enter_states([JobState.RUNNING], timeout=self.timeout)
+   return job_handle
+except ApiException as e:
    job_handle.terminate()
    return None
+return job_handle
Contributor


enter_states timeout not handled — pod-not-RUNNING path silently returns a handle

The previous implementation explicitly checked the return value of enter_states([JobState.RUNNING], ...) and returned None on timeout:

# old code
if job_handle.enter_states([JobState.RUNNING], timeout=self.timeout):
    return job_handle
else:
    job_handle.terminate()
    return None

The new code drops that check entirely:

job_handle.enter_states([JobState.RUNNING], timeout=self.timeout)
return job_handle   # returned even when enter_states() returned False (timeout)

enter_states returns False when timeout is not None and the pod never reaches RUNNING within that window (see K8sJobHandle.enter_states, line 177–178). The caller (ServerEngine / ClientExecutor) stores the returned handle and starts a background thread calling handle.wait(), which would then loop indefinitely on a stuck/unschedulable pod. Meanwhile, the engine believes the job is running normally.

Consider restoring the explicit check, for example:

if not job_handle.enter_states([JobState.RUNNING], timeout=self.timeout):
    job_handle.terminate()
    return None
return job_handle

Comment on lines +236 to +241
with open(data_pvc_file_path, "rt") as f:
    data_pvc_dict = yaml.safe_load(f)
# data_pvc_dict will be pvc: mountPath
# currently, support one pvc and always mount to /var/tmp/nvflare/data
# ie, ignore the mountPath in data_pvc_dict
self.data_pvc = list(data_pvc_dict.keys())[0]
Contributor


data_pvc_dict can be None or empty, causing an unhandled crash

yaml.safe_load() returns None when the file is empty. Calling list(None.keys()) will raise AttributeError. Even if the file is non-empty but has no top-level keys (e.g., just a YAML comment), data_pvc_dict will be None or {}, and list({}.keys())[0] raises IndexError. Neither case is caught, so the launcher crashes at construction time with an opaque traceback rather than a clear error message.

A simple guard would surface the misconfiguration immediately:

with open(data_pvc_file_path, "rt") as f:
    data_pvc_dict = yaml.safe_load(f)
if not data_pvc_dict:
    raise ValueError(f"data_pvc_file_path '{data_pvc_file_path}' is empty or contains no PVC entries.")
self.data_pvc = list(data_pvc_dict.keys())[0]

Job Launcher for K8s environment
GPU, image, pvc updated and working
Add codes

Fix coding style issues
@IsaacYangSLA force-pushed the job_launcher_handle_design branch from 246d3a9 to 80cb031 (March 10, 2026 21:32)
"imagePullPolicy": "Always",
}
]
self.container_args_python_args_list = ["-u", "-m", job_config.get("command")]
Contributor


None silently injected into python args when command is absent

job_config.get("command") returns None if the key is missing, producing ["-u", "-m", None] in container_args_python_args_list. This list is later concatenated directly into container["args"] (line 155–159), which the Kubernetes API will reject because pod args cannot contain null entries.

While K8sJobLauncher.launch_job always passes an explicit "command" key, anyone constructing K8sJobHandle directly (or writing a subclass) will hit this silently. Consider adding an explicit guard:

command = job_config.get("command")
if not command:
    raise ValueError("job_config must contain a non-empty 'command' key")
self.container_args_python_args_list = ["-u", "-m", command]

Comment on lines +33 to +34
def allocate_resources(self, resource_requirement: dict, token: str, fl_ctx: FLContext) -> dict:
return True
Contributor


allocate_resources return type violates ResourceManagerSpec contract

ResourceManagerSpec.allocate_resources is declared -> dict and callers are expected to iterate or store the returned dict for later de-allocation via free_resources. Returning True here breaks the type contract; any caller that passes the return value to free_resources(resources, ...) will receive a TypeError when it tries to iterate over a boolean.

For a bare-environment no-op, the correct stub is an empty dict:

Suggested change
    def allocate_resources(self, resource_requirement: dict, token: str, fl_ctx: FLContext) -> dict:
-       return True
+       return {}

Comment on lines +129 to +130
self._stuck_count = -10
self._max_stuck_count = self.timeout
Contributor


Magic number _stuck_count = -10 unexplained

_stuck_count is initialized to -10, which means the first 10 times a pod is observed in the Pending phase do not count toward the stuck threshold. This acts as an implicit grace period, but the logic is undocumented. Readers must reverse-engineer the intent. Consider extracting this into a named constant or constructor parameter with a comment explaining the grace-period semantics:

_STUCK_COUNT_GRACE_PERIOD = -10  # allow 10 Pending-phase polls before declaring stuck

# in __init__:
self._stuck_count = _STUCK_COUNT_GRACE_PERIOD

