Describe the bug
When the Iris controller restarts and restores from an S3 checkpoint, child jobs whose parent rows were not captured in the checkpoint are accepted silently with a broken hierarchy: parent_job_id is set to NULL but depth is computed from the job name path. The root job never appears in the dashboard (which filters depth = 1), even though the child tasks are running correctly.
To Reproduce
- Submit a multi-level job hierarchy (root → depth-2 → depth-3 zephyr tasks).
- Controller OOMs before the hourly checkpoint captures the root and depth-2 rows.
- Running processes (e.g. zephyr) reconnect and resubmit depth-3 child jobs.
- submit_job in transitions.py:980–984 detects the parent is absent and sets parent_job_id = None rather than rejecting the submission.
- launch_job in service.py:846–853 only rejects submissions whose parent is terminated, not missing (_job_state() returns None for absent parents, which passes the guard silently).
- Jobs are inserted with correct depth but null parent_job_id; dashboard query WHERE depth = 1 finds no root → "0 jobs" for that run.
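
For illustration, here is a minimal standalone sketch of why the guard lets a missing parent through. It only mirrors the condition quoted in this report; the function name guard_rejects and the contents of TERMINAL_JOB_STATES are assumptions made for the example, not the actual Iris code.

```python
from typing import Optional

# Assumption: illustrative state names; the real TERMINAL_JOB_STATES may differ.
TERMINAL_JOB_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def guard_rejects(parent_state: Optional[str]) -> bool:
    """Hypothetical mirror of the guard quoted from service.py.

    Returns True when the submission would be rejected.
    """
    return parent_state is not None and parent_state in TERMINAL_JOB_STATES


# A terminated parent is rejected, as intended.
print("terminated parent rejected:", guard_rejects("FAILED"))

# A parent missing from the DB yields a None state, so the first half of
# the condition is False and the submission is accepted.
print("missing parent rejected:", guard_rejects(None))
```

The None case short-circuits the condition, so a missing parent is treated the same as a healthy running one.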
Expected behavior
Submitting a child job whose parent does not exist in the DB should be rejected with FAILED_PRECONDITION, the same as submitting to a terminated parent.
Additional context
- service.py:846: if job_id.parent: / if parent_state is not None and parent_state in TERMINAL_JOB_STATES (missing branch for parent_state is None)
- transitions.py:983: if parent_exists is None: parent_job_id = None, which silently drops the link instead of raising
- Fix: in launch_job, treat parent_state is None (parent absent) as a hard error, matching the terminated-parent case.
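
A rough sketch of what the proposed check could look like, written as a standalone function so it runs on its own. Everything here is an assumption for illustration: check_parent, the FailedPrecondition exception (standing in for however the service raises FAILED_PRECONDITION), the dict used in place of the jobs table, and the state names.

```python
from typing import Mapping, Optional

# Assumption: illustrative state names; the real TERMINAL_JOB_STATES may differ.
TERMINAL_JOB_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


class FailedPrecondition(Exception):
    """Hypothetical stand-in for the FAILED_PRECONDITION RPC error."""


def check_parent(parent_id: Optional[str], job_states: Mapping[str, str]) -> None:
    """Reject a child submission whose parent is absent or terminated.

    job_states is a hypothetical stand-in for the jobs table; a lookup
    returns None for an absent parent, as _job_state() is described to do.
    """
    if parent_id is None:
        return  # Root job: nothing to validate.

    parent_state = job_states.get(parent_id)
    if parent_state is None:
        # New branch: a missing parent is a hard error, not a silent pass.
        raise FailedPrecondition(f"parent job {parent_id} does not exist")
    if parent_state in TERMINAL_JOB_STATES:
        raise FailedPrecondition(
            f"parent job {parent_id} has terminated ({parent_state})"
        )
```

With this shape, a resubmitted depth-3 job whose parent was lost in the restore fails fast at submission time instead of being stored with a null parent_job_id.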