[iris] Child job submission silently accepted when parent is absent from DB #4559

@rjpower

Describe the bug

When the Iris controller restarts and restores from an S3 checkpoint, child jobs whose parent rows were not captured in the checkpoint are accepted silently with a broken hierarchy: parent_job_id is set to NULL, while depth is still computed from the job name path. The dashboard lists runs by filtering on depth = 1, so the root job never appears there, even though the child tasks are running correctly.

To Reproduce

  1. Submit a multi-level job hierarchy (root → depth-2 → depth-3 zephyr tasks).
  2. The controller OOMs before the hourly checkpoint captures the root and depth-2 rows, then restarts and restores from that checkpoint.
  3. Processes that are still running (e.g. zephyr) reconnect and resubmit depth-3 child jobs.
  4. submit_job in transitions.py:980–984 detects that the parent is absent and sets parent_job_id = None rather than rejecting the submission.
  5. launch_job in service.py:846–853 rejects only submissions whose parent is terminated, not missing: _job_state() returns None for an absent parent, which passes the guard silently.
  6. The jobs are inserted with the correct depth but a null parent_job_id; the dashboard query WHERE depth = 1 finds no root, so it shows "0 jobs" for that run.
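
The end state of the steps above can be reproduced in isolation. The following is a minimal sketch only: the jobs table layout, the depth_from_name helper, and the example job name are assumptions made for illustration and are not taken from the Iris schema. It shows why a depth-3 row with a null parent_job_id is invisible to a depth = 1 query.

```python
import sqlite3


def depth_from_name(job_id: str) -> int:
    """Hypothetical helper: depth is the number of path components in the name."""
    return len([part for part in job_id.split("/") if part])


conn = sqlite3.connect(":memory:")
# Assumed, simplified schema; the real Iris table has more columns.
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, parent_job_id TEXT, depth INTEGER)"
)

# After the restore, only the resubmitted depth-3 child exists. Its parent
# link is dropped (NULL) but its depth still comes from the name path.
child = "/root/stage/zephyr-0"
conn.execute(
    "INSERT INTO jobs VALUES (?, ?, ?)", (child, None, depth_from_name(child))
)

# Dashboard-style query: no depth-1 row exists, so the run is not listed.
roots = conn.execute("SELECT job_id FROM jobs WHERE depth = 1").fetchall()

# Diagnostic query: non-root rows without a parent link are orphans.
orphans = conn.execute(
    "SELECT job_id, depth FROM jobs WHERE depth > 1 AND parent_job_id IS NULL"
).fetchall()

print("roots:", roots)
print("orphans:", orphans)
```

A query like the second one (depth > 1 with a null parent) may also be a useful check for finding rows that were already inserted in this state.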

Expected behavior

Submitting a child job whose parent does not exist in the DB should be rejected with FAILED_PRECONDITION, the same as submitting to a terminated parent.

Additional context

  • service.py:846: the guard is if job_id.parent: followed by if parent_state is not None and parent_state in TERMINAL_JOB_STATES; there is no branch for parent_state is None.
  • transitions.py:983: if parent_exists is None: parent_job_id = None silently drops the link instead of raising.
  • Fix: in launch_job, treat parent_state is None (parent absent) as a hard error, matching the terminated-parent case.
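
A rough sketch of the proposed guard, under stated assumptions: JobState, FailedPrecondition, and check_parent are hypothetical stand-ins written for this example, and the lookup callable plays the role of _job_state(). The actual launch_job code will differ, in particular in how it surfaces FAILED_PRECONDITION to the client.

```python
from enum import Enum
from typing import Callable, Optional


class JobState(Enum):
    # Hypothetical subset of states, for illustration only.
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    KILLED = "killed"


TERMINAL_JOB_STATES = frozenset(
    {JobState.SUCCEEDED, JobState.FAILED, JobState.KILLED}
)


class FailedPrecondition(Exception):
    """Stand-in for the RPC error carrying FAILED_PRECONDITION."""


def check_parent(
    parent_id: Optional[str],
    job_state: Callable[[str], Optional[JobState]],
) -> None:
    """Reject a child submission whose parent is missing or terminated."""
    if parent_id is None:
        return  # root job: nothing to validate

    parent_state = job_state(parent_id)
    if parent_state is None:
        # Previously unhandled case: the parent row is absent from the DB.
        raise FailedPrecondition(f"parent job {parent_id} does not exist")
    if parent_state in TERMINAL_JOB_STATES:
        raise FailedPrecondition(
            f"parent job {parent_id} is terminated ({parent_state.value})"
        )
```

With a plain dict as the lookup, check_parent("/missing", states.get) raises FailedPrecondition, while a running parent or a root submission passes. Handling the None case before the terminal-state check keeps the two rejection reasons distinct in the error message.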

Labels: agent-generated, bug