Harden job lifecycle tracking #284

@nefrathenrici

Job status tracking is currently fragile: job states are passed around as bare symbols, and the scheduler backends do not agree on which symbols they return.
Known problems:

  • On PBS, job_status tries two qstat formats and returns :RUNNING on any error, so query failures are silently reported as running jobs.
  • On Slurm, empty squeue output maps to :COMPLETED, conflating "job finished" with "job ID no longer tracked by the scheduler".
  • Failure states are asymmetric: PBS returns :FAILED; Slurm does not.
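
To make the first two points concrete, here is a minimal sketch of classifying a single scheduler query so that "the query failed" and "the scheduler printed nothing" stay separate from "the scheduler reported a state". The names `QueryOutcome` and `classify_query` are hypothetical and do not exist in the codebase; the function only inspects an exit flag and the captured output, so it makes no assumption about how qstat or squeue is actually invoked.

```julia
# Hypothetical outcome of one scheduler query; all names are illustrative only.
@enum QueryOutcome REPORTED NOT_TRACKED QUERY_FAILED

"""
Classify a status query from whether the scheduler command exited cleanly
(`succeeded`) and the raw text it printed (`stdout_text`).
"""
function classify_query(succeeded::Bool, stdout_text::AbstractString)
    # A failed query says nothing about the job, so it must not become "running".
    succeeded || return QUERY_FAILED
    # Empty output only means the scheduler no longer lists the job;
    # that is not the same as the job having finished successfully.
    isempty(strip(stdout_text)) && return NOT_TRACKED
    return REPORTED
end

classify_query(false, "")          # QUERY_FAILED
classify_query(true, "\n")         # NOT_TRACKED
classify_query(true, "RUNNING\n")  # REPORTED
```

With the three outcomes kept apart, the caller can decide explicitly what to do for a failed query or an untracked job ID (retry, consult accounting records, or mark the job as unknown) instead of inheriting a default.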

We should:

  • store the job status as an enum instead of a symbol
  • store full iteration results in a struct
  • harden job_status calls, potentially dispatching on the backends

```julia
mutable struct IterationResult
    iter::Int
    member_states::Vector{JobState}   # one entry per submitted ensemble member
    n_failed::Int
    n_rerun::Int
end
```
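
As a rough sketch of how the three proposals could fit together, the block below defines a `JobState` enum, maps raw scheduler state strings to it by dispatching on a backend type, and fills an `IterationResult`. Everything except the `IterationResult` fields is an assumption made for illustration: the enum members, the `SchedulerBackend`, `SlurmBackend` and `PBSBackend` types, and the `parse_state` and `summarize` functions are hypothetical names, and the lists of recognized state strings are a guess at a reasonable subset rather than a complete mapping. The struct is repeated so that the block runs on its own.

```julia
# Illustrative enum; UNKNOWN is the explicit fallback for anything unrecognized.
@enum JobState PENDING RUNNING COMPLETED FAILED UNKNOWN

# Hypothetical backend types used only for dispatch.
abstract type SchedulerBackend end
struct SlurmBackend <: SchedulerBackend end
struct PBSBackend <: SchedulerBackend end

# Slurm reports long state names; unrecognized or empty input becomes UNKNOWN.
function parse_state(::SlurmBackend, raw::AbstractString)
    s = uppercase(strip(raw))
    s in ("PENDING", "CONFIGURING") && return PENDING
    s in ("RUNNING", "COMPLETING") && return RUNNING
    s == "COMPLETED" && return COMPLETED
    s in ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY") && return FAILED
    return UNKNOWN
end

# PBS reports single-letter states. "F" (finished) alone does not say whether
# the job succeeded, so the exit status decides; without it the state is UNKNOWN.
function parse_state(::PBSBackend, raw::AbstractString;
                     exit_status::Union{Int,Nothing} = nothing)
    s = uppercase(strip(raw))
    s in ("Q", "H", "W") && return PENDING
    s in ("R", "E") && return RUNNING
    if s == "F"
        exit_status === nothing && return UNKNOWN
        return exit_status == 0 ? COMPLETED : FAILED
    end
    return UNKNOWN
end

# Same fields as the struct proposed above.
mutable struct IterationResult
    iter::Int
    member_states::Vector{JobState}   # one entry per submitted ensemble member
    n_failed::Int
    n_rerun::Int
end

# Build a result from per-member states, deriving n_failed from the states.
summarize(iter, states; n_rerun = 0) =
    IterationResult(iter, states, count(==(FAILED), states), n_rerun)

raw_states = ["COMPLETED", "FAILED", "", "RUNNING"]
states = [parse_state(SlurmBackend(), r) for r in raw_states]
result = summarize(3, states)
result.n_failed   # 1; the empty string maps to UNKNOWN, not COMPLETED
```

Because both backends return the same enum, failure handling becomes symmetric, and deriving `n_failed` from `member_states` avoids the two fields drifting apart. Whether `n_failed` should be stored at all or computed on demand is a design choice left open here.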
