Job status tracking is currently fragile and relies on symbols being passed around.
Clear fragility issues:
- Using PBS, `job_status` tries two `qstat` formats and returns `:RUNNING` on any error, silently masking failures as running jobs.
- Using Slurm, empty `squeue` output maps to `:COMPLETED`, conflating "job finished" with "job ID no longer tracked by the scheduler".
- Failure states are asymmetric: PBS returns `:FAILED`; Slurm does not.
We should:
- store the job status as an enum instead of a symbol
- store full iteration results in a struct
- harden `job_status` calls, potentially dispatching on the backends
mutable struct IterationResult
    iter::Int
    member_states::Vector{JobState}  # one entry per submitted ensemble member
    n_failed::Int
    n_rerun::Int
end
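
As a rough sketch of how the enum and backend dispatch could fit together (not a finished design): the names `JobState`, `UNKNOWN`, `AbstractBackend`, `PBSBackend`, `SlurmBackend`, `parse_job_state`, `safe_job_state`, and `count_failed` are all hypothetical. The sketch assumes the Slurm state arrives as the long-form name (for example from `squeue -h -o %T`) and that the PBS state is a single `job_state` letter that has already been extracted from `qstat` output. The exact letter set differs between PBS flavours, so the mapping below would need checking against the scheduler actually in use.

```julia
# All names below are illustrative assumptions, not existing API.
@enum JobState begin
    PENDING
    RUNNING
    COMPLETED
    FAILED
    UNKNOWN  # query failed, or the scheduler no longer tracks the job ID
end

abstract type AbstractBackend end
struct PBSBackend <: AbstractBackend end
struct SlurmBackend <: AbstractBackend end

# Slurm: empty output means "not tracked", which is not the same as "finished".
function parse_job_state(::SlurmBackend, output::AbstractString)
    state = strip(output)
    isempty(state) && return UNKNOWN
    state in ("PENDING", "CONFIGURING") && return PENDING
    state in ("RUNNING", "COMPLETING") && return RUNNING
    state == "COMPLETED" && return COMPLETED
    state in ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY") && return FAILED
    return UNKNOWN
end

# PBS: a finished job only counts as COMPLETED when its exit status is known
# to be zero. The letter mapping is an assumption to verify per installation.
function parse_job_state(::PBSBackend, state_code::AbstractString;
                         exit_status::Union{Nothing,Int} = nothing)
    code = strip(state_code)
    code in ("Q", "H", "W") && return PENDING
    code in ("R", "E") && return RUNNING
    if code in ("F", "C")
        exit_status === nothing && return UNKNOWN
        return exit_status == 0 ? COMPLETED : FAILED
    end
    return UNKNOWN
end

# Run a scheduler query (any zero-argument function returning the raw state
# text) and report UNKNOWN, with a logged warning, instead of RUNNING on error.
function safe_job_state(query::Function, backend::AbstractBackend)
    try
        return parse_job_state(backend, query())
    catch err
        @warn "job status query failed" backend exception = err
        return UNKNOWN
    end
end

# Derive the failure count from the per-member states.
count_failed(states::AbstractVector{JobState}) = count(==(FAILED), states)
```

With this shape, a scheduler error or a vanished job ID surfaces as `UNKNOWN` rather than being folded into `RUNNING` or `COMPLETED`, and both backends can report `FAILED`. A field such as `n_failed` in `IterationResult` could then be computed from `member_states` (for instance via `count_failed`) rather than tracked separately, though whether to keep it as a stored field is a design choice left open here.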