Description
Jira: https://asfdaac.atlassian.net/browse/TOOL-3465
Note: The above link is accessible only to members of ASF.
It seems that a HyP3 job may get stuck in PENDING status if the step function ends with an Aborted state. We encountered this with https://hyp3-opera-disp-sandbox.asf.alaska.edu/jobs/d3ebab24-25c9-44ee-a16a-05c683284edc, which includes a Map state with three iterations. All three iterations show an Aborted status in the step function execution console (or "Canceled" if you're looking at the graph view), though the overall step function execution still shows Failed. In this case, each iteration consists of a Task state that attempts a batch:submitJob call, and that call was given a non-string parameter, so the step failed with:
An error occurred while executing the state 'OPERA_DISP_TMS_CREATE_MEASUREMENT_GEOTIFF_SUBMIT_JOB' (entered at the event id #23). The Parameters '{...}' could not be used to start the Task: [The value for the field 'frame_id' must be a STRING]
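The error above indicates that every value in the parameters passed to batch:submitJob must be a string. As a minimal sketch of how that could be checked or corrected before the parameters reach the step function, the following Python helpers flag and stringify non-string values. The function names and the example values are hypothetical and are not taken from the HyP3 codebase.

```python
from typing import Any, Mapping


def non_string_parameters(parameters: Mapping[str, Any]) -> list[str]:
    """Return the names of parameters whose values are not strings."""
    return [name for name, value in parameters.items() if not isinstance(value, str)]


def stringify_parameters(parameters: Mapping[str, Any]) -> dict[str, str]:
    """Return a copy of the parameters with every non-string value converted via str()."""
    return {
        name: value if isinstance(value, str) else str(value)
        for name, value in parameters.items()
    }


# Hypothetical example: an integer frame_id would be rejected by the Task state.
example = {'frame_id': 123, 'bucket_prefix': 'abc'}
print(non_string_parameters(example))  # names of the offending fields
print(stringify_parameters(example))   # string-to-string map
```

Whether such a conversion belongs in the job spec, the API layer, or the step function definition itself is a separate design question; this only illustrates the constraint.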
I know that when we first implemented fan-out workflows using the Map state (for SRG_TIME_SERIES.yml), we specifically verified that the HyP3 job fails when one of the iterations fails due to a runtime error in the container. Maybe in this case the iterations just get Aborted because the batch:submitJob Task failed to start?
I'm still surprised that this doesn't qualify as an error that would be handled by the Catch field in our step function definition. We should keep an eye out for any more stuck-in-pending jobs, in case this is a recent change to how step functions work.
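One way to keep an eye out for stuck jobs would be to periodically flag jobs that have stayed in PENDING longer than expected. The sketch below is only an illustration: it assumes each job is a dictionary with `job_id`, `status_code`, and a timezone-aware ISO 8601 `request_time`, and the function name and threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def find_stale_pending_jobs(
    jobs: list[dict],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return the IDs of jobs that are still PENDING and were requested more than max_age ago."""
    if now is None:
        now = datetime.now(timezone.utc)
    stale = []
    for job in jobs:
        if job['status_code'] != 'PENDING':
            continue
        # Assumes request_time carries a UTC offset, e.g. '2025-01-01T00:00:00+00:00'.
        requested = datetime.fromisoformat(job['request_time'])
        if now - requested > max_age:
            stale.append(job['job_id'])
    return stale
```

The right threshold depends on how long jobs normally wait before they start running, so it is left as a parameter here.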