@fischer-ncar reported some situations where his ERS tests were ending up with negative STOP_N. Digging into this, I think what happened was that multiple instances of the test were trying to run at the same time, and so were stomping on each other. He ran create_test with --retry 2, and I saw the following in his CaseStatus, which indicates that:
- The initial submission happened
- About 6 hours later (and, it seems, exactly 6 hours after the create_test job started), a second submission happened, without the first submission having run yet. A key point here is that his create_test job was submitted to a batch node with a wallclock limit of 6 hours
- A few seconds later, a third submission happened
- The following day, the initial submission ran to completion
- A bit later, the two resubmissions ran within a few minutes of each other, with overlapping run times; this led to a failure
Relevant excerpt from CaseStatus
2025-09-26 11:55:27: case.submit starting case.test:3242529.desched1
2025-09-26 11:55:27: case.submit success case.test:3242529.desched1
2025-09-26 17:29:32: case.submit starting case.test:3244839.desched1
2025-09-26 17:29:32: case.submit success case.test:3244839.desched1
2025-09-26 17:29:35: case.submit starting case.test:3244895.desched1
2025-09-26 17:29:35: case.submit success case.test:3244895.desched1
2025-09-27 11:27:01: case.run starting 3242529.desched1
2025-09-27 11:27:06: model execution starting 3242529.desched1
2025-09-27 11:36:57: model execution success 3242529.desched1
2025-09-27 11:36:57: case.run success 3242529.desched1
2025-09-27 11:37:18: case.run starting 3242529.desched1
2025-09-27 11:37:21: model execution starting 3242529.desched1
2025-09-27 11:42:02: model execution success 3242529.desched1
2025-09-27 11:42:02: case.run success 3242529.desched1
2025-09-27 15:30:46: case.run starting 3244839.desched1
2025-09-27 15:30:49: model execution starting 3244839.desched1
2025-09-27 15:40:23: case.run starting 3244895.desched1
2025-09-27 15:40:27: model execution starting 3244895.desched1
2025-09-27 15:40:30: model execution success 3244839.desched1
2025-09-27 15:40:30: case.run success 3244839.desched1
2025-09-27 15:40:50: case.run starting 3244839.desched1
2025-09-27 15:40:54: model execution starting 3244839.desched1
2025-09-27 15:45:33: model execution success 3244839.desched1
2025-09-27 15:45:33: case.run success 3244839.desched1
2025-09-27 15:50:04: model execution success 3244895.desched1
2025-09-27 15:50:04: case.run success 3244895.desched1
My hypothesis is that, as the create_test job was about to run out of wallclock time, some signal caused the wait_for_test thread to abort. The parent thread then saw that this child thread had finished without a successful test result, took that to imply test failure, and resubmitted the job... and the same thing happened again a few seconds later. (But I don't see any instances of "RECEIVED SIGNAL" in the output files, which I was kind of expecting to see based on a read of the code, so I'm not confident about what's actually happening.)
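
For what it's worth, here is a minimal, self-contained sketch of the race I'm hypothesizing. This is plain Python with made-up names, not CIME's actual wait_for_tests code: the point is just that a wait thread that gets stopped before it records a PASS looks, to the parent's retry loop, exactly like a test that failed, so the parent resubmits even though the original job is still sitting in the queue.

```python
# Hypothetical sketch (NOT CIME's actual code) of the suspected failure mode.
import threading
import queue
import time

results = queue.Queue()

def wait_for_test(test_name, interrupted):
    """Stand-in for the wait_for_test worker thread."""
    while not interrupted.is_set():
        time.sleep(1)                      # poll the TestStatus file in real life
        # ... if TestStatus showed PASS, we would do:
        # results.put((test_name, "PASS")); return
    # Thread told to stop (e.g., walltime about to expire): it exits
    # WITHOUT ever putting a success result on the queue.

def parent_retry_loop(test_name, retries):
    """Stand-in for the --retry logic in the parent."""
    for attempt in range(retries + 1):
        interrupted = threading.Event()
        t = threading.Thread(target=wait_for_test, args=(test_name, interrupted))
        t.start()
        time.sleep(2)          # pretend the walltime runs out almost immediately
        interrupted.set()      # something (a signal?) stops the wait thread
        t.join()
        if results.empty():
            # No PASS recorded -> parent assumes the test FAILED and resubmits,
            # even though the original submission is still queued.
            print(f"attempt {attempt}: no result, resubmitting {test_name}")
        else:
            break

parent_retry_loop("ERS.TL319_t232.G_JRA.derecho_gnu.cice-default", retries=2)
```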
Chris and I have both reproduced this consistently by submitting a create_test job with a very short wallclock limit, e.g., qcmd -l walltime=00:15:00 -- ./create_test ERS.TL319_t232.G_JRA.derecho_gnu.cice-default --retry 2. (This uses the qcmd script on NCAR's derecho machine, which submits the given command as a batch job. I notice that qcmd does various things with signal handling itself, but I have also reproduced the problem using a simpler mechanism that doesn't do that signal handling.)
I'm not sure whether this behavior is derecho-specific or more general. I tried looking into it, but I don't understand the wait-for-tests threading well enough to pin down what's causing this or how to fix it. It seems like the best behavior in this situation would be for wait-for-tests to exit without triggering any test resubmission. That would mean the retry functionality won't work (along with anything else that wait-for-tests usually does... I forget what that might be), but that seems better than having multiple instances of the test job stomping on each other.
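
To make that concrete, one rough idea (hypothetical names here, not CIME's actual API) would be for the parent create_test process to note when it is being shut down by the batch system and skip resubmission in that case:

```python
# Sketch of one possible guard: if the parent process is being terminated by
# the batch system, record that fact and do NOT resubmit when a wait thread
# comes back without a result. Names are made up for illustration.
import signal
import threading

shutting_down = threading.Event()

def _on_terminate(signum, frame):
    # Batch systems typically send SIGTERM (sometimes SIGUSR1/SIGUSR2)
    # shortly before the walltime limit is reached.
    shutting_down.set()

# Signal handlers can only be installed from the main thread.
for sig in (signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2):
    signal.signal(sig, _on_terminate)

def maybe_resubmit(test_name, got_result):
    """Hypothetical decision point in the retry logic."""
    if got_result:
        return False                       # test reported a result; nothing to do
    if shutting_down.is_set():
        # We are exiting because the create_test job itself is out of time,
        # not because the test failed: leave the queued test job alone.
        print(f"walltime expiring; NOT resubmitting {test_name}")
        return False
    print(f"no result and not shutting down; resubmitting {test_name}")
    return True
```

The catch, of course, is knowing which signal(s) the scheduler actually delivers before the walltime is up; the absence of "RECEIVED SIGNAL" in the output files suggests I don't fully understand that part yet.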
@jgfouca do you have ideas about what might be going on here and possibly how to fix this?