Under some circumstances, create_test --retry leads to multiple instances of a test in the job queue #4862

@billsacks

@fischer-ncar reported some situations where his ERS tests were ending up with negative STOP_N. Digging into this, I think what happened was that multiple instances of the test were trying to run at the same time, and so were stomping on each other. He ran create_test with --retry 2, and I saw the following in his CaseStatus, which indicates that:

  • The initial submission happened
  • About 5.5 hours later (and, it seems, exactly 6 hours after the create_test job started), a second submission happened, without the first submission having run yet. A key point here is that his create_test job was submitted to a batch node with a wallclock time of 6 hours
  • A few seconds later, a third submission happened
  • The following day, the initial submission ran to completion
  • A bit later, the two resubmissions ran within a few minutes of each other, with overlapping run times; this led to a failure

Relevant excerpt from CaseStatus:
2025-09-26 11:55:27: case.submit starting case.test:3242529.desched1
2025-09-26 11:55:27: case.submit success case.test:3242529.desched1
2025-09-26 17:29:32: case.submit starting case.test:3244839.desched1
2025-09-26 17:29:32: case.submit success case.test:3244839.desched1
2025-09-26 17:29:35: case.submit starting case.test:3244895.desched1
2025-09-26 17:29:35: case.submit success case.test:3244895.desched1

2025-09-27 11:27:01: case.run starting 3242529.desched1
2025-09-27 11:27:06: model execution starting 3242529.desched1
2025-09-27 11:36:57: model execution success 3242529.desched1
2025-09-27 11:36:57: case.run success 3242529.desched1
2025-09-27 11:37:18: case.run starting 3242529.desched1
2025-09-27 11:37:21: model execution starting 3242529.desched1
2025-09-27 11:42:02: model execution success 3242529.desched1
2025-09-27 11:42:02: case.run success 3242529.desched1

2025-09-27 15:30:46: case.run starting 3244839.desched1
2025-09-27 15:30:49: model execution starting 3244839.desched1

2025-09-27 15:40:23: case.run starting 3244895.desched1
2025-09-27 15:40:27: model execution starting 3244895.desched1

2025-09-27 15:40:30: model execution success 3244839.desched1
2025-09-27 15:40:30: case.run success 3244839.desched1
2025-09-27 15:40:50: case.run starting 3244839.desched1
2025-09-27 15:40:54: model execution starting 3244839.desched1
2025-09-27 15:45:33: model execution success 3244839.desched1
2025-09-27 15:45:33: case.run success 3244839.desched1

2025-09-27 15:50:04: model execution success 3244895.desched1
2025-09-27 15:50:04: case.run success 3244895.desched1
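
For anyone who wants to check their own CaseStatus for the same symptom, here is a rough Python sketch that pairs up the "model execution starting" / "model execution success" lines by job ID and flags executions from different job IDs whose run times overlap. The function names are made up for illustration and are not part of CIME; it also assumes the line layout shown in the excerpt above.

```python
from datetime import datetime
from itertools import combinations

# "model execution" lines copied from the CaseStatus excerpt above.
SAMPLE = """\
2025-09-27 11:27:06: model execution starting 3242529.desched1
2025-09-27 11:36:57: model execution success 3242529.desched1
2025-09-27 11:37:21: model execution starting 3242529.desched1
2025-09-27 11:42:02: model execution success 3242529.desched1
2025-09-27 15:30:49: model execution starting 3244839.desched1
2025-09-27 15:40:27: model execution starting 3244895.desched1
2025-09-27 15:40:30: model execution success 3244839.desched1
2025-09-27 15:40:54: model execution starting 3244839.desched1
2025-09-27 15:45:33: model execution success 3244839.desched1
2025-09-27 15:50:04: model execution success 3244895.desched1
""".splitlines()


def execution_intervals(lines):
    """Return (job_id, start, end) for each completed model execution."""
    open_starts = {}  # job_id -> start time of the execution still in progress
    intervals = []
    for line in lines:
        line = line.strip()
        if "model execution" not in line:
            continue
        stamp, _, rest = line.partition(": ")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        words = rest.split()  # ["model", "execution", <state>, <job_id>]
        state, job_id = words[2], words[3]
        if state == "starting":
            open_starts[job_id] = when
        elif state == "success" and job_id in open_starts:
            intervals.append((job_id, open_starts.pop(job_id), when))
    return intervals


def overlapping_runs(intervals):
    """Return pairs of executions from different jobs that overlap in time."""
    return [
        (a, b)
        for a, b in combinations(intervals, 2)
        if a[0] != b[0] and a[1] < b[2] and b[1] < a[2]
    ]


if __name__ == "__main__":
    for a, b in overlapping_runs(execution_intervals(SAMPLE)):
        print(f"{a[0]} ({a[1]} to {a[2]}) overlaps {b[0]} ({b[1]} to {b[2]})")
```

On the excerpt above, this flags both executions of 3244839.desched1 as overlapping with the single execution of 3244895.desched1, and nothing involving the original 3242529.desched1 job.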

My hypothesis is that as the create_test job was about to run out of wallclock time, there was some signal that caused the wait_for_test thread to abort... then the parent thread saw that this child thread had finished without a successful test result, which was taken to imply test failure, so it resubmitted the job... and then the same thing happened a few seconds later. (But I don't see any instances of "RECEIVED SIGNAL" in the output files, which I was kind of expecting to see based on a read of the code, so I'm not confident about what's happening.)

Chris and I have both reproduced this consistently by submitting a create_test job that has very little wallclock time - e.g., qcmd -l walltime=00:15:00 -- ./create_test ERS.TL319_t232.G_JRA.derecho_gnu.cice-default --retry 2. (This uses the qcmd script on NCAR's derecho machine, which submits the given command as a batch job. I notice that qcmd does various things with signal handling itself, but I have also reproduced it using a simpler mechanism that doesn't do this signal handling.)

I'm not sure if this behavior is derecho-specific or more general. I tried looking into it, but I don't understand the wait-for-tests threading well enough to work out what's causing this or how to fix it. It seems like the best behavior in this situation is for the wait-for-tests to exit without triggering any test resubmission. That will mean that the retry functionality won't work (and anything else that wait-for-tests usually does... I forget what that might be), but that seems better than getting multiple instances of the test job stomping on each other.
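
To make the suggested behavior concrete, here is a minimal Python sketch of the idea: treat "the waiter was interrupted" as a separate outcome from "the test failed", and only resubmit on a definite failure. This is not CIME's actual code. `Outcome`, `wait_for_result`, `run_with_retry`, and the `submit` / `poll` callables are all hypothetical names, and it assumes that whatever is ending the create_test job can be turned into a `threading.Event` being set (for example from a signal handler, if a signal is in fact what's happening).

```python
import threading
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    INTERRUPTED = "interrupted"  # waiter stopped before the test reported a result


def wait_for_result(poll, shutdown, interval=0.01):
    """Poll until the test reports a status or shutdown is requested.

    poll() is a hypothetical callable returning None while the test is still
    pending, True on pass, and False on fail.
    """
    while not shutdown.is_set():
        status = poll()
        if status is not None:
            return Outcome.PASS if status else Outcome.FAIL
        shutdown.wait(interval)  # sleep, but wake up early on shutdown
    return Outcome.INTERRUPTED


def run_with_retry(submit, poll, shutdown, retries):
    """Submit once, then resubmit only after a definite FAIL."""
    submissions = 0
    outcome = Outcome.INTERRUPTED
    for _ in range(retries + 1):
        submit()
        submissions += 1
        outcome = wait_for_result(poll, shutdown)
        if outcome is not Outcome.FAIL:
            break  # PASS or INTERRUPTED: never resubmit
    return outcome, submissions


if __name__ == "__main__":
    shutdown = threading.Event()
    shutdown.set()  # simulate the create_test job being told to stop
    outcome, count = run_with_retry(lambda: None, lambda: None, shutdown, retries=2)
    print(outcome, count)
```

With the shutdown event already set, this submits once and reports INTERRUPTED instead of using up the retries. The tradeoff is the one described above: a retry that would have been legitimate is lost, but no duplicate jobs end up in the queue.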

@jgfouca do you have ideas about what might be going on here and possibly how to fix this?
