@fischer-ncar reported some situations where his ERS tests were ending up with negative STOP_N. Digging into this, I think what happened was that multiple instances of the test were trying to run at the same time, and so were stomping on each other. He ran create_test with --retry 2, and I saw the following in his CaseStatus, which indicates that:
- The initial submission happened
- About 6 hours later (and, it seems, exactly 6 hours after the create_test job started), a second submission happened, without the first submission having run yet. A key point here is that his create_test job was submitted to a batch node with a wallclock limit of 6 hours
- A few seconds later, a third submission happened
- The following day, the initial submission ran to completion
- A bit later, the two resubmissions ran within a few minutes of each other, with overlapping run times; this led to a failure
Relevant excerpt from CaseStatus
2025-09-26 11:55:27: case.submit starting case.test:3242529.desched1
2025-09-26 11:55:27: case.submit success case.test:3242529.desched1
2025-09-26 17:29:32: case.submit starting case.test:3244839.desched1
2025-09-26 17:29:32: case.submit success case.test:3244839.desched1
2025-09-26 17:29:35: case.submit starting case.test:3244895.desched1
2025-09-26 17:29:35: case.submit success case.test:3244895.desched1
2025-09-27 11:27:01: case.run starting 3242529.desched1
2025-09-27 11:27:06: model execution starting 3242529.desched1
2025-09-27 11:36:57: model execution success 3242529.desched1
2025-09-27 11:36:57: case.run success 3242529.desched1
2025-09-27 11:37:18: case.run starting 3242529.desched1
2025-09-27 11:37:21: model execution starting 3242529.desched1
2025-09-27 11:42:02: model execution success 3242529.desched1
2025-09-27 11:42:02: case.run success 3242529.desched1
2025-09-27 15:30:46: case.run starting 3244839.desched1
2025-09-27 15:30:49: model execution starting 3244839.desched1
2025-09-27 15:40:23: case.run starting 3244895.desched1
2025-09-27 15:40:27: model execution starting 3244895.desched1
2025-09-27 15:40:30: model execution success 3244839.desched1
2025-09-27 15:40:30: case.run success 3244839.desched1
2025-09-27 15:40:50: case.run starting 3244839.desched1
2025-09-27 15:40:54: model execution starting 3244839.desched1
2025-09-27 15:45:33: model execution success 3244839.desched1
2025-09-27 15:45:33: case.run success 3244839.desched1
2025-09-27 15:50:04: model execution success 3244895.desched1
2025-09-27 15:50:04: case.run success 3244895.desched1
My hypothesis is that, as the create_test job was about to run out of wallclock time, some signal caused the wait_for_test thread to abort. The parent thread then saw that this child thread had finished without a successful test result, took that to imply test failure, and resubmitted the job... and the same thing happened again a few seconds later. (But I don't see any instances of "RECEIVED SIGNAL" in the output files, which I was kind of expecting to see based on a read of the code, so I'm not confident about what's actually happening.)
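
For what it's worth, here is a minimal, self-contained sketch of the race I'm hypothesizing. This is plain Python with made-up names, not CIME's actual wait_for_tests code: the point is just that a wait thread that gets stopped before it records a PASS looks, to the parent's retry loop, exactly like a test that failed, so the parent resubmits even though the original job is still sitting in the queue.

```python
# Hypothetical sketch (NOT CIME's actual code) of the suspected failure mode.
import threading
import queue
import time

results = queue.Queue()

def wait_for_test(test_name, interrupted):
    """Stand-in for the wait_for_test worker thread."""
    while not interrupted.is_set():
        time.sleep(1)                      # poll the TestStatus file in real life
        # ... if TestStatus showed PASS, we would do:
        # results.put((test_name, "PASS")); return
    # Thread told to stop (e.g., walltime about to expire): it exits
    # WITHOUT ever putting a success result on the queue.

def parent_retry_loop(test_name, retries):
    """Stand-in for the --retry logic in the parent."""
    for attempt in range(retries + 1):
        interrupted = threading.Event()
        t = threading.Thread(target=wait_for_test, args=(test_name, interrupted))
        t.start()
        time.sleep(2)          # pretend the walltime runs out almost immediately
        interrupted.set()      # something (a signal?) stops the wait thread
        t.join()
        if results.empty():
            # No PASS recorded -> parent assumes the test FAILED and resubmits,
            # even though the original submission is still queued.
            print(f"attempt {attempt}: no result, resubmitting {test_name}")
        else:
            break

parent_retry_loop("ERS.TL319_t232.G_JRA.derecho_gnu.cice-default", retries=2)
```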
Chris and I have both reproduced this consistently by submitting a create_test job with a very short wallclock limit, e.g., qcmd -l walltime=00:15:00 -- ./create_test ERS.TL319_t232.G_JRA.derecho_gnu.cice-default --retry 2. (This uses the qcmd script on NCAR's derecho machine, which submits the given command as a batch job. I notice that qcmd does various things with signal handling itself, but I have also reproduced the problem using a simpler mechanism that doesn't do that signal handling.)
I'm not sure whether this behavior is derecho-specific or more general. I tried looking into it, but I don't understand the wait-for-tests threading well enough to pin down what's causing this or how to fix it. It seems like the best behavior in this situation would be for wait-for-tests to exit without triggering any test resubmission. That would mean the retry functionality won't work (along with anything else that wait-for-tests usually does... I forget what that might be), but that seems better than having multiple instances of the test job stomping on each other.
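
To make that concrete, one rough idea (hypothetical names here, not CIME's actual API) would be for the parent create_test process to note when it is being shut down by the batch system and skip resubmission in that case:

```python
# Sketch of one possible guard: if the parent process is being terminated by
# the batch system, record that fact and do NOT resubmit when a wait thread
# comes back without a result. Names are made up for illustration.
import signal
import threading

shutting_down = threading.Event()

def _on_terminate(signum, frame):
    # Batch systems typically send SIGTERM (sometimes SIGUSR1/SIGUSR2)
    # shortly before the walltime limit is reached.
    shutting_down.set()

# Signal handlers can only be installed from the main thread.
for sig in (signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2):
    signal.signal(sig, _on_terminate)

def maybe_resubmit(test_name, got_result):
    """Hypothetical decision point in the retry logic."""
    if got_result:
        return False                       # test reported a result; nothing to do
    if shutting_down.is_set():
        # We are exiting because the create_test job itself is out of time,
        # not because the test failed: leave the queued test job alone.
        print(f"walltime expiring; NOT resubmitting {test_name}")
        return False
    print(f"no result and not shutting down; resubmitting {test_name}")
    return True
```

The catch, of course, is knowing which signal(s) the scheduler actually delivers before the walltime is up; the absence of "RECEIVED SIGNAL" in the output files suggests I don't fully understand that part yet.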
@jgfouca do you have ideas about what might be going on here and possibly how to fix this?