JobMonitor: retried Test stage does not resubmit work items from non-completed Helix jobs inherited from a prior attempt

## Summary

When an arcade-pr Test stage is retried in AzDO, the Helix Job Monitor inherits Helix jobs from the prior attempt that never completed (e.g. work items stuck `Waiting` due to a starved queue). Those jobs are silently re-attached to the monitor's watch list but are **not** resubmitted. If the queue stays unhealthy — or the queue is purged so those work items will never transition to `Finished` — the retried stage just waits another ~90 minutes and times out again, accomplishing nothing.

This came up on dotnet/arcade PR #16899 / AzDO build 1438541. The original Test stage timed out because two `osx.15.amd64.open` Helix jobs (`391427a2-1d63-4117-8957-c0c41beed31a`, `53f8eec1-0ca1-4be8-8d75-1f49d8be09a8`) had 24 Waiting work items each on a starved queue. The osx queue was subsequently purged, so those work items will likely never be marked `Finished`. After retrying the Test stage, the monitor for attempt 2 reattached those same two jobs and resumed waiting on them — see the log block below.

## Attempt 2 log evidence

From `Monitor Helix Jobs` (attempt 2) on build 1438541:

```
info: 🔁 Checking for failed Helix jobs to resubmit the failed work items...
info: Resubmitting 1 failed work item(s) for job Windows_NT Build_Release - ubuntu.2204.amd64.open (c81c694c-1c72-4de6-9dc6-7ae2d6be13c1):
      - Microsoft.DotNet.Helix.Sdk.Tests.dll
info: Resubmitted ... as new job ... (8697efcb-be70-4a1a-acea-1ad9e5c62ebb)
info: Resubmitting 1 failed work item(s) for job Windows_NT Build_Release - ubuntu.2204.amd64.open (0390a674-703a-4954-b666-aabdc22e4532):
      - Microsoft.DotNet.Helix.Sdk.Tests.dll
info: Resubmitted ... as new job ... (d2a47f50-778c-469d-bd59-06ec21e0c8af)
info: ✅ Job 'Linux Build_Debug - ubuntu.2204.amd64.open (35ee9f09-...)' succeeded (24 passed, 0 failed)
info: ✅ Job 'Windows_NT Build_Release - windows.11.amd64.client.open (5bce8cbe-...)' succeeded (25 passed, 0 failed)
info: ✅ Job 'Linux Build_Debug - ubuntu.2204.amd64.open (727a99f6-...)' succeeded (24 passed, 0 failed)
info: ✅ Job 'Linux Build_Debug - windows.11.amd64.client.open (85ab4c7c-...)' succeeded (25 passed, 0 failed)
info: ❌ Work item 'Microsoft.DotNet.Helix.Sdk.Tests.dll' in job '... (0390a674-...)' failed (Finished, exit code 1).
info: ❌ Work item 'Microsoft.DotNet.Helix.Sdk.Tests.dll' in job '... (c81c694c-...)' failed (Finished, exit code 1).
info: ℹ️ Status: 6 processed / 6 completed / 4 running / 0 waiting jobs
                 146 processed / 146 completed / 50 running / 0 waiting work items
...
info: ✅ Job '... (8697efcb-...)' succeeded (1 passed, 0 failed)
info: ✅ Job '... (d2a47f50-...)' succeeded (1 passed, 0 failed)
info: ℹ️ Status: 8 processed / 8 completed / 2 running / 0 waiting jobs
                 148 processed / 148 completed / 48 running / 0 waiting work items
```

Observations:

- Only the two ubuntu jobs (`c81c694c`, `0390a674`) get a `Resubmitting…` line. Neither osx job (`391427a2`, `53f8eec1`) ever appears in the attempt 2 log.
- After the two resubmits drain, the counters settle at `2 running jobs / 48 running work items`. That's exactly the two original stuck osx jobs × 24 Waiting work items each — the monitor inherited them and is sitting on them.
- The osx queue has since been purged, so those work items will likely never be marked `Finished`. Attempt 2 is on track to time out the same way attempt 1 did.

## Root cause

In `ResubmitFailedJobsAsync` (`src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs`, around line 845):

```csharp
IReadOnlyList<HelixJobInfo> allJobs = await _helix.GetJobsForBuildAsync(_helixSource, _options.BuildId, cancellationToken);
IReadOnlyList<HelixJobInfo> scopedJobs = [ ..allJobs.Where(IsHelixJobInScope) ];
IReadOnlyList<HelixJobInfo> latestJobs = GetLatestHelixJobAttempts(scopedJobs);
List<HelixJobInfo> completedHelixJobs =
[
    ..latestJobs.Where(j => j.IsCompleted && IsHelixJobInScope(j))
];

foreach (HelixJobInfo completedJob in completedHelixJobs)
{
    // ... list work items, resubmit failed ones ...
}
```

1. `GetJobsForBuildAsync` filters Helix by `BuildId == <azdo build id>`. The build id is shared across stage retries, so attempt 2 inherits every Helix job submitted by attempt 1 (including the two stuck osx jobs).
2. The resubmit loop only processes `j.IsCompleted` jobs — i.e. jobs whose Helix `Finished` timestamp is set. A job whose work items are still `Waiting` because the queue is starved (or purged) is not `IsCompleted`, so it is silently skipped.
3. Those skipped jobs are still placed back on the monitor's watch list via `JobsForFirstPoll = [..scopedJobs, ..resubmittedJobs]`, so attempt 2 just sits and waits on jobs that will never finish.

This works as designed for the happy case (don't double-submit work that's still legitimately in progress), but it has no recovery story for queue-starvation or queue-purge scenarios, which are arguably the most common reason someone actually clicks "Re-run failed jobs" on the Test stage.
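To make the three steps above concrete, here is a minimal model of the selection logic. It is a sketch, not the monitor's code: `JobSketch` and `RetrySelectionSketch` are hypothetical stand-ins for `HelixJobInfo` and the runner, and the `GetLatestHelixJobAttempts` step is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for HelixJobInfo; only the fields this sketch needs.
public sealed record JobSketch(string Id, bool IsCompleted);

public static class RetrySelectionSketch
{
    // Mirrors the `j.IsCompleted` filter in the excerpt above:
    // only completed jobs are ever considered for resubmission.
    public static List<JobSketch> ResubmitCandidates(IEnumerable<JobSketch> scopedJobs) =>
        scopedJobs.Where(j => j.IsCompleted).ToList();

    // Jobs inherited from the prior attempt that the resubmit loop never looks at.
    public static List<JobSketch> InheritedButSkipped(IEnumerable<JobSketch> scopedJobs) =>
        scopedJobs.Where(j => !j.IsCompleted).ToList();

    // Mirrors `JobsForFirstPoll = [..scopedJobs, ..resubmittedJobs]`:
    // every scoped job is watched, including the skipped ones.
    public static List<JobSketch> WatchList(
        IEnumerable<JobSketch> scopedJobs,
        IEnumerable<JobSketch> resubmittedJobs) =>
        scopedJobs.Concat(resubmittedJobs).ToList();
}
```

Feeding this model one completed ubuntu job and the two non-completed osx jobs from the incident leaves both osx jobs in `InheritedButSkipped` and in `WatchList`, which is the state attempt 2 ends up waiting on.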

## Proposed direction

A few options, in roughly increasing invasiveness:

1. **Log inherited non-completed jobs prominently on entry.** Cheapest; doesn't fix the retry but makes the next timeout less surprising. Something like `Inheriting N non-completed Helix job(s) from a previous attempt: <list>. These will not be resubmitted; they will be re-monitored. If their queue is unhealthy this attempt will time out again.`
2. **Treat "all work items still `Waiting`/`Unscheduled` and parent build attempt > 1" as a resubmit candidate.** Targets stuck/purged-queue scenarios specifically without changing semantics for live-but-slow jobs.
3. **On a stage retry, cancel + resubmit prior-attempt jobs that haven't reached `Finished`.** Most useful for queue starvation, but needs care so a slow-but-live job isn't double-submitted; probably needs an age threshold.
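For option 1, the message could be built along these lines. This is only a sketch: `InheritedJobsLogSketch` and `BuildWarning` are hypothetical names, and how the monitor would obtain the list of non-completed job ids is assumed rather than taken from the existing code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InheritedJobsLogSketch
{
    // Returns an empty string when nothing was inherited, so callers can skip logging.
    public static string BuildWarning(IReadOnlyCollection<string> nonCompletedJobIds)
    {
        if (nonCompletedJobIds.Count == 0)
        {
            return string.Empty;
        }

        return $"Inheriting {nonCompletedJobIds.Count} non-completed Helix job(s) " +
               $"from a previous attempt: {string.Join(", ", nonCompletedJobIds)}. " +
               "These will not be resubmitted; they will be re-monitored. " +
               "If their queue is unhealthy this attempt will time out again.";
    }
}
```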

Happy to take a stab at option 2 if there's agreement on the approach.
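As a starting point for discussion, the candidate check for option 2 might look roughly like the following. Everything here is an assumption: `WorkItemStateSketch` is a hypothetical enum (the real Helix state names may differ beyond the `Waiting`, `Unscheduled`, and `Finished` values mentioned above), and the stage attempt number is assumed to be available to the monitor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical work item states for illustration only.
public enum WorkItemStateSketch { Unscheduled, Waiting, Running, Finished }

public static class StuckJobSketch
{
    // A job qualifies only on a retry (attempt > 1) and only if none of its
    // work items has started: every one is still Waiting or Unscheduled.
    // Jobs with no work items are excluded so an empty list never qualifies.
    public static bool IsResubmitCandidate(
        int stageAttempt,
        IReadOnlyCollection<WorkItemStateSketch> workItemStates) =>
        stageAttempt > 1
        && workItemStates.Count > 0
        && workItemStates.All(s =>
            s is WorkItemStateSketch.Waiting or WorkItemStateSketch.Unscheduled);
}
```

Requiring that no work item has started keeps a live-but-slow job out of the candidate set, which is the main difference from option 3.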

## Related

- dotnet/arcade#16904 — a separate stale-snapshot bug in the timeout reporter that this same incident exposed.
- lewing/helix.mcp#66 — fix for `helix_batch_status` mis-counting Waiting work items as failed, which is what initially misled triage of this incident.

cc @premun

