Summary
When an arcade-pr Test stage is retried in AzDO, the Helix Job Monitor inherits Helix jobs from the prior attempt that never completed (e.g. work items stuck Waiting due to a starved queue). Those jobs are silently re-attached to the monitor's watch list but are not resubmitted. If the queue stays unhealthy — or the queue is purged so those work items will never transition to Finished — the retried stage just waits another ~90 minutes and times out again, accomplishing nothing.
This came up on dotnet/arcade PR #16899 / AzDO build 1438541. The original Test stage timed out because two osx.15.amd64.open Helix jobs (391427a2-1d63-4117-8957-c0c41beed31a, 53f8eec1-0ca1-4be8-8d75-1f49d8be09a8) had 24 Waiting work items each on a starved queue. The osx queue was subsequently purged, so those work items will likely never be marked Finished. After retrying the Test stage, the monitor for attempt 2 reattached those same two jobs and resumed waiting on them — see the log block below.
Attempt 2 log evidence
From Monitor Helix Jobs (attempt 2) on build 1438541:
info: 🔁 Checking for failed Helix jobs to resubmit the failed work items...
info: Resubmitting 1 failed work item(s) for job Windows_NT Build_Release - ubuntu.2204.amd64.open (c81c694c-1c72-4de6-9dc6-7ae2d6be13c1):
- Microsoft.DotNet.Helix.Sdk.Tests.dll
info: Resubmitted ... as new job ... (8697efcb-be70-4a1a-acea-1ad9e5c62ebb)
info: Resubmitting 1 failed work item(s) for job Windows_NT Build_Release - ubuntu.2204.amd64.open (0390a674-703a-4954-b666-aabdc22e4532):
- Microsoft.DotNet.Helix.Sdk.Tests.dll
info: Resubmitted ... as new job ... (d2a47f50-778c-469d-bd59-06ec21e0c8af)
info: ✅ Job 'Linux Build_Debug - ubuntu.2204.amd64.open (35ee9f09-...)' succeeded (24 passed, 0 failed)
info: ✅ Job 'Windows_NT Build_Release - windows.11.amd64.client.open (5bce8cbe-...)' succeeded (25 passed, 0 failed)
info: ✅ Job 'Linux Build_Debug - ubuntu.2204.amd64.open (727a99f6-...)' succeeded (24 passed, 0 failed)
info: ✅ Job 'Linux Build_Debug - windows.11.amd64.client.open (85ab4c7c-...)' succeeded (25 passed, 0 failed)
info: ❌ Work item 'Microsoft.DotNet.Helix.Sdk.Tests.dll' in job '... (0390a674-...)' failed (Finished, exit code 1).
info: ❌ Work item 'Microsoft.DotNet.Helix.Sdk.Tests.dll' in job '... (c81c694c-...)' failed (Finished, exit code 1).
info: ℹ️ Status: 6 processed / 6 completed / 4 running / 0 waiting jobs
146 processed / 146 completed / 50 running / 0 waiting work items
...
info: ✅ Job '... (8697efcb-...)' succeeded (1 passed, 0 failed)
info: ✅ Job '... (d2a47f50-...)' succeeded (1 passed, 0 failed)
info: ℹ️ Status: 8 processed / 8 completed / 2 running / 0 waiting jobs
148 processed / 148 completed / 48 running / 0 waiting work items
Observations:
- Only the two ubuntu jobs (c81c694c, 0390a674) get a Resubmitting… line. Neither osx job (391427a2, 53f8eec1) ever appears in the attempt 2 log.
- After the two resubmits drain, the counters settle at 2 running jobs / 48 running work items. That's exactly the two original stuck osx jobs × 24 Waiting work items each: the monitor inherited them and is sitting on them.
- The osx queue has since been purged, so those work items will likely never be marked Finished. Attempt 2 is on track to time out the same way attempt 1 did.
Root cause
In ResubmitFailedJobsAsync (src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs, around line 845):
    IReadOnlyList<HelixJobInfo> allJobs = await _helix.GetJobsForBuildAsync(_helixSource, _options.BuildId, cancellationToken);
    IReadOnlyList<HelixJobInfo> scopedJobs = [ ..allJobs.Where(IsHelixJobInScope) ];
    IReadOnlyList<HelixJobInfo> latestJobs = GetLatestHelixJobAttempts(scopedJobs);

    List<HelixJobInfo> completedHelixJobs =
    [
        ..latestJobs.Where(j => j.IsCompleted && IsHelixJobInScope(j))
    ];

    foreach (HelixJobInfo completedJob in completedHelixJobs)
    {
        // ... list work items, resubmit failed ones ...
    }
- GetJobsForBuildAsync filters Helix by BuildId == <azdo build id>. The build id is shared across stage retries, so attempt 2 inherits every Helix job submitted by attempt 1 (including the two stuck osx jobs).
- The resubmit loop only processes j.IsCompleted jobs, i.e. jobs whose Helix Finished timestamp is set. A job whose work items are still Waiting because the queue is starved (or purged) is not IsCompleted, so it is silently skipped.
- Those skipped jobs are still placed back on the monitor's watch list via JobsForFirstPoll = [..scopedJobs, ..resubmittedJobs], so attempt 2 just sits and waits on jobs that will never finish.
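The split described above can be modelled in isolation. The following is a minimal sketch, not the monitor's actual code: InheritedJob and RetryPartition are hypothetical names introduced here for illustration, and only the IsCompleted flag is taken from the snippet above.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for HelixJobInfo; it carries only what this sketch needs.
public sealed record InheritedJob(string Id, string Queue, bool IsCompleted);

public static class RetryPartition
{
    // Splits the jobs found for a build id into the two groups the retry path
    // effectively creates: completed jobs, which are inspected for failed work
    // items, and non-completed jobs, which are only put back on the watch list.
    public static (List<InheritedJob> Inspected, List<InheritedJob> OnlyRewatched) Split(
        IEnumerable<InheritedJob> jobsForBuild)
    {
        var inspected = new List<InheritedJob>();
        var onlyRewatched = new List<InheritedJob>();

        foreach (InheritedJob job in jobsForBuild)
        {
            // Mirrors the j.IsCompleted filter: anything not completed is never
            // considered for resubmission, whatever the reason it is stuck.
            (job.IsCompleted ? inspected : onlyRewatched).Add(job);
        }

        return (inspected, onlyRewatched);
    }
}
```

Fed the four jobs from this incident (c81c694c and 0390a674 completed, 391427a2 and 53f8eec1 not), Split puts both osx jobs in OnlyRewatched, which matches the 2 running jobs the attempt 2 counters settle on.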
This is a "design-as-intended" path for the happy case (don't double-submit work that's still legitimately in progress), but it has no recovery story for queue-starvation or queue-purge scenarios, which are the most common reasons someone actually clicks "Re-run failed jobs" on the Test stage.
Proposed direction
A few options, in roughly increasing invasiveness:
- Log inherited non-completed jobs prominently on entry. Cheapest; doesn't fix the retry but makes the next timeout less surprising. Something like:
  Inheriting N non-completed Helix job(s) from a previous attempt: <list>. These will not be resubmitted; they will be re-monitored. If their queue is unhealthy this attempt will time out again.
- Treat "all work items still Waiting/Unscheduled and parent build attempt > 1" as a resubmit candidate. Targets stuck/purged-queue scenarios specifically without changing semantics for live-but-slow jobs.
- On a stage retry, cancel + resubmit prior-attempt jobs that haven't reached Finished. Most useful for queue starvation, but needs care so a slow-but-live job isn't double-submitted; probably needs an age threshold.
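For the logging option, a minimal sketch of how the message could be assembled is below. InheritedJobWarning and its Build method are hypothetical names, not existing monitor code; the message text is the one proposed in the list above.

```csharp
#nullable enable
using System.Collections.Generic;

public static class InheritedJobWarning
{
    // Builds the proposed warning for jobs inherited from a previous attempt.
    // Returns null when nothing was inherited, so the caller can skip logging.
    public static string? Build(IReadOnlyCollection<string> nonCompletedJobIds)
    {
        if (nonCompletedJobIds.Count == 0)
        {
            return null;
        }

        return $"Inheriting {nonCompletedJobIds.Count} non-completed Helix job(s) from a previous attempt: "
            + string.Join(", ", nonCompletedJobIds)
            + ". These will not be resubmitted; they will be re-monitored. "
            + "If their queue is unhealthy this attempt will time out again.";
    }
}
```

Calling Build with the two stuck osx job ids would yield a single line that names both jobs, which is the information that was missing from the attempt 2 log.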
Happy to take a stab at option 2 (resubmitting jobs whose work items are all still Waiting/Unscheduled) if there's agreement on the approach.
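As a starting point for discussion, the second option could reduce to a predicate like the one below. This is a sketch under assumptions: WorkItemState and StuckJobHeuristic are hypothetical names, the Running state is assumed from the status counters in the log, and the minimumAge parameter borrows the age-threshold idea from the third option rather than anything the monitor has today.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical work item states. Waiting, Unscheduled and Finished are named in
// this issue; Running is assumed from the monitor's status counters.
public enum WorkItemState { Unscheduled, Waiting, Running, Finished }

public static class StuckJobHeuristic
{
    // True when a job inherited by a retried stage looks stuck rather than slow:
    // the stage is on attempt 2 or later, the job is older than minimumAge, and
    // every work item is still Waiting or Unscheduled.
    public static bool IsResubmitCandidate(
        IReadOnlyCollection<WorkItemState> workItemStates,
        int stageAttempt,
        TimeSpan jobAge,
        TimeSpan minimumAge)
    {
        if (stageAttempt <= 1) return false;          // first attempt: nothing is inherited
        if (workItemStates.Count == 0) return false;  // nothing known about the job yet
        if (jobAge < minimumAge) return false;        // may simply not have been scheduled yet

        return workItemStates.All(
            s => s == WorkItemState.Waiting || s == WorkItemState.Unscheduled);
    }
}
```

A job with any Running or Finished work item is left alone, which keeps the current behaviour for live-but-slow jobs. The right value for minimumAge is an open question.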
Related
- helix_batch_status mis-counting Waiting work items as failed, which is what initially misled triage of this incident.
cc @premun