JobMonitor: retried Test stage does not resubmit work items from non-completed Helix jobs inherited from a prior attempt #16905

@akoeplinger

Summary

When an arcade-pr Test stage is retried in AzDO, the Helix Job Monitor inherits Helix jobs from the prior attempt that never completed (e.g. work items stuck Waiting due to a starved queue). Those jobs are silently re-attached to the monitor's watch list but are not resubmitted. If the queue stays unhealthy — or the queue is purged so those work items will never transition to Finished — the retried stage just waits another ~90 minutes and times out again, accomplishing nothing.

This came up on dotnet/arcade PR #16899 / AzDO build 1438541. The original Test stage timed out because two osx.15.amd64.open Helix jobs (391427a2-1d63-4117-8957-c0c41beed31a, 53f8eec1-0ca1-4be8-8d75-1f49d8be09a8) had 24 Waiting work items each on a starved queue. The osx queue was subsequently purged, so those work items will likely never be marked Finished. After retrying the Test stage, the monitor for attempt 2 reattached those same two jobs and resumed waiting on them — see the log block below.

Attempt 2 log evidence

From Monitor Helix Jobs (attempt 2) on build 1438541:

info: 🔁 Checking for failed Helix jobs to resubmit the failed work items...
info: Resubmitting 1 failed work item(s) for job Windows_NT Build_Release - ubuntu.2204.amd64.open (c81c694c-1c72-4de6-9dc6-7ae2d6be13c1):
      - Microsoft.DotNet.Helix.Sdk.Tests.dll
info: Resubmitted ... as new job ... (8697efcb-be70-4a1a-acea-1ad9e5c62ebb)
info: Resubmitting 1 failed work item(s) for job Windows_NT Build_Release - ubuntu.2204.amd64.open (0390a674-703a-4954-b666-aabdc22e4532):
      - Microsoft.DotNet.Helix.Sdk.Tests.dll
info: Resubmitted ... as new job ... (d2a47f50-778c-469d-bd59-06ec21e0c8af)
info: ✅ Job 'Linux Build_Debug - ubuntu.2204.amd64.open (35ee9f09-...)' succeeded (24 passed, 0 failed)
info: ✅ Job 'Windows_NT Build_Release - windows.11.amd64.client.open (5bce8cbe-...)' succeeded (25 passed, 0 failed)
info: ✅ Job 'Linux Build_Debug - ubuntu.2204.amd64.open (727a99f6-...)' succeeded (24 passed, 0 failed)
info: ✅ Job 'Linux Build_Debug - windows.11.amd64.client.open (85ab4c7c-...)' succeeded (25 passed, 0 failed)
info: ❌ Work item 'Microsoft.DotNet.Helix.Sdk.Tests.dll' in job '... (0390a674-...)' failed (Finished, exit code 1).
info: ❌ Work item 'Microsoft.DotNet.Helix.Sdk.Tests.dll' in job '... (c81c694c-...)' failed (Finished, exit code 1).
info: ℹ️ Status: 6 processed / 6 completed / 4 running / 0 waiting jobs
                 146 processed / 146 completed / 50 running / 0 waiting work items
...
info: ✅ Job '... (8697efcb-...)' succeeded (1 passed, 0 failed)
info: ✅ Job '... (d2a47f50-...)' succeeded (1 passed, 0 failed)
info: ℹ️ Status: 8 processed / 8 completed / 2 running / 0 waiting jobs
                 148 processed / 148 completed / 48 running / 0 waiting work items

Observations:

  • Only the two ubuntu jobs (c81c694c, 0390a674) get a Resubmitting… line. Neither osx job (391427a2, 53f8eec1) ever appears in the attempt 2 log.
  • After the two resubmits drain, the counters settle at 2 running jobs / 48 running work items. That's exactly the two original stuck osx jobs × 24 Waiting work items each — the monitor inherited them and is sitting on them.
  • The osx queue has since been purged, so those work items will likely never be marked Finished. Attempt 2 is on track to time out the same way attempt 1 did.

Root cause

In ResubmitFailedJobsAsync (src/Microsoft.DotNet.Helix/JobMonitor/JobMonitorRunner.cs, around line 845):

IReadOnlyList<HelixJobInfo> allJobs = await _helix.GetJobsForBuildAsync(_helixSource, _options.BuildId, cancellationToken);
IReadOnlyList<HelixJobInfo> scopedJobs = [ ..allJobs.Where(IsHelixJobInScope) ];
IReadOnlyList<HelixJobInfo> latestJobs = GetLatestHelixJobAttempts(scopedJobs);
List<HelixJobInfo> completedHelixJobs =
[
    ..latestJobs.Where(j => j.IsCompleted && IsHelixJobInScope(j))
];

foreach (HelixJobInfo completedJob in completedHelixJobs)
{
    // ... list work items, resubmit failed ones ...
}

  1. GetJobsForBuildAsync filters Helix by BuildId == <azdo build id>. The build id is shared across stage retries, so attempt 2 inherits every Helix job submitted by attempt 1 (including the two stuck osx jobs).
  2. The resubmit loop only processes j.IsCompleted jobs, i.e. jobs whose Helix Finished timestamp is set. A job whose work items are still Waiting because the queue is starved (or purged) is not IsCompleted, so it is silently skipped.
  3. Those skipped jobs are still placed back on the monitor's watch list via JobsForFirstPoll = [..scopedJobs, ..resubmittedJobs], so attempt 2 just sits and waits on jobs that will never finish.
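
The three steps above can be reproduced with a small stand-alone sketch. The `JobSketch` record and the `WatchListSketch` class are hypothetical stand-ins invented for illustration; they are not the real `HelixJobInfo` or `JobMonitorRunner` types, and the resubmit step is reduced to a plain filter. The sketch only shows that a non-completed inherited job never enters the resubmit loop, yet still ends up on the first-poll watch list.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for HelixJobInfo (illustration only).
public sealed record JobSketch(string Id, bool IsCompleted);

public static class WatchListSketch
{
    // Mirrors the `j.IsCompleted` filter: only completed jobs are
    // inspected for failed work items to resubmit.
    public static List<JobSketch> ResubmitCandidates(IEnumerable<JobSketch> scopedJobs) =>
        scopedJobs.Where(j => j.IsCompleted).ToList();

    // The jobs the resubmit loop never looks at: inherited, in scope,
    // but not completed (for example, all work items still Waiting).
    public static List<JobSketch> SkippedByResubmit(IEnumerable<JobSketch> scopedJobs) =>
        scopedJobs.Where(j => !j.IsCompleted).ToList();

    // Mirrors JobsForFirstPoll = [..scopedJobs, ..resubmittedJobs]:
    // every scoped job is watched again, completed or not.
    public static List<JobSketch> FirstPollWatchList(
        IEnumerable<JobSketch> scopedJobs,
        IEnumerable<JobSketch> resubmittedJobs) =>
        scopedJobs.Concat(resubmittedJobs).ToList();
}
```

Fed the four inherited jobs from build 1438541 (the two completed ubuntu jobs and the two non-completed osx jobs), `ResubmitCandidates` returns only the ubuntu jobs, while `FirstPollWatchList` still contains both osx jobs.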

This is the intended behavior for the happy case (don't double-submit work that's still legitimately in progress), but it has no recovery story for queue-starvation or queue-purge scenarios, which are the most common reason someone actually clicks "Re-run failed jobs" on the Test stage.

Proposed direction

A few options, in roughly increasing invasiveness:

  1. Log inherited non-completed jobs prominently on entry. Cheapest; doesn't fix the retry but makes the next timeout less surprising. Something like: "Inheriting N non-completed Helix job(s) from a previous attempt: <list>. These will not be resubmitted; they will be re-monitored. If their queue is unhealthy this attempt will time out again."
  2. Treat "all work items still Waiting/Unscheduled and parent build attempt > 1" as a resubmit candidate. Targets stuck/purged-queue scenarios specifically without changing semantics for live-but-slow jobs.
  3. On a stage retry, cancel + resubmit prior-attempt jobs that haven't reached Finished. Most useful for queue starvation, but needs care so a slow-but-live job isn't double-submitted; probably needs an age threshold.
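
As a rough illustration, option 2 could be expressed as a predicate, with option 3's age threshold folded in as a guard. Everything here is hypothetical: `WorkItemSketch`, `StuckJobSketch`, the `stageAttempt` parameter, and the `minAge` threshold are invented for this sketch, and this issue does not specify how the monitor would learn the stage attempt number or the job age. The state names `Waiting` and `Unscheduled` are taken from the option text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified view of a Helix work item (illustration only).
public sealed record WorkItemSketch(string Name, string State);

public static class StuckJobSketch
{
    // States in which a work item has not started running yet.
    private static readonly HashSet<string> NotStartedStates =
        new(StringComparer.OrdinalIgnoreCase) { "Waiting", "Unscheduled" };

    public static bool IsResubmitCandidate(
        bool jobIsCompleted,
        int stageAttempt,
        IReadOnlyCollection<WorkItemSketch> workItems,
        TimeSpan jobAge,
        TimeSpan minAge)
    {
        // Completed jobs stay on the existing failed-work-item path.
        if (jobIsCompleted) return false;

        // On the first attempt the job is assumed to be legitimately in progress.
        if (stageAttempt <= 1) return false;

        // With no work item information, do not guess.
        if (workItems.Count == 0) return false;

        // Age guard so a slow-but-live job is not double-submitted.
        if (jobAge < minAge) return false;

        // Only treat the job as stuck if nothing in it has started.
        return workItems.All(w => NotStartedStates.Contains(w.State));
    }
}
```

Under these assumptions, a job like the two osx jobs above (not completed, stage attempt 2, all 24 work items Waiting, older than the threshold) would be a resubmit candidate, while a job with even one work item in any other state would be left alone.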

Happy to take a stab at option 2 if there's agreement on the approach.

Related

cc @premun
