Fix flaky tests: realistic timeouts and robust assertions #13250

Draft
JanProvaznik wants to merge 1 commit into dotnet:main from JanProvaznik:fix-flaky-tests-timeouts

@JanProvaznik (Member) commented Feb 12, 2026

Summary

Fix the top flaky tests identified by analyzing 100 recent CI builds (49% failure rate; 40 of the 49 failures were due to test flakes).

Changes

Genuine design fixes

  • Exec_Tests.Timeout: Warning count assertion changed from an exact match (ShouldBe(1) or ShouldBe(2)) to ShouldBeGreaterThanOrEqualTo(1). The flake is MSB5018 temp file lock contention, documented in issue #9176 ("[Flaky test] Microsoft.Build.UnitTests.Exec_Tests.Timeout") across 4+ CI failures from 2023-2025: another process (likely antivirus) holds a lock on the .exec.cmd temp file, which generates an additional warning. The timeout value (5ms) is kept as-is.
  • MSBuildServer mre.WaitOne(): Was an infinite wait with no timeout, a genuine hang risk. Now bounded to 60s with an assertion message.
  • MSBuildServer WaitForExit: Checked HasExited after WaitForExit returned (redundant and racy). Replaced with a single WaitForExit(30_000).ShouldBeTrue(msg).
  • GenerateResource ForceSomeOutOfDate: Removed a redundant 1000ms sleep between input touch and task execution.
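
The two bounded-wait fixes above follow the same pattern: pass an explicit upper bound and treat the boolean result as the only source of truth. A minimal sketch, assuming plain BCL types; the `BoundedWaits` class and its method names are hypothetical and not part of the MSBuild test suite, where the result is asserted with Shouldly's `ShouldBeTrue(message)` instead of throwing:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class BoundedWaits
{
    // Waits for a signal with an upper bound and fails with a message on
    // timeout, instead of blocking the test run indefinitely.
    public static void WaitOrFail(WaitHandle handle, TimeSpan timeout, string message)
    {
        if (!handle.WaitOne(timeout))
        {
            throw new TimeoutException(message);
        }
    }

    // Uses the return value of WaitForExit(int) directly, rather than calling
    // WaitForExit and then reading HasExited as a separate step.
    public static void WaitForExitOrFail(Process process, int timeoutMs, string message)
    {
        if (!process.WaitForExit(timeoutMs))
        {
            throw new TimeoutException(message);
        }
    }
}
```

With a helper like this, a hang surfaces as a failed test with a descriptive message after the bound elapses, rather than as a CI job that is killed by the pipeline timeout.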

Timeout increases (correct synchronization, too tight for CI)

The wait-style entries are max bounds that return as soon as the condition is met, so they cost nothing in the happy path; the Sleep entries are fixed delays and always add their full duration.

| Test | Old | New | What it waits for |
| --- | --- | --- | --- |
| CanceledTasksDoNotLogMSB4181 WaitOne | 2s | 30s | Build reaching Exec task |
| TaskNodesDieAfterBuild WaitForExit | 3s | 15s | Process exit after build |
| TaskNodesDieAfterBuild poll Sleep | 2s | 3s | Process startup |
| MSBuildServer KillTree | 1s | 10s | Process tree termination |
| DownloadFile.CanBeCanceled Wait | 1.5s | 10s | Cancel propagation |
| GenerateResource Sleep (×8 sites) | 200ms | 500ms | FS timestamp gap |
| GenerateResource Sleep (macOS) | 1s/100ms | 1.5s/500ms | HFS/NTFS granularity |
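
To illustrate why raising a max bound does not slow down a passing run, here is a small sketch that measures how long a 30-second bounded wait actually blocks when the event is signaled early. `HappyPathDemo` and `MeasureBoundedWait` are hypothetical names used only for this illustration, and the background task merely stands in for the build reaching the awaited point:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo, for illustration only.
public static class HappyPathDemo
{
    // Returns how long a 30-second bounded wait blocked when the event is
    // signaled after roughly signalDelayMs milliseconds.
    public static TimeSpan MeasureBoundedWait(int signalDelayMs)
    {
        using var signal = new ManualResetEvent(false);
        var stopwatch = Stopwatch.StartNew();

        // Stands in for the work the test is waiting on.
        Task signaler = Task.Run(() =>
        {
            Thread.Sleep(signalDelayMs);
            signal.Set();
        });

        // The 30s value is only an upper bound; WaitOne returns once Set() runs.
        bool signaled = signal.WaitOne(TimeSpan.FromSeconds(30));
        stopwatch.Stop();
        signaler.Wait();

        if (!signaled)
        {
            throw new TimeoutException("Signal was not observed within 30 seconds.");
        }

        return stopwatch.Elapsed;
    }
}
```

Calling `HappyPathDemo.MeasureBoundedWait(100)` should return an elapsed time close to the signal delay rather than anything near 30 seconds. A `Thread.Sleep`, by contrast, always blocks for its full argument, which is why only the Sleep rows contribute to wall-clock time.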

CI impact

  • WaitOne/WaitForExit are max bounds, so they add zero cost in the happy path
  • Added wall-clock time from the longer fixed sleeps: ~3 seconds across the entire test suite
  • Removing 1 redundant 1000ms sleep offsets part of this, so the net increase is ~2s

Validation

  • All 3 test projects build with 0 errors, 0 warnings
  • 10/10 passes: Engine.UnitTests (CanceledTasksDoNotLogMSB4181, TaskNodesDieAfterBuild)
  • 10/10 passes: Tasks.UnitTests (Exec_Tests.Timeout, DownloadFile_Tests.CanBeCanceled)
  • 10/10 passes: Tasks.UnitTests (GenerateResource timestamp tests)

Copilot AI review requested due to automatic review settings February 12, 2026 15:48

@Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to reduce CI flakiness across several unit test projects by adjusting unrealistic timeouts, bounding potentially infinite waits, and making assertions more tolerant of non-deterministic warnings.

Changes:

  • Increased various test timeouts / waits (e.g., Exec.Timeout, WaitOne, WaitForExit) to better match real scheduling and CI conditions.
  • Adjusted Exec_Tests.Timeout warning assertions to tolerate additional intermittent warnings (e.g., MSB5018).
  • Tweaked GenerateResource timestamp-related sleeps and removed a redundant delay to reduce unnecessary wall-clock time while preserving determinism.
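
As an illustration of a tolerant warning assertion, the sketch below goes one step further than the count-only check (at least one warning) described in this PR: it also requires one specific warning and allows only a known set of extra ones. This is a stricter variant shown under assumptions, not what the PR implements; `WarningAssertions`, `AssertExpectedWarning`, and the warning codes passed in are hypothetical inputs supplied by the caller:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class WarningAssertions
{
    // Passes when the required warning is present and every other warning
    // belongs to the tolerated set (for example, environmental noise).
    public static void AssertExpectedWarning(
        IReadOnlyCollection<string> warningCodes,
        string requiredCode,
        ISet<string> toleratedCodes)
    {
        if (warningCodes.Count < 1)
        {
            throw new InvalidOperationException("Expected at least one warning.");
        }

        if (!warningCodes.Contains(requiredCode))
        {
            throw new InvalidOperationException($"Expected warning {requiredCode} was not logged.");
        }

        List<string> unexpected = warningCodes
            .Where(code => code != requiredCode && !toleratedCodes.Contains(code))
            .ToList();

        if (unexpected.Count > 0)
        {
            throw new InvalidOperationException(
                "Unexpected warnings: " + string.Join(", ", unexpected));
        }
    }
}
```

The trade-off is that a count-only check cannot distinguish the expected warning from unrelated ones, while this variant needs the tolerated set to be kept up to date as new sources of noise are discovered.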

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/Tasks.UnitTests/ResourceHandling/GenerateResource_Tests.cs | Adjusts sleeps and file timestamp updates in incremental rebuild tests. |
| src/Tasks.UnitTests/ResourceHandling/GenerateResourceOutOfProc_Tests.cs | Mirrors timestamp/sleep adjustments for out-of-proc resource generation tests. |
| src/Tasks.UnitTests/Exec_Tests.cs | Raises Exec.Timeout and makes warning assertions more robust to environmental noise. |
| src/Tasks.UnitTests/DownloadFile_Tests.cs | Increases cancellation completion wait bound and adds a clearer assertion message. |
| src/MSBuild.UnitTests/MSBuildServer_Tests.cs | Adds bounded waits to avoid hangs and increases shutdown/kill timeouts. |
| src/Build.UnitTests/BackEnd/TaskHostFactory_Tests.cs | Increases timeouts around task host process lifecycle verification. |
| src/Build.UnitTests/BackEnd/TaskBuilder_Tests.cs | Increases synchronization timeout for cancellation-related build test. |

Comments suppressed due to low confidence (1)

src/MSBuild.UnitTests/MSBuildServer_Tests.cs:208

  • This test toggles MSBUILDUSESERVER via Environment.SetEnvironmentVariable, which bypasses TestEnvironment tracking and can leak environment state to subsequent tests if this test fails early (or forgets to restore). Prefer using _env.SetEnvironmentVariable for these toggles (or capture/restore the original value) so the environment is reliably reverted on dispose.
            mre.WaitOne(60_000).ShouldBeTrue("Timed out waiting for marker file creation indicating server is busy.");
            _output.WriteLine("It's OK to go ahead.");

            Environment.SetEnvironmentVariable("MSBUILDUSESERVER", "0");
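
The capture/restore alternative mentioned in the comment above could look roughly like the following. This is a sketch using only `System.Environment`; `EnvironmentScope` and `WithVariable` are hypothetical names, and the `TestEnvironment`-based option is not shown because its API does not appear in this conversation:

```csharp
#nullable enable
using System;

// Hypothetical helper, for illustration only.
public static class EnvironmentScope
{
    // Runs the action with the variable set, then restores the previous value
    // (including "not set", represented by null) even if the action throws.
    public static void WithVariable(string name, string? value, Action action)
    {
        string? original = Environment.GetEnvironmentVariable(name);
        Environment.SetEnvironmentVariable(name, value);
        try
        {
            action();
        }
        finally
        {
            // Passing null deletes the variable, which restores an unset state.
            Environment.SetEnvironmentVariable(name, original);
        }
    }
}
```

A test would wrap the section that needs `MSBUILDUSESERVER` set to `"0"` in `EnvironmentScope.WithVariable(...)`, so an early assertion failure cannot leak the toggled value into later tests in the same process.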

@JanProvaznik JanProvaznik force-pushed the fix-flaky-tests-timeouts branch from 150b90d to dd93207 Compare February 12, 2026 15:56

@JanProvaznik (Member, Author) commented Feb 12, 2026

(running several times, 1 pass)
/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@JanProvaznik (Member, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@JanProvaznik (Member, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@JanProvaznik JanProvaznik marked this pull request as draft February 16, 2026 11:35
- CanceledTasksDoNotLogMSB4181: WaitOne 2s→30s (max bound, no CI impact)
- TaskNodesDieAfterBuild: WaitForExit 3s→15s, retry Sleep 2s→3s
- MSBuildServer: KillTree 1s→10s, bounded infinite mre.WaitOne() to
  60s with assertion, WaitForExit 10s→30s with assertion
- Exec_Tests.Timeout: warning count exact match (1 or 2) changed to
  ≥1 to handle non-deterministic MSB5018 from temp file lock
  contention (documented in dotnet#9176 across 4+ CI failures 2023-2025)
- DownloadFile.CanBeCanceled: Wait 1.5s→10s
- GenerateResource timestamp tests: Sleep 200ms→500ms across 8 sites,
  100ms→500ms for NTFS (1.5s for macOS HFS); removed redundant 1s
  sleep in ForceSomeOutOfDate
- Net CI wall-clock impact: ~+3s across entire test suite
@JanProvaznik JanProvaznik force-pushed the fix-flaky-tests-timeouts branch from dd93207 to 455ca6c Compare February 19, 2026 12:26

@JanProvaznik (Member, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@JanProvaznik (Member, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
