Skip to content

Detect and correct worker node over-provisioning#13220

Draft
Copilot wants to merge 44 commits intomainfrom
copilot/detect-correct-over-provisioning
Draft

Detect and correct worker node over-provisioning#13220
Copilot wants to merge 44 commits intomainfrom
copilot/detect-correct-over-provisioning

Conversation

Copy link
Contributor

Copilot AI commented Feb 6, 2026

Work item (Internal use):

Summary

When multiple MSBuild instances run concurrently, each spawns worker nodes up to /m limit (defaults to NUM_PROCS), causing over-provisioning. 25+ nodes can linger for ~15 minutes consuming resources.

Modified NodeProviderOutOfProcBase.ShutdownConnectedNodes to:

  • Enumerate system-wide MSBuild processes when marking nodes idle
  • Apply threshold (NUM_PROCS/2 for worker nodes)
  • Send NodeBuildComplete(PrepareForReuse=false) to excess nodes, keeping only those needed to meet threshold

Implementation:

  • DetermineNodesForReuse() - Core logic for selective reuse based on system-wide count
  • GetNodeReuseThreshold() - Virtual method returning threshold (default: max(1, cores/2))
  • CountSystemWideActiveNodes() - Virtual method using improved node detection from PR Improve cross-platform node discovery for reuse with NodeMode filtering #13256
    • Filters by NodeMode.OutOfProcNode for accurate worker node counting
    • Automatically handles dotnet processes and MSBuild.dll filtering
    • Leverages cross-platform process command line retrieval (Windows/macOS/Linux)

Integration with PR #13256:
This PR is rebased on top of PR #13256 ("Improve cross-platform node discovery for reuse with NodeMode filtering") and leverages its improved node detection logic for more accurate and consistent worker node identification across platforms.

Server and RAR nodes use separate lifecycle management and are not affected by this change.

Customer Impact

Reduces resource consumption in scenarios with concurrent builds (DevKit, LLM agents, CI/CD pipelines). System will maintain reasonable node count instead of accumulating 25+ lingering nodes.

Improved accuracy in node counting through NodeMode-based filtering ensures only actual worker nodes are counted, preventing false positives from main MSBuild processes or task hosts.

Regression?

  • Yes
  • No

Behavior change only when enableReuse=true (default). Conservative implementation with try/catch fallback if process enumeration fails.

Testing

8 unit tests added covering:

  • Reuse disabled, threshold=0, under/at/over threshold scenarios
  • Edge cases (single node, partial termination)
  • Default threshold calculation

All existing tests pass. Tests remain valid with improved node detection as they use mocked implementations.

Risk

Low. Changes isolated to node shutdown path when reuse is enabled. Failure to enumerate processes defaults to keeping all nodes (current behavior). Diagnostic traces added for troubleshooting.

The integration with PR #13256's improved node detection reduces risk by providing more accurate node identification across platforms and better handling of dotnet processes.

Original prompt

This section details on the original issue you should resolve

<issue_title>Worker Nodes should attempt to detect and correct over-provisioning</issue_title>
<issue_description>### Summary

When multiple instances of MSBuild are run concurrently from the command line, each instance tries to create and claim a number of worker nodes up to the /m limit (which is NUMPROCS for dotnet by default).

This can lead to entirely too many nodes existing on the host machine, lingering for the configured timeout, which is ~15m by default.

These nodes suck up machine resources and look bad - the engine should work to minimize the number of active nodes past an expected threshold to preserve system resources.

Background and Motivation

As more tools, like DevKit, LLM Agents, and human users, delegate their build and inner-loop experiences to the dotnet CLI, it becomes more and more likely that collisions like the above will occur. I was looking at node trace logs this morning where more than 25 worker nodes were lingering on the machine. Managing these nodes is a pain for users of all kinds - you either have to carefully sequence your build operations, apply /nodereuse:false which slows down subsequent builds, or dangerously and likely incompletely terminate processes.

Proposed Feature

Instead, the nodes themselves should clean themselves up.

When a build completes and an out of proc node marks itself idle, it should attempt to handshake with its siblings. If there are a number of active nodes of the same type equal to a threshold, the idling node should terminate.

Proposed thresholds:

  • Worker nodes: NUM_PROCS/2
  • Server Nodes: > 0
  • RAR-as-a-service-node: >0

Alternative Designs

No response</issue_description>

Comments on the Issue (you are @copilot in this section)


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

@dotnet-policy-service
Copy link
Contributor

Hello @@copilot, I noticed that you’re changing an .swr file or any file under src/Package/MSBuild.VSSetup.. Please make sure to validate this change by an experimental VS insertion. This is accomplished by pushing to an exp/* branch, which requires write permissions to this repo.

Copilot AI and others added 4 commits February 6, 2026 23:56
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Copilot AI changed the title [WIP] Add detection and correction for worker node over-provisioning Detect and correct worker node over-provisioning Feb 7, 2026
Copilot AI requested a review from baronfel February 7, 2026 00:14
@baronfel
Copy link
Member

@copilot rebase all of this work on #13256, and use the improved node-detection logic from that PR to handle the reaping more consistently.

@baronfel baronfel added User Experience Area: Engine Issues impacting the core execution of targets and tasks. labels Feb 17, 2026
AR-May and others added 14 commits February 17, 2026 19:22
### Context
It is quite often that `Path.GetFullPath` is used in the tasks code. For
enlightening we would want to replace it with our own API. The candidate
was `TaskEnvironment.GetAbsolutePath` for replacing.
 
However, they do not do the same set of actions.
`TaskEnvironment.GetAbsolutePath` only absolutizes the path but does not
resolve possible "." and ".." in the path or normalize path separators
on Windows.
For some scenarios this is critical, for some that does not matter. We
should provide an option for fully copy `Path.GetFullPath` behavior

### Changes Made
- Added `AbsolutePath.GetCanonicalForm()`

### Testing
Added unit tests
### Context

Thread-safe tasks must avoid global process state (working directory,
environment variables). `WriteCodeFragment` uses relative paths resolved
against the current working directory, making it unsafe for
multithreaded execution.

### Changes Made

- Add `[MSBuildMultiThreadableTask]` attribute and implement
`IMultiThreadableTask`
- Use `TaskEnvironment.GetAbsolutePath()` to resolve paths before file
operations:
  ```csharp
  outputFilePath = TaskEnvironment.GetAbsolutePath(OutputFile.ItemSpec);
  ```
- Pass absolute paths to `FileUtilities.GetTemporaryFile()`,
`FileUtilities.EnsureDirectoryExists()`, `File.WriteAllText()`, and
`LockCheck.GetLockedFileMessage()`
- Update all test instantiations to set `TaskEnvironment =
TaskEnvironmentHelper.CreateForTest()`
- Preserve behavior: when `OutputDirectory` is relative and `OutputFile`
is not specified, the resulting `OutputFile` output remains relative
(matching original behavior)

### Testing

All 52 WriteCodeFragment tests pass, including:
- A new test `RelativeOutputDirectoryProducesRelativeOutputFile` that
verifies when `OutputDirectory` is relative and `OutputFile` is not
passed, the resulting `OutputFile` is also relative
- The `CommentIsInvariantCulture` tests (added in main) now have
`TaskEnvironment` initialized

### Notes

None.

<!-- START COPILOT CODING AGENT SUFFIX -->



<!-- START COPILOT ORIGINAL PROMPT -->



<details>

<summary>Original prompt</summary>

> Migrate the WriteCodeFragment task to use TaskEnvironment API.
> Make sure to preserve correctness, performance and that the tests pass


</details>



<!-- START COPILOT CODING AGENT TIPS -->
---

✨ Let Copilot coding agent [set things up for
you](https://github.com/dotnet/msbuild/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot)
— coding agent works faster and does higher quality work when set up for
your repo.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jan Provazník <janprovaznik@microsoft.com>
Co-authored-by: jeffhandley <1031940+jeffhandley@users.noreply.github.com>
This pull request updates the following dependencies

[marker]: <> (Begin:369b758c-dad8-4fea-810a-64fb6b0308e9)
## From https://github.com/dotnet/arcade
- **Subscription**:
[369b758c-dad8-4fea-810a-64fb6b0308e9](https://maestro.dot.net/subscriptions?search=369b758c-dad8-4fea-810a-64fb6b0308e9)
- **Build**:
[20260206.4](https://dev.azure.com/dnceng/internal/_build/results?buildId=2897563)
([300694](https://maestro.dot.net/channel/8394/github:dotnet:arcade/build/300694))
- **Date Produced**: February 6, 2026 10:35:32 PM UTC
- **Commit**:
[58d75363f4b03e408aa2502e0db546173be722d4](dotnet/arcade@58d7536)
- **Branch**:
[release/10.0](https://github.com/dotnet/arcade/tree/release/10.0)

[DependencyUpdate]: <> (Begin)

- **Dependency Updates**:
  - From [10.0.0-beta.26080.4 to 10.0.0-beta.26106.4][1]
     - Microsoft.DotNet.Arcade.Sdk
     - Microsoft.DotNet.XUnitExtensions

[1]: dotnet/arcade@b0f891c...58d7536

[DependencyUpdate]: <> (End)


[marker]: <> (End:369b758c-dad8-4fea-810a-64fb6b0308e9)

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>
This pull request updates the following dependencies

[marker]: <> (Begin:8a4332aa-2543-4c51-b941-e73f31e22328)
## From https://github.com/dotnet/roslyn
- **Subscription**:
[8a4332aa-2543-4c51-b941-e73f31e22328](https://maestro.dot.net/subscriptions?search=8a4332aa-2543-4c51-b941-e73f31e22328)
- **Build**:
[20260205.1](https://dev.azure.com/dnceng/internal/_build/results?buildId=2896280)
([300424](https://maestro.dot.net/channel/548/github:dotnet:roslyn/build/300424))
- **Date Produced**: February 5, 2026 12:45:29 PM UTC
- **Commit**:
[da86ec644af82dcb8575ff0c22ac6afdcd6b5066](dotnet/roslyn@da86ec6)
- **Branch**: [main](https://github.com/dotnet/roslyn/tree/main)

[DependencyUpdate]: <> (Begin)

- **Dependency Updates**:
  - From [5.5.0-2.26081.2 to 5.5.0-2.26105.1][1]
     - Microsoft.Net.Compilers.Toolset

[1]: dotnet/roslyn@265653e...da86ec6

[DependencyUpdate]: <> (End)


[marker]: <> (End:8a4332aa-2543-4c51-b941-e73f31e22328)

Co-authored-by: dotnet-maestro[bot] <dotnet-maestro[bot]@users.noreply.github.com>
When using MSBuild's multithreaded mode (-mt / /maxcpucount), the
%(RecursiveDir) built-in metadata was lost when items crossed process
boundaries to the TaskHost. This caused NuGet pack to flatten directory
structures (e.g., build\wix\bundle\bundle.wxs became build\bundle.wxs).

Root cause: TaskParameterTaskItem only copied custom metadata when
wrapping ITaskItem for TaskHost serialization. RecursiveDir is a
built-in metadata that cannot be derived from just the item spec - it
requires the original wildcard pattern
(_includeBeforeWildcardExpansionEscaped).

Fix: Explicitly copy RecursiveDir to custom metadata in
TaskParameterTaskItem constructor before serialization, so it survives
the cross-process boundary.

Also includes related fixes to ProjectItemInstance.cs:
- Preserve _projectDirectory in copy constructor and serialization
- Fix CreateItem(string, string, string) to use
includeBeforeWildcardExpansionEscaped

related: #3121
fixes #13140
…: Build ID 13255870 (#13208)

This is the pull request automatically created by the OneLocBuild task
in the build process to check-in localized files generated based upon
translation source files (.lcl files) handed-back from the downstream
localization pipeline. If there are issues in translations, visit
https://aka.ms/icxLocBug and log bugs for fixes. The OneLocBuild wiki is
https://aka.ms/onelocbuild and the localization process in general is
documented at https://aka.ms/AllAboutLoc.

---------

Co-authored-by: Jan Provazník <janprovaznik@microsoft.com>
…ings (#13148)

### Context
The team was discussing nodemodes in sync this morning and we were all
getting confused. This PR gives names to the existing modes and has us
use them consistently across the codebase for clarity.

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: Jan Provazník <janprovaznik@microsoft.com>
This partially reverts commit bc0dc67.

Dependabot does now support `Directory.Packages.props`, but still seems
to require a project file to be present:

```
updater | 2026/02/09 01:52:58 INFO Discovering build files in workspace [/home/dependabot/dependabot-updater/repo/eng/dependabot].
updater | 2026/02/09 01:52:58 INFO   No dotnet-tools.json file found.
updater | 2026/02/09 01:52:58 INFO   Discovered [global.json] file.
updater | 2026/02/09 01:52:58 INFO   Discovering projects beneath [eng/dependabot].
updater | 2026/02/09 01:52:58 INFO   No project files found.
updater | 2026/02/09 01:52:58 INFO Discovery complete.
```

From
https://github.com/dotnet/msbuild/actions/runs/21809650562/job/62919159519#step:3:174
In our environment, the hosted pools are swamped and we often queue for
half an hour or more waiting for a machine. Switch to machines managed
by the .NET team in the hopes that they'll have more capacity and thus
lower PR turnaround time.
I got annoyed when reviewing some of the tests Copilot generated in a
few recent PRs, so let's try to formalize some of the guidance it wasn't
picking up on.
…ns (#13235)

Add `MockLogger(_output)` to `TransactionalModePreservesAllData` so
build diagnostics appear in test output. Update `tests.instructions.md`
to require MockLogger for tests that build via ProjectCollection
or BuildManager.
YuliiaKovalova and others added 25 commits February 17, 2026 19:22
…: Build ID 13287778 (#13236)

This is the pull request automatically created by the OneLocBuild task
in the build process to check-in localized files generated based upon
translation source files (.lcl files) handed-back from the downstream
localization pipeline. If there are issues in translations, visit
https://aka.ms/icxLocBug and log bugs for fixes. The OneLocBuild wiki is
https://aka.ms/onelocbuild and the localization process in general is
documented at https://aka.ms/AllAboutLoc.

Co-authored-by: YuliiaKovalova <95473390+YuliiaKovalova@users.noreply.github.com>
Fixes #

### Context
On traces we could see about 8 mb of allocations due to add nonempty
strings. When taking it look at the code we could see alot of adding to
a list which was constantly resizing.

![image](https://github.com/user-attachments/assets/5ee67e97-09d0-4d87-9605-8f7ed11955e3)


### Changes Made
* Instead of defaulting a list of default size we use context to
initialize the size to be something closer to what we would like. So
that the capacity only needs to be set once throughout its use.
* Also reuse the same list throughout instead of using a new one every
time.
### Testing

After the changes I took a trace and it shows that this path is has been
reduced by 5mb.

![image](https://github.com/user-attachments/assets/76f6f8af-a680-47ff-a6b7-6464368ee574)



### Notes

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
These scripts are missing the executable bit compared to their
counterparts in dotnet/arcade:

- eng/common/native/install-dependencies.sh
- eng/common/sdk-task.sh
- eng/common/cross/tizen-build-rootfs.sh
- eng/common/cross/install-debs.py
- eng/common/cross/tizen-fetch.sh
- eng/common/vmr-sync.ps1
- eng/common/dotnet.sh
- eng/common/SetupNugetSources.sh
- eng/common/vmr-sync.sh

This matches the permissions set in arcade.
Coordinated change with Visual Studio.

---------

Co-authored-by: Rainer Sigwald <raines@microsoft.com>
### Context
`RemoveRelativeSegments` does the same as `AbsolutePath.GetCanonicalForm()`. There is no need for code duplication.

### Testing
Unit tests
…iscovery

Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
…for parsing

Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
…al tests

Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
…posal

Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
- Rebased all commits on top of PR #13256
- Updated CountSystemWideActiveNodes to use improved node detection
- Now filters by NodeMode.OutOfProcNode for accurate worker node counting
- Automatically handles dotnet processes and MSBuild.dll filtering
- Leverages cross-platform process command line retrieval

Co-authored-by: baronfel <573979+baronfel@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Feb 17, 2026

@copilot rebase all of this work on #13256, and use the improved node-detection logic from that PR to handle the reaping more consistently.

Rebased on PR #13256 and integrated the improved node detection logic.

Key improvements in commit f6e3056:

The over-provisioning detection now uses the more robust node discovery mechanism, providing better consistency across platforms and more accurate identification of worker nodes vs. main MSBuild processes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Engine Issues impacting the core execution of targets and tasks. User Experience

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Worker Nodes should attempt to detect and correct over-provisioning

Comments