Commit 9d2d417

Add troubleshooting section for Elastic Agent upgrade failing on Windows (elastic#1750)
This PR adds a new section in Ingest tools > Fleet and Elastic Agent > [Common problems](https://www.elastic.co/docs/troubleshoot/ingest/fleet/common-problems#agent-kustomize-manifest) for troubleshooting an Elastic Agent upgrade failure on Windows caused by Windows desktop heap exhaustion.

Closes [elastic#1775](elastic/ingest-docs#1775)

---------

Co-authored-by: Karen Metts <[email protected]>
1 parent 7f48288 commit 9d2d417

1 file changed: +45 -0 lines changed

troubleshoot/ingest/fleet/common-problems.md

Lines changed: 45 additions & 0 deletions
@@ -59,6 +59,7 @@ Find troubleshooting information for {{fleet}}, {{fleet-server}}, and {{agent}}
* [Hosted {{agent}} is offline](#hosted-agent-offline)
* [APM & {{fleet}} fails to upgrade to 8.x on {{ecloud}}](#hosted-agent-8-x-upgrade-fail)
* [Air-gapped {{agent}} upgrade can fail due to an inaccessible PGP key](#pgp-key-download-fail)
* [{{agent}} upgrade fails on Windows with exit status `0xc0000142`](#agent-upgrade-fail-windows)
* [{{agents}} are unable to connect after removing the {{fleet-server}} integration](#fleet-server-integration-removed)
* [{{agent}} Out of Memory errors on Kubernetes](#agent-oom-k8s)
* [Error when running {{agent}} commands with `sudo`](#agent-sudo-error)
@@ -652,6 +653,50 @@ curl -u elastic:<password> --request POST \
In versions 8.9 and above, an {{agent}} upgrade may fail when the upgrader can’t access a PGP key required to verify the binary signature. For details and a workaround, refer to the [PGP key download fails in an air-gapped environment](https://www.elastic.co/guide/en/fleet/8.9/release-notes-8.9.0.html#known-issue-3375) known issue in the version 8.9.0 Release Notes or to the [workaround documentation](https://github.com/elastic/elastic-agent/blob/main/docs/pgp-workaround.md) in the elastic-agent GitHub repository.


## {{agent}} upgrade fails on Windows with exit status `0xc0000142` [agent-upgrade-fail-windows]

During an {{agent}} upgrade on Windows, {{agent}} spawns a "watcher" process that monitors the upgrade process. Windows attempts to create a temporary console for this process. If Windows can't create this console, the watcher process initialization fails with error code `0xc0000142` (`STATUS_DLL_INIT_FAILED`), resulting in an upgrade failure. {{agent}} logs this error at the `info` level.

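To confirm that a failed upgrade matches this problem, you can search the {{agent}} log files for the exit status. The following is a minimal sketch, not an official tool: `Find-WatcherInitFailure` is a hypothetical helper name, and both the install directory and the `.ndjson` log file extension are assumptions that you may need to adjust for your installation.

```powershell
# Hypothetical helper: search agent log files for the 0xc0000142 exit status.
function Find-WatcherInitFailure {
    param(
        # Assumption: default Windows install directory; adjust as needed.
        [string]$AgentRoot = 'C:\Program Files\Elastic\Agent'
    )

    # Assumption: log files use the .ndjson extension.
    Get-ChildItem -Path $AgentRoot -Recurse -File -Filter '*.ndjson' -ErrorAction SilentlyContinue |
        Select-String -Pattern '0xc0000142' -SimpleMatch |
        Select-Object Path, LineNumber, Line
}

# Show up to ten matching log lines.
Find-WatcherInitFailure | Select-Object -First 10
```

Because the error is logged at the `info` level, a search that is limited to warnings and errors can miss it.
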
The error is caused by Windows [desktop heap exhaustion](https://learn.microsoft.com/en-us/troubleshoot/windows-server/performance/desktop-heap-limitation-out-of-memory). When {{agent}} runs as a [Windows service application](https://learn.microsoft.com/en-us/dotnet/framework/windows-services/introduction-to-windows-service-applications), it uses the service desktop and shares the desktop heap with other running services. If another service process uses windowing resources but fails to release them, it can exhaust the desktop heap and affect {{agent}}.

:::{note}
Interactively run instances of `elastic-agent.exe` are not subject to this limitation. Only instances running as a service are potentially affected.
:::

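To check which case applies to a host, you can look at the session that each `elastic-agent` process runs in. This is a sketch under two assumptions: the process name is `elastic-agent`, and a process in session 0 is running as a service (session 0 is where Windows hosts services, but the check is a heuristic). `Get-AgentSessionInfo` is a hypothetical helper name.

```powershell
# Hypothetical helper: list agent processes and flag those in session 0.
function Get-AgentSessionInfo {
    param(
        # Assumption: process name derived from elastic-agent.exe.
        [string]$Name = 'elastic-agent'
    )

    Get-Process -Name $Name -ErrorAction SilentlyContinue |
        Select-Object Id, SessionId,
            @{ Name = 'RunsInServiceSession'; Expression = { $_.SessionId -eq 0 } }
}

Get-AgentSessionInfo
```

The function returns nothing if no matching process is running.
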
To resolve the issue, you can try the following:

- **Upgrade {{agent}} immediately after a system reboot**

A system reboot destroys and recreates the desktop heap, resolving any prior exhaustion.
Because many memory leaks are gradual, upgrading {{agent}} immediately after a reboot may allow the upgrade to complete before the leaking application exhausts the desktop heap.

:::{tip}
A [cold startup](https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/distinguishing-fast-startup-from-wake-from-hibernation) resets kernel memory, but a fast startup or a wake from hibernation does not.
A regular reboot (for example, `shutdown /r /t 0`) results in a cold startup, and resets the desktop heap.
:::

- **Update third-party service applications**

Because standard Windows tools such as Task Manager and Process Explorer do not attribute desktop heap usage to individual applications, consider updating all third-party processes that run as services. To list these applications, use the following PowerShell command:

```powershell
PS C:\> Get-Process | Where {$_.SI -eq 0} | Where {$_.MainModule.FileVersionInfo.ProductName -and (-not (($_.MainModule.FileVersionInfo.CompanyName -eq "Microsoft Corporation") -and ($_.MainModule.FileVersionInfo.ProductName -like "*Windows*"))) } | ForEach-Object { $_.MainModule.FileVersionInfo.ProductName + ' - ' + $_.Path }
```

You can then install any available updates from the vendors of the listed applications.

- **Terminate or uninstall third-party service applications**

You can try terminating or uninstalling non-critical third-party service applications before upgrading {{agent}}.
Terminating a process releases its desktop heap resources.

Note that the {{agent}} upgrade process does not require a significant amount of desktop heap resources, so a successful {{agent}} upgrade following the termination or uninstallation of a service application does not necessarily mean that the application was exhausting the desktop heap.

- **Resize the desktop heap**

As a short-term solution, follow the steps described in the [Microsoft guide](https://learn.microsoft.com/en-us/troubleshoot/windows-server/performance/desktop-heap-limitation-out-of-memory) to increase the size of the desktop heap. Note that if a service application is causing a memory leak, increasing the size of the desktop heap may only postpone the desktop heap exhaustion.

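Before resizing the desktop heap, you can inspect its current size without changing anything. The following read-only sketch assumes the `SharedSection` setting described in the Microsoft guide, which is stored in the `Windows` value of the `SubSystems` registry key. `Get-SharedSectionSize` is a hypothetical helper name, not a built-in cmdlet.

```powershell
# Hypothetical helper: extract the three SharedSection sizes (in KB)
# from the Windows subsystem start-up string.
function Get-SharedSectionSize {
    param([Parameter(Mandatory)][string]$SubsystemValue)

    if ($SubsystemValue -match 'SharedSection=(\d+),(\d+),(\d+)') {
        [pscustomobject]@{
            SystemWideKB     = [int]$Matches[1]
            InteractiveKB    = [int]$Matches[2]
            NonInteractiveKB = [int]$Matches[3]  # desktop heap shared by services
        }
    }
}

# Read-only usage: runs only where the registry key exists (Windows hosts).
$key = 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems'
if (Test-Path $key) {
    $value = (Get-ItemProperty -Path $key).Windows
    if ($value) { Get-SharedSectionSize -SubsystemValue $value }
}
```

The third number is the size, in KB, of the desktop heap for non-interactive window stations, which is the heap that service processes share.
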
## {{agents}} are unable to connect after removing the {{fleet-server}} integration [fleet-server-integration-removed]

When you use {{fleet}}-managed {{agent}}, at least one {{agent}} needs to be running the [{{fleet-server}} integration](https://docs.elastic.co/integrations/fleet_server). If the policy containing this integration is accidentally removed from {{agent}}, the other agents can no longer be managed. However, the {{agents}} will continue to send data to their configured output.
