@mergify mergify bot commented Oct 7, 2025

What does this PR do?

Uses a random port for the otel collector's Prometheus endpoint. Currently, this port is hardcoded to the default.

Normally, this port needs to be known during config generation (so we can add the right self-monitoring configuration), so it needs to be determined before the coordinator starts. In this PR, we instead put it into an environment variable and use the otel collector's ability to load configuration values from environment variables. As a result, we can defer determining the actual port until as late as possible, and even use a different port on each configuration reload, which allows us to recover from port binding conflicts.

This becomes a bit awkward with the embedded collector, where we need to call SetEnv, which is always an anti-pattern. But we expect to retire that mode of execution eventually, and it's no longer the default, so it should be fine.

In the process, I'm also allowing both the metrics port and the healthcheck port to be passed into the otel manager as parameters. This is preparation for making them configurable in a follow-up.
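When two ports (metrics and healthcheck) are picked at random, they must not collide with each other. One way to get that, shown here as a sketch with a hypothetical helper `findUniquePorts` rather than the function used in this PR, is to keep every listener open until all ports have been assigned, so the OS cannot hand out the same port twice.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// findUniquePorts returns count distinct free TCP ports on localhost.
// All listeners stay open until the function returns, which guarantees
// that the OS assigns a different port to each one.
func findUniquePorts(count int) ([]int, error) {
	listeners := make([]net.Listener, 0, count)
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()

	ports := make([]int, 0, count)
	for i := 0; i < count; i++ {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return nil, fmt.Errorf("allocating port %d of %d: %w", i+1, count, err)
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := findUniquePorts(2)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In this sketch the first port is used for metrics, the second for the healthcheck.
	fmt.Println("metrics port:", ports[0])
	fmt.Println("healthcheck port:", ports[1])
}
```

The resulting values could then be passed to the component that starts the collector as plain parameters, which keeps the door open for making them user-configurable later.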

Why is it important?

The port shouldn't be hardcoded. In the event of a conflict, the otel collector can't start.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

Build the agent locally, run it with otel self-monitoring, generate diagnostics, then check the configuration.

Related issues


This is an automatic backport of pull request #10240 done by [Mergify](https://mergify.com).

* Use a random port for otel collector monitoring endpoint

* mage notice

* Use an env variable

* Fix linter warnings

* Fix random port determination for the embedded otel collector

* Drop the ports functions from the utils package

* fixup! Fix random port determination for the embedded otel collector

* Ensure no port conflicts

* Clean up port assignment

* Verify that returned ports are unique

* Add port conflict test

* Fix docstring typo

* More comments

* Add comments explaining the port conflict test

* Update internal/pkg/otel/manager/execution_subprocess.go

Co-authored-by: Blake Rouse <[email protected]>

---------

Co-authored-by: Blake Rouse <[email protected]>
(cherry picked from commit 5cb8c31)
@mergify mergify bot requested a review from a team as a code owner October 7, 2025 19:10
@mergify mergify bot added the backport label Oct 7, 2025
@mergify mergify bot requested review from straistaru and ycombinator and removed request for a team October 7, 2025 19:10
@github-actions github-actions bot added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team skip-changelog labels Oct 7, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine

💛 Build succeeded, but was flaky

Failed CI Steps

cc @swiatekm

@swiatekm swiatekm merged commit dc1ff4c into 9.2 Oct 8, 2025
24 checks passed
@swiatekm swiatekm deleted the mergify/bp/9.2/pr-10240 branch October 8, 2025 10:17