Skip to content

fix(pyroscope.ebpf): Move Pyroscope ebpf metrics registration after component error handling#5540

Merged
marcsanmi merged 2 commits intografana:mainfrom
crbednarz:fix-pyroscope-ebpf-failure-metrics-cleanup
Apr 8, 2026
Merged

fix(pyroscope.ebpf): Move Pyroscope ebpf metrics registration after component error handling#5540
marcsanmi merged 2 commits intografana:mainfrom
crbednarz:fix-pyroscope-ebpf-failure-metrics-cleanup

Conversation

@crbednarz
Copy link
Copy Markdown
Contributor

Brief description of Pull Request

This change moves the metrics registration to the end of the pyroscope.ebpf component construction. This allows for errors to occur during construction without leaving unused metrics registered.

Pull Request Details

This is a fix for #5156. I've written up the specifics of the issue here: #5156 (comment)

In short, during the construction of the pyroscope.ebpf component, metrics get registered for Prometheus. If the construction process fails, those metrics remain registered. Since components may be re-evaluated asynchronously, this would lead to a secondary "duplicate registration" error occasionally printed, distracting from the actual error.

Issue(s) fixed by this Pull Request

Fixes #5156

Notes to the Reviewer

I went with the lowest impact change I could think of, as the issue itself is fairly minor. Since ms wasn't referenced before the final return, moving it (and newMetrics(reg)) lower meant no failures could occur after metrics were registered.

I've also verified the new test fails without the change.

PR Checklist

  • Tests updated

@crbednarz crbednarz requested a review from a team as a code owner February 13, 2026 23:19
@crbednarz crbednarz changed the title fix(pyroscope.ebpf): Move Pyroscope ebpf registration after component error handling fix(pyroscope.ebpf): Move Pyroscope ebpf metrics registration after component error handling Feb 13, 2026
@crbednarz
Copy link
Copy Markdown
Contributor Author

I realize the commit text is missing the word "metrics" in "ebpf metrics registration". I can change it, but it seems like the PR title is what gets used anyway?

@korniltsev-grafanista
Copy link
Copy Markdown
Contributor

I dont think this should be fixed in the pyroscope components. I believe the alloy itself either should not create a component second time, or should not reuse registrer if they for some strange reason have to

@crbednarz
Copy link
Copy Markdown
Contributor Author

crbednarz commented Feb 17, 2026

I dont think this should be fixed in the pyroscope components. I believe the alloy itself either should not create a component second time, or should not reuse registrer if they for some strange reason have to

That was my thought too initially- as it seemed strange to me that a component build error that would end execution would also be retried. Although I'm still learning my way around the code base, so perhaps there is some underlying reason.

That said, I would argue that there are two separate issues at play here:

  1. The component is rebuilt when it (probably) shouldn't be.
  2. ebpf.New(..., reg, ...) registers entries on reg even when it fails.

I definitely want to get a better understanding of the rationale behind the rebuild behavior, but ignoring that for a moment, (2) still feels like a valid issue to address.

Consider this call:

ebpf, err = ebpf.New(logger, reg, "test-ebpf", args)

On success, ebpf, nil is returned and the expected registration of our metrics into reg occurs.
However, in the event of an error, the call returns nil, err. At this point we have no way of knowing that a side effect has occurred, nor do we have a mechanism to undo it. Our only option would be to completely reconstruct reg, which feels less than ideal. Of course the caller wouldn't intuitively know to do that either, as I think there's an implicit expectation for failed constructions to clean up after themselves.

I figure fixing either issue would resolve #5156. The rebuild issue is definitely a curious one, but I suspect it'll involve a more complex solution as the calls are spread across multiple threads and have a much wider impact.

I'm curious what your thoughts are here?

@korniltsev-grafanista
Copy link
Copy Markdown
Contributor

korniltsev-grafanista commented Apr 6, 2026

Appologise for late response.

The component is rebuilt when it (probably) shouldn't be.

I am still thinking the component either should not be recreated, or the registry should be a pristine one.

ebpf.New(..., reg, ...) registers entries on reg even when it fails.

Again, I beleive this is a problem with alloy. If the contract is that we do not or register any metrics unless the component is created with no error - it should be enforced through code or lint and documented

I will close this for now as I believe this should be fixed in a different way. BTW, it's not clear why you want this to be fixed. If you still believe you want this to land - I suggest you to work with @grafana/grafana-alloy-maintainers and the rest of @grafana/grafana-alloy-profiling-maintainers

@crbednarz
Copy link
Copy Markdown
Contributor Author

I will close this for now as I believe this should be fixed in a different way. BTW, it's not clear why you want this to be fixed. If you still believe you want this to land - I suggest you to work with @grafana/grafana-alloy-maintainers and the rest of @grafana/grafana-alloy-profiling-maintainers

That's honestly a bit disappointing to hear.

For context, I don't have a specific reason to fix this particular issue. I was actually just looking to contribute periodically, and this seemed like a good jumping off point to better understand the code base and surrounding processes. (Which is why the write-ups for this probably exceeds what the issue merits)

Though at this point I'm more confused now than before. This might not resolve the linked issue, but it still felt worthwhile and a net positive.

I'd really like to continue making an effort here, but I'm not sure how to even loop in the groups you've mentioned, nor is it clear to me where the appropriate place for this discussion should have been, given the small size of the change.

Would you be able to help loop in the appropriate party(s) here? Or clarify where I should be discussing this? Any help is appreciated.

@korniltsev-grafanista korniltsev-grafanista requested a review from a team April 6, 2026 23:57
@korniltsev-grafanista
Copy link
Copy Markdown
Contributor

korniltsev-grafanista commented Apr 7, 2026

That's honestly a bit disappointing to hear.

I understand and appologise.

and this seemed like a good jumping off point

Unfortunately this issue seem to be a controversial one. I believe this should be fixed in alloy, alloy seem to believe components have to "fix" and employ these ugly easy to forget and hard to deploy for 3party code util.MustRegisterOrGet or some thing else , for example your solution with reordering of metric.

better understand the code base and surrounding processes

I would start with understanding why the component is created second time if it failed.

but it still felt worthwhile and a net positive.

This looks like hiding a problem to me rather than fixing.

but I'm not sure how to even loop in the groups you've mentioned

I would expect tagging them here should be enough. You can also try your luck in grafana community slack in pyroscope and/or alloy channels.

Would you be able to help loop in the appropriate party(s) here?

I've requested review from @grafana/grafana-alloy-maintainers , other than that I recommend you to try the slack way.

OK, maybe closing the PR was too much, I just wanted to give at least some feedback instead of permanent silence. lets reopen and you seek for other approver. I will not block or rever this "fix", just do not expect an approve from me.

@marcsanmi
Copy link
Copy Markdown
Contributor

Hey @crbednarz, thanks for the thorough write-up and patience here.

I agree with @korniltsev-grafanista that the root cause lives in the alloy framework — ideally a failed Build() should not leave the registerer in a state that breaks subsequent retries.

That said, I do think your reordering is a reasonable defensive fix

I'm on the pyroscope/profiling team as well, and I'd like to wait for the @grafana/grafana-alloy-maintainers to weigh in before we move forward. Specifically, it would be good to hear whether:

  • The framework intends to guarantee a fresh registerer on component rebuild, or
  • Components are expected to handle this themselves

@kalleep
Copy link
Copy Markdown
Contributor

kalleep commented Apr 7, 2026

AFAIK there is nothing in place today to protect against it in alloy. But I agree that we should handle it inside the alloy runtime instead of having to handle this at component level.

Will create a issue for it, in the meantime I don't see the harm of merging this one

Copy link
Copy Markdown
Contributor

@marcsanmi marcsanmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the contribution!

@marcsanmi marcsanmi merged commit a3c57c0 into grafana:main Apr 8, 2026
46 checks passed
@crbednarz
Copy link
Copy Markdown
Contributor Author

Thanks everyone. I really appreciate you all taking the time to look over the change.

Hopefully I'll have some time soon to look into the component rebuild issue as well, as it's definitely interesting.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Apr 23, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pyrocope.ebpf confusing panic

4 participants