fix(pyroscope.ebpf): Move Pyroscope ebpf metrics registration after component error handling#5540
Conversation
|
I realize the commit text is missing the word "metrics" in "ebpf metrics registration". I can change it, but it seems like the PR title is what gets used anyway? |
|
I dont think this should be fixed in the pyroscope components. I believe the alloy itself either should not create a component second time, or should not reuse registrer if they for some strange reason have to |
That was my thought too initially- as it seemed strange to me that a component build error that would end execution would also be retried. Although I'm still learning my way around the code base, so perhaps there is some underlying reason. That said, I would argue that there are two separate issues at play here:
I definitely want to get a better understanding of the rationale behind the rebuild behavior, but ignoring that for a moment, (2) still feels like a valid issue to address. Consider this call: ebpf, err = ebpf.New(logger, reg, "test-ebpf", args)On success, I figure fixing either issue would resolve #5156. The rebuild issue is definitely a curious one, but I suspect it'll involve a more complex solution as the calls are spread across multiple threads and have a much wider impact. I'm curious what your thoughts are here? |
|
Appologise for late response.
I am still thinking the component either should not be recreated, or the registry should be a pristine one.
Again, I beleive this is a problem with alloy. If the contract is that we do not or register any metrics unless the component is created with no error - it should be enforced through code or lint and documented I will close this for now as I believe this should be fixed in a different way. BTW, it's not clear why you want this to be fixed. If you still believe you want this to land - I suggest you to work with @grafana/grafana-alloy-maintainers and the rest of @grafana/grafana-alloy-profiling-maintainers |
That's honestly a bit disappointing to hear. For context, I don't have a specific reason to fix this particular issue. I was actually just looking to contribute periodically, and this seemed like a good jumping off point to better understand the code base and surrounding processes. (Which is why the write-ups for this probably exceeds what the issue merits) Though at this point I'm more confused now than before. This might not resolve the linked issue, but it still felt worthwhile and a net positive. I'd really like to continue making an effort here, but I'm not sure how to even loop in the groups you've mentioned, nor is it clear to me where the appropriate place for this discussion should have been, given the small size of the change. Would you be able to help loop in the appropriate party(s) here? Or clarify where I should be discussing this? Any help is appreciated. |
I understand and appologise.
Unfortunately this issue seem to be a controversial one. I believe this should be fixed in alloy, alloy seem to believe components have to "fix" and employ these ugly easy to forget and hard to deploy for 3party code
I would start with understanding why the component is created second time if it failed.
This looks like hiding a problem to me rather than fixing.
I would expect tagging them here should be enough. You can also try your luck in grafana community slack in pyroscope and/or alloy channels.
I've requested review from @grafana/grafana-alloy-maintainers , other than that I recommend you to try the slack way. OK, maybe closing the PR was too much, I just wanted to give at least some feedback instead of permanent silence. lets reopen and you seek for other approver. I will not block or rever this "fix", just do not expect an approve from me. |
|
Hey @crbednarz, thanks for the thorough write-up and patience here. I agree with @korniltsev-grafanista that the root cause lives in the alloy framework — ideally a failed That said, I do think your reordering is a reasonable defensive fix I'm on the pyroscope/profiling team as well, and I'd like to wait for the @grafana/grafana-alloy-maintainers to weigh in before we move forward. Specifically, it would be good to hear whether:
|
|
AFAIK there is nothing in place today to protect against it in alloy. But I agree that we should handle it inside the alloy runtime instead of having to handle this at component level. Will create a issue for it, in the meantime I don't see the harm of merging this one |
marcsanmi
left a comment
There was a problem hiding this comment.
LGTM, thanks for the contribution!
|
Thanks everyone. I really appreciate you all taking the time to look over the change. Hopefully I'll have some time soon to look into the component rebuild issue as well, as it's definitely interesting. |
Brief description of Pull Request
This change moves the metrics registration to the end of the pyroscope.ebpf component construction. This allows for errors to occur during construction without leaving unused metrics registered.
Pull Request Details
This is a fix for #5156. I've written up the specifics of the issue here: #5156 (comment)
In short, during the construction of the pyroscope.ebpf component, metrics get registered for Prometheus. If the construction process fails, those metrics remain registered. Since components may be re-evaluated asynchronously, this would lead to a secondary "duplicate registration" error occasionally printed, distracting from the actual error.
Issue(s) fixed by this Pull Request
Fixes #5156
Notes to the Reviewer
I went with the lowest impact change I could think of, as the issue itself is fairly minor. Since
mswasn't referenced before the final return, moving it (andnewMetrics(reg)) lower meant no failures could occur after metrics were registered.I've also verified the new test fails without the change.
PR Checklist