-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: gracefully handle process metrics registration failure on unsupported OSes #14319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #14319 +/- ##
==========================================
- Coverage 92.12% 92.11% -0.01%
==========================================
Files 668 669 +1
Lines 41377 41381 +4
==========================================
+ Hits 38117 38120 +3
Misses 2224 2224
- Partials 1036 1037 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Merging this PR will degrade performance by 28.88%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ❌ | BenchmarkLogsToProto2k |
44.9 µs | 63.1 µs | -28.88% |
| ❌ | BenchmarkTraceSizeBytes |
315.1 µs | 440.6 µs | -28.49% |
| ❌ | BenchmarkProfilesToProto |
1.2 µs | 1.7 µs | -27.07% |
| ❌ | BenchmarkTracesToProto2k |
70.2 µs | 96.3 µs | -27.17% |
Comparing EdgeN8v:fix/14307-aix-crash (0f7ee83) with main (ff135af)
Footnotes
-
20 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
|
@EdgeN8v please fix CI. It seems the code is not formatted properly - you should be able to fix that by running |
|
@axw Done. I've ran make fmt. |
Co-authored-by: Damien Mathieu <[email protected]>
evan-bradley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working to address this @EdgeN8v.
Since we have a while until the next release, I'd like to see if we can more robustly address this. If someone else would like to get this in and revisit it later, feel free to dismiss my review.
service/service.go
Outdated
| if err := proctelemetry.RegisterProcessMetrics(srv.telemetrySettings); err != nil { | ||
| return nil, fmt.Errorf("failed to register process metrics: %w", err) | ||
| // Only register process metrics on supported OSes. | ||
| if runtime.GOOS == osAIX { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a little difficulty understanding a few things here, could you help explain it to me?
In my mind, we should take a look at OS support for processes in gopsutil and do one of the following if we know there's no process support:
- Fail fast and error out, require the user to disable process metrics for that OS.
- Print a warn-level message indicating to the user that these metrics are disabled.
In my eyes, we should opt for (1) and require users to acknowledge they will not receive process metrics on their platform. I know the issue mentions a specific PR that caused a regression and introduced this issue, is there a reason we can't fix that? To me, that seems like the better place to address this.
If that is not an option, why is AIX special here? I would want us to instead only allow the operating systems we know we support (as indicated by the gopsutil file I linked above where I believe the error comes from), and everything else will print the warning message. However, if possible, I would prefer to fix the root issue and not register process metrics if they are disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @evan-bradley, thanks for the detailed review!
I've accepted the changelog suggestion.
Regarding the logic: I implemented the specific osAIX check based on the previous discussion to address the immediate regression reported in the issue, while trying to keep the change minimal.
I agree that an allowlist approach (or failing fast with an explicit error message) is more robust architecturally. However, since @dmathieu has already approved the current implementation, I want to be careful about changing the scope.
Do you prefer I refactor this PR to implement the strict allowlist logic (e.g., checking against supported OSes like Linux/Windows/Darwin), or should we merge this immediate fix for AIX first and handle the broader refactoring in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My approval doesn't prevent further refactors. An allowlist makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the logic to use an explicit allowlist (Linux, Windows, Darwin) as suggested. The osAIX constant has also been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR accordingly: process metrics are now gated via build tags (linux/windows/darwin). Other OSes log a warning and continue startup. Error wrapping has also been restored on supported OSes.
Co-authored-by: Evan Bradley <[email protected]>
|
Your latest changes removed the warning when the metrics are disabled. We do want that behavior. |
Description
This PR addresses the crash issue reported in #14307 where the collector fails to start on unsupported operating systems (e.g., AIX) due to
RegisterProcessMetricsreturning an error.Instead of failing the service initialization, this change logs a warning and allows the collector to continue starting, ensuring that core functionality remains available even if process metrics cannot be collected.
Link to Issue
Fixes #14307
Type of change