Skip to content

Conversation

@EdgeN8v
Copy link

@EdgeN8v EdgeN8v commented Dec 22, 2025

Description

This PR addresses the crash issue reported in #14307 where the collector fails to start on unsupported operating systems (e.g., AIX) due to RegisterProcessMetrics returning an error.

Instead of failing the service initialization, this change logs a warning and allows the collector to continue starting, ensuring that core functionality remains available even if process metrics cannot be collected.

Link to Issue

Fixes #14307

Type of change

  • Bug fix (non-breaking change which fixes an issue)

@EdgeN8v EdgeN8v requested a review from a team as a code owner December 22, 2025 06:54
@EdgeN8v EdgeN8v requested a review from axw December 22, 2025 06:54
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Dec 22, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@codecov
Copy link

codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.11%. Comparing base (58db72d) to head (0f7ee83).
⚠️ Report is 25 commits behind head on main.

Files with missing lines Patch % Lines
service/processmetrics_valid.go 50.00% 1 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (60.00%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14319      +/-   ##
==========================================
- Coverage   92.12%   92.11%   -0.01%     
==========================================
  Files         668      669       +1     
  Lines       41377    41381       +4     
==========================================
+ Hits        38117    38120       +3     
  Misses       2224     2224              
- Partials     1036     1037       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codspeed-hq
Copy link

codspeed-hq bot commented Jan 6, 2026

Merging this PR will degrade performance by 28.88%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

❌ 4 regressed benchmarks
✅ 55 untouched benchmarks
⏩ 20 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
BenchmarkLogsToProto2k 44.9 µs 63.1 µs -28.88%
BenchmarkTraceSizeBytes 315.1 µs 440.6 µs -28.49%
BenchmarkProfilesToProto 1.2 µs 1.7 µs -27.07%
BenchmarkTracesToProto2k 70.2 µs 96.3 µs -27.17%

Comparing EdgeN8v:fix/14307-aix-crash (0f7ee83) with main (ff135af)

Open in CodSpeed

Footnotes

  1. 20 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@axw
Copy link
Contributor

axw commented Jan 6, 2026

@EdgeN8v please fix CI. It seems the code is not formatted properly - you should be able to fix that by running make fmt.

@EdgeN8v
Copy link
Author

EdgeN8v commented Jan 6, 2026

@axw Done. I've ran make fmt.

Copy link
Contributor

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working to address this @EdgeN8v.

Since we have a while until the next release, I'd like to see if we can more robustly address this. If someone else would like to get this in and revisit it later, feel free to dismiss my review.

if err := proctelemetry.RegisterProcessMetrics(srv.telemetrySettings); err != nil {
return nil, fmt.Errorf("failed to register process metrics: %w", err)
// Only register process metrics on supported OSes.
if runtime.GOOS == osAIX {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a little difficulty understanding a few things here, could you help explain it to me?

In my mind, we should take a look at OS support for processes in gopsutil and do one of the following if we know there's no process support:

  1. Fail fast and error out, require the user to disable process metrics for that OS.
  2. Print a warn-level message indicating to the user that these metrics are disabled.

In my eyes, we should opt for (1) and require users to acknowledge they will not receive process metrics on their platform. I know the issue mentions a specific PR that caused a regression and introduced this issue, is there a reason we can't fix that? To me, that seems like the better place to address this.

If that is not an option, why is AIX special here? I would want us to instead only allow the operating systems we know we support (as indicated by the gopsutil file I linked above where I believe the error comes from), and everything else will print the warning message. However, if possible, I would prefer to fix the root issue and not register process metrics if they are disabled.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @evan-bradley, thanks for the detailed review!

I've accepted the changelog suggestion.

Regarding the logic: I implemented the specific osAIX check based on the previous discussion to address the immediate regression reported in the issue, while trying to keep the change minimal.

I agree that an allowlist approach (or failing fast with an explicit error message) is more robust architecturally. However, since @dmathieu has already approved the current implementation, I want to be careful about changing the scope.

Do you prefer I refactor this PR to implement the strict allowlist logic (e.g., checking against supported OSes like Linux/Windows/Darwin), or should we merge this immediate fix for AIX first and handle the broader refactoring in a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My approval doesn't prevent further refactors. An allowlist makes sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the logic to use an explicit allowlist (Linux, Windows, Darwin) as suggested. The osAIX constant has also been removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR accordingly: process metrics are now gated via build tags (linux/windows/darwin). Other OSes log a warning and continue startup. Error wrapping has also been restored on supported OSes.

@dmathieu dmathieu self-requested a review January 7, 2026 12:22
@dmathieu
Copy link
Member

dmathieu commented Jan 7, 2026

Your latest changes removed the warning when the metrics are disabled. We do want that behavior.
Also, it seems to me than @evan-bradley's suggestion was to use build tags to have the allow list (as gopsutil does), not to keep the strings list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Internal telemetry breaks compiled binaries on unsupported OSes

4 participants