feat(exporters): replace opencensus with opentelemetry #1141

daveoy wants to merge 8 commits into kubernetes:master from
Conversation
|
Hi @daveoy. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: daveoy. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
|
@wangzhen127 can i get an /ok-to-test please |
|
/ok-to-test |
|
e2e test is complaining about the presence of otel's default |
|
@hakman @wangzhen127 looks like we're ready for review over here be gentle |
|
Can we get a review? |
|
@daveoy Sorry, but I am not very familiar with this area. I can give it a try, but probably @wangzhen127 or someone from Google has better context. |
|
Thanks for uploading this PR! Sorry for the delay. Got busy in the past week. Will try to get to this within this week. |
|
@dashpole can you help review or suggest somebody? |
I would think we should start testing OpenTelemetry metrics with the OpenTelemetry collector. It should be reasonably easy to integrate into e2e. @dashpole are there good examples where the OTel collector is integrated in e2e tests? We may also eliminate the Stackdriver portion, since the OTel protocol should be reasonably universal.
```diff
@@ -0,0 +1,52 @@
+/*
+Copyright 2019 The Kubernetes Authors All rights reserved.
```

unrelated: we may need to eliminate the year completely
```diff
-go.opentelemetry.io/otel/trace v1.36.0 // indirect
+go.opentelemetry.io/otel/trace v1.37.0 // indirect
 go.yaml.in/yaml/v2 v2.4.2 // indirect
 go.yaml.in/yaml/v3 v3.0.4 // indirect
```

I thought one of these would be eliminated after opencensus is removed. Are there more dependencies on the older one?

maybe I was thinking about the gopkg.in/yaml imports

these different yaml imports are a big mix of direct and indirect dependencies.
|
Adding @rkolchmeyer as well |
|
Not sure if you've seen https://opentelemetry.io/docs/specs/otel/compatibility/opencensus/#migration-path, but that might be helpful. |
|
You can test OTLP in an integration test similar to this, but for metrics: https://github.com/kubernetes/kubernetes/blob/9624c3dcdc9a9e1286e8ca32d07d231e69ed2f0c/test/integration/apiserver/tracing/tracing_test.go#L354. Not sure if integrating the OTel collector will be easier or not. |
|
For unit tests, consider using go.opentelemetry.io/otel/sdk/metric/metricdata/metricdatatest. The test can set up an SDK with a |
|
If you need to write integration tests for google cloud monitoring, you can do something similar to https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/7130d1aded77c51f613a9d949f166d119cd595e9/internal/cloudmock/metrics.go |
```go
}

// MeterName is the standard meter name used across the application
const MeterName = "node-problem-detector"
```
We usually recommend that the meter name is the package name where metrics are defined. But given this is used as an implementation of your metric interface, that might not make sense.
|
Overall, this looks correct to me. With better testing of the outputs (ideally done as a separate PR prior to this migration PR), including prometheus and GCM you could do this migration with higher confidence that it isn't breaking anything |
|
thank you all for your comments, keep them coming! i should mention that this is running in my environment and the problemdaemons and metrics (problem_counter and problem_gauge) are operating as expected. i'll see if i can mock the Google cloud monitoring stuff for higher confidence on the stackdriver side. |
rkolchmeyer left a comment:
It looks like there are issues on GCE:
```
rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[57] (metric.type="compute.googleapis.com/guest/disk/operation_bytes_count", metric.labels={"direction": "write", "device_name": "sda8", "service_name": "node-problem-detector"}): unrecognized metric labels [service_name];
```
and that error repeats for a bunch of other metrics. I guess this PR adds a new "service_name" label that Cloud Monitoring doesn't expect. Is that label necessary for all opentelemetry clients, or can node-problem-detector avoid sending that on GCE?
```go
	Record(labelValues map[string]string, value int64) error
}

// NewInt64Metric create a Int64Metric metric, returns nil when viewName is empty.
```

It looks like the "returns nil when viewName is empty" behavior wasn't retained - the program exits in this situation in the new implementation. Would it make sense to keep this behavior? Same for the float64 metric.
this should be addressed, thanks for the note
|
You will want to add the
|
@daveoy Do you get any chance to make progress on this PR? Just checking if we could make this into 1.35. :) |
|
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed

You can:

- Mark this PR as fresh with `/remove-lifecycle stale`
- Close this PR with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
|
/remove-lifecycle stale @daveoy have you got time to move this forward? :) |
|
Yessir, I will pick this up tomorrow and see if I can’t get these last few things closed out by end of next week. Apologies for the delay |
```go
gcpExporter, err := gcpmetric.New(
	gcpmetric.WithProjectID(se.config.GCEMetadata.ProjectID),
	gcpmetric.WithMetricDescriptorTypeFormatter(se.getMetricTypeFormatter()),
	gcpmetric.WithFilteredResourceAttributes(gcpmetric.NoAttributes),
```
|
working on the rebase now |
* feat(exporters): replace opencensus with opentelemetry
* fix(stackdriver_exporter): use metrics util constants
* chore: use global meter and init
* feat(exporters): add resource info, use global meter
* chore: update deps
* fix: use a gauge directly, narrow scope info
* cleanup: deps
* fix: try initOnce pattern for global problem metrics
* fix(problemmetrics): init is being called too late
* fix(gauge): remove _ratio suffix
* fix: clean up after rebase
* fix: cleanup after rebase
* fix: cleanup after rebase
* fix: cleanup after rebase
* fix: remove nodename
* fix(test): remove instance id check
* fix(test): add attr values
* fix(test): unexported resources
* fix(test): initialize global problem metrics manager
* fix(lint): make the linter happy
|
okay, i big time cheesed up that rebase but i think it's better now. waiting on tests
should this be moved out of internal? or shall i recreate here so we can build on it |
|
feel free to recreate it here |
|
https://cloud.google.com/blog/products/management-tools/otlp-opentelemetry-protocol-for-google-cloud-monitoring-metrics is probably relevant here. Cloud monitoring now supports OTLP. The only "special" thing you need to do is use gcp auth with the OTLP grpc exporter (example). By default, it writes to google managed prometheus, which is different from what the exporter you currently use exports to (it is cheaper, and writes to a prometheus_target resource). I'm not sure what you are trying to maintain consistency with, so you might need a bit more config on the OTLP exporter to have exact parity. |
|
i'll let @wangzhen127 weigh in on that -- this PR mostly moves the opencensus+stackdriver path to otel+stackdriver (GCM). happy to explore using OTLP here, but wonder if it may cause issues in some customer envs over at GCP?
|
need some GCP eyes on the failing GCP e2e stuff -- looks like unrelated errors to me |
should close #1008
to note: