feat(exporters): replace opencensus with opentelemetry #1141

daveoy wants to merge 8 commits into kubernetes:master from
Conversation
|
Hi @daveoy. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: daveoy. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
|
@wangzhen127 can i get an /ok-to-test please |
|
/ok-to-test |
|
e2e test is complaining about the presence of otel's default |
|
@hakman @wangzhen127 looks like we're ready for review over here be gentle |
|
Can we get a review? |
|
@daveoy Sorry, but I am not very familiar with this area. I can give it a try, but probably @wangzhen127 or someone from Google has better context. |
|
Thanks for uploading this PR! Sorry for the delay. Got busy in the past week. Will try to get to this within this week. |
|
@dashpole can you help review or suggest somebody? |
I would think we should start testing OpenTelemetry metrics with the OpenTelemetry collector. It should be reasonably easy to integrate into e2e. @dashpole are there good examples where the OTel collector is integrated in e2e tests? We may also eliminate the Stackdriver portion, since the OTel protocol should be reasonably universal.
```diff
@@ -0,0 +1,52 @@
+/*
+Copyright 2019 The Kubernetes Authors All rights reserved.
```

unrelated: we may need to eliminate the year completely
```diff
-go.opentelemetry.io/otel/trace v1.36.0 // indirect
+go.opentelemetry.io/otel/trace v1.37.0 // indirect
 go.yaml.in/yaml/v2 v2.4.2 // indirect
 go.yaml.in/yaml/v3 v3.0.4 // indirect
```

I thought one of these would be eliminated after opencensus is removed. Are there more dependencies on the older one?

maybe I was thinking about the gopkg.in/yaml imports

these different yaml imports are a big mix of direct and indirect dependencies.
|
Adding @rkolchmeyer as well |
|
Not sure if you've seen https://opentelemetry.io/docs/specs/otel/compatibility/opencensus/#migration-path, but that might be helpful. |
|
You can test OTLP in an integration test similar to this, but for metrics: https://github.com/kubernetes/kubernetes/blob/9624c3dcdc9a9e1286e8ca32d07d231e69ed2f0c/test/integration/apiserver/tracing/tracing_test.go#L354. Not sure if integrating the OTel collector will be easier or not. |
|
For unit tests, consider using go.opentelemetry.io/otel/sdk/metric/metricdata/metricdatatest. The test can set up an SDK with a |
|
If you need to write integration tests for google cloud monitoring, you can do something similar to https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/7130d1aded77c51f613a9d949f166d119cd595e9/internal/cloudmock/metrics.go |
```go
}

// MeterName is the standard meter name used across the application
const MeterName = "node-problem-detector"
```
We usually recommend that the meter name is the package name where metrics are defined. But given this is used as an implementation of your metric interface, that might not make sense.
|
Overall, this looks correct to me. With better testing of the outputs (ideally done as a separate PR prior to this migration PR), including prometheus and GCM you could do this migration with higher confidence that it isn't breaking anything |
|
thank you all for your comments, keep them coming! i should mention that this is running in my environment and the problemdaemons and metrics (problem_counter and problem_gauge) are operating as expected. i'll see if i can mock the Google cloud monitoring stuff for higher confidence on the stackdriver side. |
rkolchmeyer left a comment:
It looks like there are issues on GCE:
```
rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[57] (metric.type="compute.googleapis.com/guest/disk/operation_bytes_count", metric.labels={"direction": "write", "device_name": "sda8", "service_name": "node-problem-detector"}): unrecognized metric labels [service_name];
```
and that error repeats for a bunch of other metrics. I guess this PR adds a new "service_name" label that Cloud Monitoring doesn't expect. Is that label necessary for all opentelemetry clients, or can node-problem-detector avoid sending that on GCE?
```go
	Record(labelValues map[string]string, value int64) error
}

// NewInt64Metric create a Int64Metric metric, returns nil when viewName is empty.
```

It looks like the "returns nil when viewName is empty" behavior wasn't retained - the program exits in this situation in the new implementation. Would it make sense to keep this behavior? Same for the float64 metric.
this should be addressed, thanks for the note
|
You will want to add the
|
@daveoy Do you get any chance to make progress on this PR? Just checking if we could make this into 1.35. :) |
|
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the PR is closed

You can:

- Mark this PR as fresh with `/remove-lifecycle stale`
- Close this PR with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
|
/remove-lifecycle stale @daveoy have you got time to move this forward? :) |
|
Yessir, I will pick this up tomorrow and see if I can’t get these last few things closed out by end of next week. Apologies for the delay |
```go
gcpExporter, err := gcpmetric.New(
	gcpmetric.WithProjectID(se.config.GCEMetadata.ProjectID),
	gcpmetric.WithMetricDescriptorTypeFormatter(se.getMetricTypeFormatter()),
	gcpmetric.WithFilteredResourceAttributes(gcpmetric.NoAttributes),
```
|
working on the rebase now |
* feat(exporters): replace opencensus with opentelemetry
* fix(stackdriver_exporter): use metrics util constants
* chore: use global meter and init
* feat(exporters): add resource info, use global meter
* chore: update deps
* fix: use a gauge directly, narrow scope info
* cleanup: deps
* fix: try initOnce pattern for global problem metrics
* fix(problemmetrics): init is being called too late
* fix(gauge): remove _ratio suffix
* fix: clean up after rebase
* fix: cleanup after rebase
* fix: cleanup after rebase
* fix: cleanup after rebase
* fix: remove nodename
* fix(test): remove instance id check
* fix(test): add attr values
* fix(test): unexported resources
* fix(test): initialize global problem metrics manager
* fix(lint): make the linter happy
|
okay, i big time cheesed up that rebase but i think it's better now. waiting on tests
should this be moved out of internal? or shall i recreate here so we can build on it |
|
feel free to recreate it here |
|
https://cloud.google.com/blog/products/management-tools/otlp-opentelemetry-protocol-for-google-cloud-monitoring-metrics is probably relevant here. Cloud monitoring now supports OTLP. The only "special" thing you need to do is use gcp auth with the OTLP grpc exporter (example). By default, it writes to google managed prometheus, which is different from what the exporter you currently use exports to (it is cheaper, and writes to a prometheus_target resource). I'm not sure what you are trying to maintain consistency with, so you might need a bit more config on the OTLP exporter to have exact parity. |
|
i'll let @wangzhen127 weigh in on that -- this PR mostly moves the opencensus+stackdriver path to otel+stackdriver (GCM). happy to explore using OTLP here, but wonder if it may cause issues in some customer envs over at GCP?
|
need some GCP eyes on the failing GCP e2e stuff -- looks like unrelated errors to me |
should close #1008
to note: