Bound avoided-services metric cardinality #2405
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2405      +/-   ##
==========================================
+ Coverage   69.00%   69.08%   +0.07%
==========================================
  Files         331      332       +1
  Lines       43529    43602      +73
==========================================
+ Hits        30039    30123      +84
+ Misses      11694    11670      -24
- Partials     1796     1809      +13
Looks great! However, do we really need to add high-cardinality attributes like the service instance ID? Maybe instead of a limiter, we can just:
- Remove unnecessary attributes from the metric. Internal metrics are usually attributeless counters.
- Wrap the metric inside an Expirer (pkg/export/otel/expirer.go), as we already do for the rest of the metrics. This would prevent the metric from growing indefinitely over time.
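To illustrate the expiry idea from the second bullet, here is a minimal sketch of a counter that drops series not updated within a TTL. It is not the Expirer from pkg/export/otel/expirer.go, whose API is not shown in this thread; the names `expiringCounter`, `Add`, `Collect`, and `demo` are hypothetical, and time is passed in explicitly only to keep the example deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// expiringCounter is a hypothetical stand-in for the expiry idea:
// a series that has not been updated within ttl is dropped on collection.
type expiringCounter struct {
	ttl      time.Duration
	values   map[string]int64
	lastSeen map[string]time.Time
}

func newExpiringCounter(ttl time.Duration) *expiringCounter {
	return &expiringCounter{
		ttl:      ttl,
		values:   map[string]int64{},
		lastSeen: map[string]time.Time{},
	}
}

// Add increments the series and records when it was last updated.
func (c *expiringCounter) Add(series string, now time.Time) {
	c.values[series]++
	c.lastSeen[series] = now
}

// Collect removes series idle for longer than ttl and returns the survivors.
func (c *expiringCounter) Collect(now time.Time) map[string]int64 {
	out := make(map[string]int64, len(c.values))
	for s, v := range c.values {
		if now.Sub(c.lastSeen[s]) > c.ttl {
			delete(c.values, s)
			delete(c.lastSeen, s)
			continue
		}
		out[s] = v
	}
	return out
}

// demo records two series and collects after the first one has gone stale.
func demo() map[string]int64 {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	c := newExpiringCounter(5 * time.Minute)
	c.Add("cart", start)
	c.Add("checkout", start.Add(4*time.Minute))
	// At +6m, "cart" has been idle for 6m and is dropped; "checkout" (idle 2m) stays.
	return c.Collect(start.Add(6 * time.Minute))
}

func main() {
	fmt.Println(demo())
}
```

Expiry bounds growth over time, but not the number of series alive at any one moment, which is the part a limit addresses.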
Thanks @mariomac. I agree with removing After checking the OTel service semantic conventions again, I do not think So I think the right shape for this PR is:
That keeps the metric useful for diagnosing which logical services OBI skipped, while avoiding the per-instance churn and still bounding what we expose to backends.
Pushed the follow-up in Changes made:
Validation:
Add a dedicated limiter for `
Avoided-services internal metrics should not expose per-instance service identity because that turns local service churn into backend-visible time series churn. The limiter now keys only on logical service namespace, service name, and telemetry type, while Prometheus and OTEL exporters omit the instance attribute entirely. The docs and tests now cover the lower-cardinality metric shape.
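The effect of dropping the instance attribute can be shown with a small counting sketch. The `detection` type and the `restarts` and `countSeries` helpers are hypothetical and exist only for this illustration; they are not taken from the PR's code.

```go
package main

import "fmt"

// detection is a hypothetical label set for one avoided-service detection.
type detection struct {
	Namespace     string
	Name          string
	InstanceID    string
	TelemetryType string
}

// countSeries returns the number of distinct label sets, optionally
// ignoring the per-instance label.
func countSeries(ds []detection, withInstance bool) int {
	seen := map[detection]struct{}{}
	for _, d := range ds {
		if !withInstance {
			d.InstanceID = "" // drop the per-instance label before keying
		}
		seen[d] = struct{}{}
	}
	return len(seen)
}

// restarts simulates n instances of the same logical service, as would be
// seen across restarts or replicas.
func restarts(n int) []detection {
	ds := make([]detection, 0, n)
	for i := 0; i < n; i++ {
		ds = append(ds, detection{
			Namespace:     "shop",
			Name:          "cart",
			InstanceID:    fmt.Sprintf("pod-%d", i),
			TelemetryType: "traces",
		})
	}
	return ds
}

func main() {
	ds := restarts(50)
	// With the instance label every instance is its own series;
	// without it they all collapse into one.
	fmt.Println(countSeries(ds, true), countSeries(ds, false)) // prints "50 1"
}
```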
6174f39 to 926fa13
Summary
This bounds the cardinality of the internal `avoided_services` metric so repeated detection of already-instrumented services cannot create unbounded backend-visible series.

The change adds a small internal limiter shared by the Prometheus and OTEL internal metrics reporters. The limiter keeps normal labels for the logical service (`service.name`, `service.namespace`) and avoided telemetry type until the configured series limit is reached, then records additional detections through the OpenTelemetry overflow attribute (`otel.metric.overflow=true`). The Prometheus exporter uses the equivalent `otel_metric_overflow` label.

The metric intentionally does not report `service.instance.id` / `service_instance_id`; that value is unique per service instance and would turn service-instance churn into time-series churn in Prometheus-compatible backends.

The new configuration lives under `internal_metrics.avoided_services`:
- `disabled`: disables the avoided-services internal metric
- `limit`: bounds avoided-services metric series, including the overflow series; defaults to the OTel metric cardinality default of `2000`

Docs and generated config schema have been updated alongside focused Prometheus and OTEL internal-metrics tests.
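For readers who want the shape of the limiter without opening the diff, the following is a rough sketch of the behavior described above, not the code in pkg/internal/avoidedsvc. The names `SeriesKey`, `Config`, `Limiter`, and `Admit` are hypothetical, and reserving one slot for the overflow series is an assumption drawn from "including the overflow series".

```go
package main

import (
	"fmt"
	"sync"
)

// SeriesKey identifies one series. It deliberately has no instance ID field.
// Hypothetical type, for illustration only.
type SeriesKey struct {
	Namespace     string
	Name          string
	TelemetryType string
}

// Config mirrors the two options described for internal_metrics.avoided_services.
type Config struct {
	Disabled bool // when set, callers would skip the metric entirely
	Limit    int  // maximum number of series, including the overflow series
}

// Limiter admits new keys until the limit is reached.
type Limiter struct {
	mu    sync.Mutex
	limit int
	seen  map[SeriesKey]struct{}
}

func NewLimiter(cfg Config) *Limiter {
	limit := cfg.Limit
	if limit <= 0 {
		limit = 2000 // default stated in the PR description
	}
	return &Limiter{limit: limit, seen: map[SeriesKey]struct{}{}}
}

// Admit reports whether the key may keep its normal labels. A false result
// means the detection should be recorded on the overflow series instead.
func (l *Limiter) Admit(k SeriesKey) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.seen[k]; ok {
		return true
	}
	// Assumption: one slot is reserved for the overflow series.
	if len(l.seen) >= l.limit-1 {
		return false
	}
	l.seen[k] = struct{}{}
	return true
}

func main() {
	l := NewLimiter(Config{Limit: 3})
	for _, name := range []string{"cart", "checkout", "payments", "cart"} {
		k := SeriesKey{Namespace: "shop", Name: name, TelemetryType: "traces"}
		fmt.Printf("%s admitted=%v\n", name, l.Admit(k))
	}
}
```

With a limit of 3, the first two services keep their labels, the third is folded into overflow, and a repeat of an already-admitted service stays admitted.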
Fixes #2037.
Validation
- `go test ./pkg/internal/avoidedsvc ./pkg/export/imetrics ./pkg/export/otel`
- `go test ./pkg/obi`
- `make check-config-schema`
- `make lint-markdown`