Skip to content

Bound avoided-services metric cardinality#2405

Open
MrAlias wants to merge 3 commits into
open-telemetry:mainfrom
MrAlias:audit-cardinality-explosions
Open

Bound avoided-services metric cardinality#2405
MrAlias wants to merge 3 commits into
open-telemetry:mainfrom
MrAlias:audit-cardinality-explosions

Conversation

@MrAlias

@MrAlias MrAlias commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

This bounds the cardinality of the internal avoided_services metric so repeated detection of already-instrumented services cannot create unbounded backend-visible series.

The change adds a small internal limiter shared by the Prometheus and OTEL internal metrics reporters. The limiter keeps normal labels for the logical service (service.name, service.namespace) and avoided telemetry type until the configured series limit is reached, then records additional detections through the OpenTelemetry overflow attribute (otel.metric.overflow=true). The Prometheus exporter uses the equivalent otel_metric_overflow label.

The metric intentionally does not report service.instance.id / service_instance_id; that value is unique per service instance and would turn service-instance churn into time-series churn in Prometheus-compatible backends.

The new configuration lives under internal_metrics.avoided_services:

  • disabled: disables the avoided-services internal metric
  • limit: bounds avoided-services metric series, including the overflow series; defaults to the OTel metric cardinality default of 2000

Docs and generated config schema have been updated alongside focused Prometheus and OTEL internal-metrics tests.

Fixes #2037.

Validation

  • go test ./pkg/internal/avoidedsvc ./pkg/export/imetrics ./pkg/export/otel
  • go test ./pkg/obi
  • make check-config-schema
  • make lint-markdown

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.78261% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.08%. Comparing base (da44b74) to head (926fa13).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
pkg/internal/avoidedsvc/limiter.go 84.44% 6 Missing and 1 partial ⚠️
pkg/export/otel/metrics_internal.go 77.77% 5 Missing and 1 partial ⚠️
pkg/export/imetrics/iprom.go 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2405      +/-   ##
==========================================
+ Coverage   69.00%   69.08%   +0.07%     
==========================================
  Files         331      332       +1     
  Lines       43529    43602      +73     
==========================================
+ Hits        30039    30123      +84     
+ Misses      11694    11670      -24     
- Partials     1796     1809      +13     
Flag Coverage Δ
integration-test 51.34% <53.33%> (+0.11%) ⬆️
integration-test-arm 27.11% <51.08%> (-0.10%) ⬇️
integration-test-vm-5.15-lts 28.60% <53.33%> (+0.19%) ⬆️
integration-test-vm-6.18-lts 27.73% <53.33%> (-0.63%) ⬇️
k8s-integration-test 37.29% <25.33%> (+0.08%) ⬆️
oats-test 36.62% <0.00%> (+0.27%) ⬆️
unittests 62.55% <94.66%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@MrAlias MrAlias marked this pull request as ready for review June 16, 2026 19:29
@MrAlias MrAlias requested a review from a team as a code owner June 16, 2026 19:29
@MrAlias MrAlias added this to the v0.10.0 milestone Jun 16, 2026
@MrAlias MrAlias added bug Something isn't working go Related to Go code area: metrics Metrics generation, cardinality, and metric-specific behavior labels Jun 16, 2026

@mariomac mariomac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! However, do we really need to add some high cardinality attributes like the service instance ID? Maybe instead of a limiter, we can just:

  • Remove unnecessary attributes in the metric. Internal metrics are usually atributeless counters.
  • Wrap the metric inside an Expirer (pkg/export/otel/expirer.go), as we already do for the rest of metric. This would prevent this metric to grow indefinitely over time.

MrAlias commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @mariomac. I agree with removing service.instance.id here.

After checking the OTel service semantic conventions again, service.instance.id is the wrong dimension for this metric. It is expected to be unique per service instance within a service.namespace / service.name pair, so using it on an always-on internal metric makes the backend-facing cardinality problem much worse.

I do not think Expirer should replace the limiter, though. Expirer helps local exporter state by removing stale labelsets, but it does not protect Prometheus-compatible backends from series they have already scraped. Once a backend has seen a metric name plus labelset, local expiration only stops future samples for that series; it does not undo the historical cardinality until backend retention/compaction catches up.

So I think the right shape for this PR is:

  • remove service.instance.id / service_instance_id from the metric
  • keep service.name, service.namespace, and telemetry.type so the metric still identifies the logical service/signal being avoided
  • keep the pre-export limiter/overflow path so new identities beyond the cap are collapsed before they become backend-visible series
  • treat Expirer as a possible follow-up cleanup improvement, not the cardinality-control mechanism

That keeps the metric useful for diagnosing which logical services OBI skipped, while avoiding the per-instance churn and still bounding what we expose to backends.

MrAlias commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Pushed the follow-up in 6174f39af.

Changes made:

  • removed service.instance.id from the OTEL obi.avoided.services attributes
  • removed service_instance_id from the Prometheus obi_avoided_services labels
  • updated the limiter identity so cardinality is scoped to logical service namespace/name plus telemetry type
  • added a limiter regression test proving instance ID changes do not consume additional cardinality slots
  • updated docs and the internal-metrics integration assertion to match the lower-cardinality metric shape
  • updated the PR description to call out that instance IDs are intentionally omitted because they would create backend-visible time-series churn

Validation:

  • go test ./pkg/internal/avoidedsvc ./pkg/export/imetrics ./pkg/export/otel
  • go test ./pkg/obi
  • make check-config-schema
  • make lint-markdown

MrAlias added 3 commits June 18, 2026 12:29
Add a dedicated limiter for
`
Avoided-services internal metrics should not expose per-instance service
identity because that turns local service churn into backend-visible
time series churn.

The limiter now keys only on logical service namespace, service name,
and telemetry type, while Prometheus and OTEL exporters omit the
instance attribute entirely. The docs and tests now cover the
lower-cardinality metric shape.
@MrAlias MrAlias force-pushed the audit-cardinality-explosions branch from 6174f39 to 926fa13 Compare June 18, 2026 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: metrics Metrics generation, cardinality, and metric-specific behavior bug Something isn't working go Related to Go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

avoided_services has unbounded service-identity cardinality

2 participants