Skip to content

feat(metrics): add bound instruments behind experimental feature flag#3421

Open
bryantbiggs wants to merge 11 commits intoopen-telemetry:mainfrom
bryantbiggs:feat/bound-instruments
Open

feat(metrics): add bound instruments behind experimental feature flag#3421
bryantbiggs wants to merge 11 commits intoopen-telemetry:mainfrom
bryantbiggs:feat/bound-instruments

Conversation

@bryantbiggs
Copy link
Copy Markdown
Contributor

@bryantbiggs bryantbiggs commented Mar 14, 2026

Splits from #3392, per maintainer feedback. Built on top of #3420 (in-place delta collection, now merged). Refs #1374.

Summary

Adds BoundCounter and BoundHistogram to the public API behind the experimental_metrics_bound_instruments feature flag. These types cache the resolved aggregator reference for a fixed attribute set, allowing subsequent measurements to bypass per-call sort, dedup, hash, and HashMap lookup entirely.

let counter = meter.u64_counter("requests").build();
let bound = counter.bind(&[KeyValue::new("method", "GET")]);
bound.add(1); // ~1.9ns — no attribute lookup

Architecture

TrackerEntry bound_count

bound_count (introduced in #3420) tracks how many live BoundCounter/BoundHistogram handles reference an entry. Entries with bound_count > 0 are never evicted during delta collection.

Cardinality overflow handling

ValueMap::bind() returns Option>None when at the cardinality limit. Bound handle types use a Direct/Fallback enum:

  • Direct: normal case — dedicated tracker, single-digit ns, no map lookup per call
  • Fallback: at overflow — delegates every call to the unbound Measure::call() path

Feature flag

All public API, traits, noop impls, and internal plumbing are gated behind experimental_metrics_bound_instruments.

Benchmark Results

Apple M4 Max, 16 cores (12 performance + 4 efficiency), macOS 15.4:

Benchmark Time vs Unbound
Counter Unbound 53.2 ns
Counter Bound 1.87 ns ~28x faster
Histogram Unbound 58.6 ns
Histogram Bound 6.57 ns ~8.9x faster
EC2 Counter Results (from #3392)
Instance Unbound Bound Speedup
c7i.large 86.2ns 7.47ns 11.5x
c7i.xlarge 88.9ns 7.62ns 11.7x
c7i.4xlarge 90.1ns 7.65ns 11.8x
c7a.large 77.4ns 2.71ns 28.6x
c7a.xlarge 75.0ns 2.71ns 27.7x
c7a.4xlarge 75.0ns 2.71ns 27.7x
c7g.large 111.8ns 6.87ns 16.3x
c7g.xlarge 111.4ns 6.84ns 16.3x
c7g.4xlarge 112.9ns 6.73ns 16.8x
EC2 Histogram Results (from #3392)
Instance Unbound Bound Speedup
c7i.large 97.1ns 17.1ns 5.7x
c7i.xlarge 99.0ns 17.9ns 5.5x
c7i.4xlarge 98.5ns 18.4ns 5.4x
c7a.large 92.5ns 13.7ns 6.8x
c7a.xlarge 92.4ns 13.7ns 6.7x
c7a.4xlarge 92.5ns 13.7ns 6.8x
c7g.large 133.2ns 27.7ns 4.8x
c7g.xlarge 133.5ns 27.6ns 4.8x
c7g.4xlarge 133.8ns 27.2ns 4.9x

Test Coverage

15 tests covering:

  • Cumulative and delta temporality for both Counter and Histogram
  • Bound + unbound sharing the same data point (same attribute set)
  • Idle delta cycles (no update = no export, but handle persists)
  • Cardinality overflow fallback to unbound path (Counter and Histogram)
  • Recovery after delta eviction frees cardinality space
  • Overflow fallback handles working across multiple delta cycles
  • Drop enabling eviction of stale entries
  • Multiple bound handles sharing the same tracker (ref-counting)
  • Binding with empty attributes

Test plan

  • All 164 metrics tests pass (149 existing + 15 new)
  • Compiles with and without the feature flag
  • Zero regression on unbound measurement path

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 91.98347% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.0%. Comparing base (34837c8) to head (1f02cd8).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
opentelemetry-sdk/src/metrics/mod.rs 93.7% 57 Missing ⚠️
opentelemetry/src/metrics/noop.rs 0.0% 8 Missing ⚠️
...pentelemetry-sdk/src/metrics/internal/aggregate.rs 0.0% 6 Missing ⚠️
opentelemetry-sdk/src/metrics/noop.rs 0.0% 5 Missing ⚠️
opentelemetry/src/metrics/instruments/histogram.rs 60.0% 4 Missing ⚠️
opentelemetry/src/metrics/instruments/mod.rs 0.0% 4 Missing ⚠️
opentelemetry-sdk/src/metrics/internal/mod.rs 95.5% 3 Missing ⚠️
opentelemetry/src/metrics/instruments/counter.rs 66.6% 3 Missing ⚠️
...entelemetry-sdk/src/metrics/internal/last_value.rs 96.5% 2 Missing ⚠️
...emetry-sdk/src/metrics/internal/precomputed_sum.rs 96.4% 2 Missing ⚠️
... and 3 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3421     +/-   ##
=======================================
+ Coverage   83.7%   84.0%   +0.3%     
=======================================
  Files        126     126             
  Lines      25386   26592   +1206     
=======================================
+ Hits       21255   22363   +1108     
- Misses      4131    4229     +98     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bryantbiggs bryantbiggs force-pushed the feat/bound-instruments branch 3 times, most recently from f8a03dc to 08649e8 Compare March 15, 2026 00:42
@cijothomas
Copy link
Copy Markdown
Member

#3420 is merged. @bryantbiggs can you rebase? The spec is progressing in parallel (open-telemetry/opentelemetry-specification#5050), so this is perfect time to get this merged!

Comment thread opentelemetry-sdk/Cargo.toml Outdated
Comment thread opentelemetry/src/metrics/instruments/mod.rs Outdated
Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs
Comment thread opentelemetry-sdk/src/metrics/internal/mod.rs
Comment thread opentelemetry-sdk/src/metrics/mod.rs
@bryantbiggs bryantbiggs force-pushed the feat/bound-instruments branch from 91cb61a to b0f4777 Compare April 29, 2026 14:18
@bryantbiggs
Copy link
Copy Markdown
Contributor Author

Pushed a round of fixes addressing all five comments; reply inline on the bound_count one with my reasoning for keeping ref-counting.

Two extras worth flagging that aren't tied to a comment: ValueMap::bind() now inserts both unsorted and sorted keys for new entries (mirrors measure()'s pattern so unbound calls hit the fast path regardless of which side created the tracker), and BoundCounter/BoundHistogram are now Arc-backed and Clone so a single bound state can be shared across threads without re-binding.

@@ -37,6 +43,51 @@ where
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: bind directly to overflow tracker, eliminate the Fallback path

The current Fallback variant delegates every bound.add() call through the full unbound measure() path (sort → hash → RwLock → HashMap lookup). This creates several problems:

  1. No performance guarantee. A bind() call made before cardinality overflow gets ~2ns direct path. The same call made after overflow gets ~53ns — the user has no way to know which they got. The whole point of bind() is a consistent performance guarantee.

  2. Unpredictable attribution. The Fallback handle calls measure(value, attrs) each time, which means the data could end up in the overflow bucket, or in a dedicated entry if space frees up after delta eviction — and can flip between the two across collection cycles depending on what other unbound callers are doing. That's not "bound" to anything.

  3. Code complexity. The Fallback variant, BoundFallbackHandle, and the fallback: Arc<dyn Measure<T>> parameter threaded through Measure::bind() all exist solely for this path.

Proposed alternative: When ValueMap::bind() hits the cardinality limit, bind directly to the overflow TrackerEntry instead of returning None. This gives the handle the same ~2ns direct write path, consistently lands in the overflow bucket, and removes the Fallback variant entirely. The trade-off is that data stays attributed to overflow permanently (even if space frees up), but:

  • bind() is typically done at startup before cardinality fills up — hitting overflow at bind time is already a degenerate case
  • If a user wants to recover, they can drop and re-bind(), which will get a dedicated tracker if space is available
  • Consistent, predictable behavior is better than accidental self-healing that depends on other callers' patterns

Copy link
Copy Markdown
Contributor Author

@bryantbiggs bryantbiggs Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed two commits.

  1. fix(metrics): bind() at cardinality limit binds directly to overflow (commit 1)

ValueMap::bind_attrs now looks up (or lazily creates) the overflow TrackerEntry at the limit and returns a direct handle. The Fallback variants in BoundSumInner / BoundHistogramInner are removed. Bound handles hold Arc<TrackerEntry> and write directly for their lifetime — same atomic-update hot path as a normal bind. Drop-and-rebind to recover after delta eviction frees space. Spec-aligned with #5050: overflow attribution preserved (MUST), per-call lookup bypassed (SHOULD).

Bench (M4 Max, 3-attr key set):

Before After
Counter_Bound_Delta 1.80 ns 1.84 ns
Counter_Bound_AtOverflow_Delta ~50 ns 1.84 ns
Histogram_Bound_Delta 6.50 ns 6.69 ns
Histogram_Bound_AtOverflow_Delta ~58 ns 6.59 ns

Resolves points 1 and 2. New tests bound_*_overflow_persists_across_eviction_cycles lock in the predictability invariant.

  1. refactor(metrics): extend direct-bind to all aggregators (commit 2)

Removes BoundFallbackHandle and the fallback parameter on Measure::bind entirely. LastValue, PrecomputedSum, and ExpoHistogram now have direct-bind impls matching the Sum/Histogram pattern. PrecomputedSum::bind is unreachable from user code today (async-only) but kept so the trait is uniform and future Gauge / Observable bind() extensions are mechanical. The rare RwLock-poisoned case yields a NoopBoundMeasure (drops silently, same as measure()'s own poison handling).

Tests added: bound ExponentialHistogram via View (delta, cumulative, NaN/inf filter, view filter at bind time, overflow attribution, persistence across delta eviction); cumulative bind-at-overflow for Counter and Histogram; LastValue / PrecomputedSum unit tests through the Measure trait; lock-poison test on ValueMap::bind.

The second commit is the full point-3 cleanup. Happy to keep it here or split into a follow-up if you'd rather land the correctness fix first

Add BoundCounter and BoundHistogram types that cache resolved aggregator
references for a fixed attribute set. Created via Counter::bind() and
Histogram::bind(), bound instruments bypass per-call attribute lookup
for significant performance improvements (~28x for counters, ~9x for
histograms).

Architecture:
- TrackerEntry.bound_count tracks live handles, preventing eviction
- Direct/Fallback enum handles cardinality overflow gracefully
- Unsupported aggregators (ExpoHistogram, LastValue, PrecomputedSum)
  fall back to unbound path instead of panicking

All public API, traits, and internal plumbing are gated behind the
experimental_metrics_bound_instruments feature flag. Includes 15 tests
covering cumulative/delta temporality, overflow fallback, recovery
after eviction, bound+unbound sharing, idle cycles, drop semantics,
and empty attributes.

Splits from open-telemetry#3392. Refs open-telemetry#1374.
cfg-gated imports must sort before non-gated imports of the same module
per rustfmt's stable import ordering rules.
Add bound_count field to TrackerEntry to track live bound instrument
handles. Entries with bound_count > 0 are never evicted during delta
collection, ensuring bound handles always point to a live tracker.

Moved from open-telemetry#3420 per review feedback — bound_count belongs with the
bound instruments feature, not the delta collection refactor.
- Make no_attribute_tracker an Arc<TrackerEntry> so bind(&[]) returns a
  handle to the same tracker that measure(&[]) uses, preventing duplicate
  data points for empty attribute sets
- Re-check bound_count under write lock in collect_and_reset() to prevent
  a TOCTOU race where bind() increments bound_count after collect() reads
  it as 0 under a concurrent read lock
- Use Ordering::Release for has_been_updated stores in bound handles,
  matching the unbound measure() path
CI exercises bound instruments via --all-features, so the testing
feature does not need to enable experimental_metrics_bound_instruments
or pull in the tokio runtime additions.
Foundational OpenTelemetry guarantee is that telemetry must never crash
the host application. A custom SyncInstrument impl that does not override
bind() now returns a NoopBoundSyncInstrument and emits an otel_debug log,
matching the silent-drop behavior of other default impls.
When bind() creates a new tracker, insert it under both the original
and sorted attribute orderings, mirroring measure()'s insert pattern.
Without this, an unbound measure() call with the same attrs in
non-sorted order would miss the fast path and re-sort on every call
when the bound side created the tracker first.
Switch from Box<dyn ...> to Arc<dyn ...> and derive Clone so a single
bound state can be shared across threads or modules without re-binding.
Drop semantics are unchanged: bound_count is decremented when the last
clone is dropped, matching Rust's RAII expectations for handle types.

Mirrors the existing Clone behavior on Counter and Histogram.
Tests:
  * bound_histogram_empty_attributes_shares_with_unbound mirrors the
    existing counter version for histograms.
  * bound_counter_view_filters_attributes_at_bind_time and a histogram
    analog confirm a view filters attributes at bind() time, with bound
    and unbound recordings collapsing onto the same filtered data point.

Bench:
  * Counter_Bound_With_View_Delta confirms view filtering happens at bind
    time, leaving the hot path at unfiltered-bound speed. Updated bench
    required-features and added a note that criterion does not enforce
    regression on its own.

CHANGELOG entries for both crates updated to mention Clone, the noop
trait default, and the converged bound/unbound semantics.
@bryantbiggs bryantbiggs force-pushed the feat/bound-instruments branch from a6ff02d to de9da26 Compare April 30, 2026 18:05
The previous Fallback path re-routed every bound.add() through the
unbound measure() path, creating a ~25x perf cliff at the cardinality
limit (~50ns vs ~1.8ns) and letting attribution drift across delta
eviction cycles as space freed and refilled.

ValueMap::bind() now looks up or lazily creates the overflow
TrackerEntry at the limit and returns a direct handle to it. The bound
handle writes directly to overflow via a single atomic update for its
lifetime — perf parity with a normal bind. To recover after delta
eviction frees space, drop the handle and rebind.

Spec-aligned with open-telemetry/opentelemetry-specification#5050:
overflow data lands in the otel.metric.overflow=true bucket (MUST)
and per-call attribute lookup is bypassed (SHOULD).

Bench (Apple M4 Max):
  Counter_Bound_AtOverflow_Delta:    1.82 ns
  Histogram_Bound_AtOverflow_Delta:  6.58 ns
Removes BoundFallbackHandle and the fallback parameter on
Measure::bind, leaving every aggregator on the same direct-bind
shape: bind() returns a BoundXxxHandle holding an Arc<TrackerEntry>
that writes directly to the aggregator without per-call attribute
lookup.

LastValue, PrecomputedSum, and ExpoHistogram now match the Sum /
Histogram pattern. PrecomputedSum's bind() is unreachable from user
code today (async-only) but the impl exists so the trait is uniform
and future Gauge / Observable bind() extensions are mechanical.

The rare RwLock-poisoned case now yields a NoopBoundMeasure that
drops measurements silently — mirroring measure()'s own poison
handling.

Test coverage added:
- Bound ExponentialHistogram via View: delta with NaN/inf filter,
  bind-at-overflow attribution, persistence across delta eviction,
  view filter applied at bind time, and cumulative accumulation.
- Cumulative bind-at-overflow tests for Counter and Histogram.
- LastValue::bind and PrecomputedSum::bind unit tests exercising
  Measure / BoundMeasure traits and bound_count Drop semantics.
- ValueMap::bind under a poisoned trackers RwLock returns None so
  the caller produces a NoopBoundMeasure rather than panicking.
@bryantbiggs bryantbiggs force-pushed the feat/bound-instruments branch from de9da26 to 1f02cd8 Compare April 30, 2026 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants