
Commit e39f14d

[tailsamplingprocessor] Replace misleading policy latency metric (open-telemetry#43510)
#### Description

This PR removes the metric `otelcol_processor_tail_sampling_sampling_decision_latency` and adds a pair of replacement metrics, `processor_tail_sampling_sampling_policy_execution_time_sum` and `processor_tail_sampling_sampling_policy_execution_count`. It implements the feedback received in open-telemetry#42620.

As originally reported in open-telemetry#38502, the removed metric does not measure the latency of a particular policy. Instead, it measures the time elapsed since policy evaluation began, which is mostly not a useful signal. To make matters worse, profiling shows that recording this metric accounts for >20% of the CPU time spent evaluating policies. Since the tailsamplingprocessor is bottlenecked on its single-threaded decision loop, that 20% is much better spent making decisions than recording a misleading metric.

As a replacement, I've added a metric that tracks the total time spent executing each policy, along with a count of total executions. Slow policies can still be identified by their total or average execution time, without the CPU, GC-pressure, and synchronization cost of recording a histogram in the inner loop.

#### Link to tracking issue

Fixes open-telemetry#38502 - closed by accident, and I am not otel enough to reopen it
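To make the trade-off concrete, here is a minimal, self-contained Go sketch of the accumulate-then-flush pattern described above. `policyStats` and `evaluateBatch` are illustrative stand-ins, not the processor's real types; the point is that the hot loop only touches local memory, and telemetry is emitted once per tick:

```go
package main

import (
	"fmt"
	"time"
)

// policyStats accumulates per-policy totals locally, with no locks or
// metric-SDK calls inside the evaluation loop.
type policyStats struct {
	execTime  time.Duration // cumulative time spent in this policy
	execCount int64         // number of evaluations
}

func evaluateBatch(numTraces int, policies []func(), stats []policyStats) {
	// Hot path: for every trace, run every policy; only plain field updates here.
	for t := 0; t < numTraces; t++ {
		for i, evaluate := range policies {
			start := time.Now()
			evaluate() // stand-in for evaluator.Evaluate
			stats[i].execTime += time.Since(start)
			stats[i].execCount++
		}
	}
	// Flush once per tick: one update per policy instead of one histogram
	// record per policy per trace.
	for i, s := range stats {
		fmt.Printf("policy %d: total=%dµs executions=%d\n",
			i, s.execTime.Microseconds(), s.execCount)
	}
}

func main() {
	policies := []func(){
		func() {},                                    // cheap policy
		func() { time.Sleep(50 * time.Microsecond) }, // expensive policy
	}
	evaluateBatch(1000, policies, make([]policyStats, len(policies)))
}
```

With P policies and T traces per tick, the flush step issues P counter updates instead of the P×T histogram records the old code performed.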
1 parent 29092fa commit e39f14d

9 files changed, +219 -69 lines changed
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Use this changelog template to create an entry for release notes.
+
+# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
+change_type: breaking
+
+# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
+component: processor/tail_sampling
+
+# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
+note: Replace policy latency metric with total time spent executing specific sampling policy.
+
+# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
+issues: [42620]
+
+# (Optional) One or more lines of additional information to render under the primary note.
+# These lines will be padded with 2 spaces and then inserted directly into the document.
+# Use pipe (|) for multiline entries.
+subtext: The existing latency metric was misleading and expensive to compute. The new cpu time metric can be used to find expensive policies instead.
+
+# If your change doesn't affect end users or the exported elements of any package,
+# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
+# Optional: The change log or logs in which this entry should be included.
+# e.g. '[user]' or '[user, api]'
+# Include 'user' if the change is relevant to end users.
+# Include 'api' if there is a change to a library API.
+# Default: '[user]'
+change_logs: []

processor/tailsamplingprocessor/documentation.md

Lines changed: 28 additions & 8 deletions
@@ -75,14 +75,6 @@ Counts the arrival of new traces [Development]
 | ---- | ----------- | ---------- | --------- | --------- |
 | {traces} | Sum | Int | true | Development |
 
-### otelcol_processor_tail_sampling_sampling_decision_latency
-
-Latency (in microseconds) of a given sampling policy [Development]
-
-| Unit | Metric Type | Value Type | Stability |
-| ---- | ----------- | ---------- | --------- |
-| µs | Histogram | Int | Development |
-
 ### otelcol_processor_tail_sampling_sampling_decision_timer_latency
 
 Latency (in milliseconds) of each run of the sampling decision timer [Development]
@@ -107,6 +99,34 @@ Count of sampling policy evaluation errors [Development]
 | ---- | ----------- | ---------- | --------- | --------- |
 | {errors} | Sum | Int | true | Development |
 
+### otelcol_processor_tail_sampling_sampling_policy_execution_count
+
+Total number of executions of a specific sampling policy [Development]
+
+| Unit | Metric Type | Value Type | Monotonic | Stability |
+| ---- | ----------- | ---------- | --------- | --------- |
+| {executions} | Sum | Int | true | Development |
+
+#### Attributes
+
+| Name | Description | Values |
+| ---- | ----------- | ------ |
+| policy | Name of the policy | Any Str |
+
+### otelcol_processor_tail_sampling_sampling_policy_execution_time_sum
+
+Total time spent (in microseconds) executing a specific sampling policy [Development]
+
+| Unit | Metric Type | Value Type | Monotonic | Stability |
+| ---- | ----------- | ---------- | --------- | --------- |
+| µs | Sum | Int | true | Development |
+
+#### Attributes
+
+| Name | Description | Values |
+| ---- | ----------- | ------ |
+| policy | Name of the policy | Any Str |
+
 ### otelcol_processor_tail_sampling_sampling_trace_dropped_too_early
 
 Count of traces that needed to be dropped before the configured wait time [Development]
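Dividing the first new metric by the second recovers the signal the removed histogram was meant to provide: average time per evaluation, per policy. A hypothetical helper (not part of this change) showing the arithmetic on deltas between two scrapes:

```go
package main

import "fmt"

// avgMicros returns the mean execution time per run of one policy, given the
// increase of ..._policy_execution_time_sum (µs) and
// ..._policy_execution_count between two scrapes.
func avgMicros(timeSumDelta, countDelta int64) float64 {
	if countDelta == 0 {
		return 0 // policy did not run in this window
	}
	return float64(timeSumDelta) / float64(countDelta)
}

func main() {
	// Example: a policy accumulated 1500 µs over 300 executions in the window.
	fmt.Printf("avg: %.2f µs per execution\n", avgMicros(1500, 300)) // 5.00
}
```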

processor/tailsamplingprocessor/internal/metadata/generated_telemetry.go

Lines changed: 14 additions & 8 deletions
Some generated files are not rendered by default.

processor/tailsamplingprocessor/internal/metadatatest/generated_telemetrytest.go

Lines changed: 32 additions & 15 deletions
Some generated files are not rendered by default.

processor/tailsamplingprocessor/internal/metadatatest/generated_telemetrytest_test.go

Lines changed: 8 additions & 4 deletions
Some generated files are not rendered by default.

processor/tailsamplingprocessor/metadata.yaml

Lines changed: 22 additions & 10 deletions
@@ -83,16 +83,6 @@ telemetry:
       value_type: int
       monotonic: true
 
-  processor_tail_sampling_sampling_decision_latency:
-    description: Latency (in microseconds) of a given sampling policy
-    stability:
-      level: development
-    unit: µs
-    enabled: true
-    histogram:
-      value_type: int
-      bucket_boundaries: [1, 2, 5, 10, 25, 50, 75, 100, 150, 200, 300, 400, 500, 750, 1000, 2000, 3000, 4000, 5000, 10000, 20000, 30000, 50000]
-
   processor_tail_sampling_sampling_decision_timer_latency:
     description: Latency (in milliseconds) of each run of the sampling decision timer
     stability:
@@ -102,6 +92,7 @@ telemetry:
     histogram:
       value_type: int
       bucket_boundaries: [1, 2, 5, 10, 25, 50, 75, 100, 150, 200, 300, 400, 500, 750, 1000, 2000, 3000, 4000, 5000, 10000, 20000, 30000, 50000]
+
   processor_tail_sampling_sampling_late_span_age:
     description: Time (in seconds) from the sampling decision was taken and the arrival of a late span
     stability:
@@ -121,6 +112,27 @@ telemetry:
       value_type: int
       monotonic: true
 
+  processor_tail_sampling_sampling_policy_execution_count:
+    description: Total number of executions of a specific sampling policy
+    stability:
+      level: development
+    unit: "{executions}"
+    enabled: true
+    sum:
+      value_type: int
+      monotonic: true
+    attributes: [policy]
+
+  processor_tail_sampling_sampling_policy_execution_time_sum:
+    description: Total time spent (in microseconds) executing a specific sampling policy
+    stability:
+      level: development
+    unit: µs
+    enabled: true
+    sum:
+      value_type: int
+      monotonic: true
+    attributes: [policy]
 
   processor_tail_sampling_sampling_trace_dropped_too_early:
     description: Count of traces that needed to be dropped before the configured wait time
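The regenerated instrument code isn't rendered in this view. As a rough mental model (an assumption about the mdatagen output, not this component's actual generated file), each new entry amounts to a monotonic int64 counter with a `policy` attribute, roughly equivalent to this plain OpenTelemetry-Go setup:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	meter := sdkmetric.NewMeterProvider().Meter("tailsampling-sketch")

	// Monotonic int64 sums, mirroring the metadata.yaml entries above.
	execTime, _ := meter.Int64Counter(
		"processor_tail_sampling_sampling_policy_execution_time_sum",
		metric.WithUnit("µs"),
		metric.WithDescription("Total time spent (in microseconds) executing a specific sampling policy"),
	)
	execCount, _ := meter.Int64Counter(
		"processor_tail_sampling_sampling_policy_execution_count",
		metric.WithUnit("{executions}"),
		metric.WithDescription("Total number of executions of a specific sampling policy"),
	)

	// One update per policy per tick, tagged with the policy name.
	policyAttr := metric.WithAttributes(attribute.String("policy", "example-policy"))
	execTime.Add(context.Background(), 1500, policyAttr)
	execCount.Add(context.Background(), 300, policyAttr)
}
```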

processor/tailsamplingprocessor/processor.go

Lines changed: 34 additions & 12 deletions
@@ -269,28 +269,45 @@ type policyDecisionMetrics struct {
 	spansSampled int64
 }
 
-type policyMetrics struct {
+type policyTickMetrics struct {
 	idNotFoundOnMapCount, evaluateErrorCount, decisionSampled, decisionNotSampled, decisionDropped int64
 	tracesSampledByPolicyDecision []map[samplingpolicy.Decision]policyDecisionMetrics
+	cumulativeExecutionTime       []perPolicyExecutionTime
 }
 
-func newPolicyMetrics(numPolicies int) *policyMetrics {
+// perPolicyExecutionTime is a struct for holding the cumulative execution time
+// and number of executions of a policy. This is an optimization to avoid
+// instrumentation overhead in the decision making loop.
+type perPolicyExecutionTime struct {
+	executionTime  time.Duration
+	executionCount int64
+}
+
+func newPolicyTickMetrics(numPolicies int) *policyTickMetrics {
 	tracesSampledByPolicyDecision := make([]map[samplingpolicy.Decision]policyDecisionMetrics, numPolicies)
 	for i := range tracesSampledByPolicyDecision {
 		tracesSampledByPolicyDecision[i] = make(map[samplingpolicy.Decision]policyDecisionMetrics)
 	}
-	return &policyMetrics{
+	return &policyTickMetrics{
 		tracesSampledByPolicyDecision: tracesSampledByPolicyDecision,
+		cumulativeExecutionTime:       make([]perPolicyExecutionTime, numPolicies),
 	}
 }
 
-func (m *policyMetrics) addDecision(policyIndex int, decision samplingpolicy.Decision, spansSampled int64) {
+func (m *policyTickMetrics) addDecision(policyIndex int, decision samplingpolicy.Decision, spansSampled int64) {
 	stats := m.tracesSampledByPolicyDecision[policyIndex][decision]
 	stats.tracesSampled++
 	stats.spansSampled += spansSampled
 	m.tracesSampledByPolicyDecision[policyIndex][decision] = stats
 }
 
+func (m *policyTickMetrics) addDecisionTime(policyIndex int, decisionTime time.Duration) {
+	perPolicyExecutionTime := m.cumulativeExecutionTime[policyIndex]
+	perPolicyExecutionTime.executionTime += decisionTime
+	perPolicyExecutionTime.executionCount++
+	m.cumulativeExecutionTime[policyIndex] = perPolicyExecutionTime
+}
+
 func (tsp *tailSamplingSpanProcessor) loadSamplingPolicy(cfgs []PolicyCfg) error {
 	telemetrySettings := tsp.set.TelemetrySettings
 	componentID := tsp.set.ID.Name()
@@ -380,8 +397,9 @@ func (tsp *tailSamplingSpanProcessor) samplingPolicyOnTick() {
 	}
 
 	ctx := context.Background()
-	metrics := newPolicyMetrics(len(tsp.policies))
+	metrics := newPolicyTickMetrics(len(tsp.policies))
 	startTime := time.Now()
+	globalTracesSampledByDecision := make(map[samplingpolicy.Decision]int64)
 
 	batch, _ := tsp.decisionBatcher.CloseCurrentAndTakeFirstBatch()
 	batchLen := len(batch)
@@ -396,8 +414,7 @@ func (tsp *tailSamplingSpanProcessor) samplingPolicyOnTick() {
 		trace.DecisionTime = time.Now()
 
 		decision := tsp.makeDecision(id, trace, metrics)
-
-		tsp.telemetry.ProcessorTailSamplingGlobalCountTracesSampled.Add(tsp.ctx, 1, decisionToAttributes[decision])
+		globalTracesSampledByDecision[decision]++
 
 		// Sampled or not, remove the batches
 		trace.Lock()
@@ -413,18 +430,24 @@ func (tsp *tailSamplingSpanProcessor) samplingPolicyOnTick() {
 		}
 	}
 
-	tsp.telemetry.ProcessorTailSamplingSamplingDecisionTimerLatency.Record(tsp.ctx, int64(time.Since(startTime)/time.Millisecond))
+	tsp.telemetry.ProcessorTailSamplingSamplingDecisionTimerLatency.Record(tsp.ctx, time.Since(startTime).Milliseconds())
 	tsp.telemetry.ProcessorTailSamplingSamplingTracesOnMemory.Record(tsp.ctx, int64(tsp.numTracesOnMap.Load()))
 	tsp.telemetry.ProcessorTailSamplingSamplingTraceDroppedTooEarly.Add(tsp.ctx, metrics.idNotFoundOnMapCount)
 	tsp.telemetry.ProcessorTailSamplingSamplingPolicyEvaluationError.Add(tsp.ctx, metrics.evaluateErrorCount)
 
+	for decision, count := range globalTracesSampledByDecision {
+		tsp.telemetry.ProcessorTailSamplingGlobalCountTracesSampled.Add(tsp.ctx, count, decisionToAttributes[decision])
+	}
+
 	for i, p := range tsp.policies {
 		for decision, stats := range metrics.tracesSampledByPolicyDecision[i] {
 			tsp.telemetry.ProcessorTailSamplingCountTracesSampled.Add(tsp.ctx, int64(stats.tracesSampled), p.attribute, decisionToAttributes[decision])
 			if telemetry.IsMetricStatCountSpansSampledEnabled() {
 				tsp.telemetry.ProcessorTailSamplingCountSpansSampled.Add(tsp.ctx, stats.spansSampled, p.attribute, decisionToAttributes[decision])
 			}
 		}
+		tsp.telemetry.ProcessorTailSamplingSamplingPolicyExecutionTimeSum.Add(tsp.ctx, metrics.cumulativeExecutionTime[i].executionTime.Microseconds(), p.attribute)
+		tsp.telemetry.ProcessorTailSamplingSamplingPolicyExecutionCount.Add(tsp.ctx, metrics.cumulativeExecutionTime[i].executionCount, p.attribute)
 	}
 
 	tsp.logger.Debug("Sampling policy evaluation completed",
@@ -437,7 +460,7 @@ func (tsp *tailSamplingSpanProcessor) samplingPolicyOnTick() {
 	)
 }
 
-func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *samplingpolicy.TraceData, metrics *policyMetrics) samplingpolicy.Decision {
+func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *samplingpolicy.TraceData, metrics *policyTickMetrics) samplingpolicy.Decision {
 	finalDecision := samplingpolicy.NotSampled
 	samplingDecisions := map[samplingpolicy.Decision]*policy{
 		samplingpolicy.Error: nil,
@@ -449,13 +472,12 @@ func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *sa
 	}
 
 	ctx := context.Background()
-	startTime := time.Now()
 
 	// Check all policies before making a final decision.
 	for i, p := range tsp.policies {
+		startTime := time.Now()
 		decision, err := p.evaluator.Evaluate(ctx, id, trace)
-		latency := time.Since(startTime)
-		tsp.telemetry.ProcessorTailSamplingSamplingDecisionLatency.Record(ctx, int64(latency/time.Microsecond), p.attribute)
+		metrics.addDecisionTime(i, time.Since(startTime))
 
 		if err != nil {
 			if samplingDecisions[samplingpolicy.Error] == nil {
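The last hunk is the heart of the fix. In the old code, `startTime` was captured once, before the policy loop, so the value recorded for policy i included the time spent in policies 0 through i-1. A standalone repro of that skew (illustrative stand-ins, not the processor's code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Three policies that each cost the same ~1ms to evaluate.
	policies := []time.Duration{time.Millisecond, time.Millisecond, time.Millisecond}

	start := time.Now() // old code: captured once, outside the loop
	for i, cost := range policies {
		time.Sleep(cost) // stand-in for evaluator.Evaluate
		// The removed metric recorded time.Since(start) here, attributing
		// roughly 1ms, 2ms, and 3ms to three identical policies.
		fmt.Printf("policy %d recorded %v\n", i, time.Since(start).Round(time.Millisecond))
	}
}
```

Moving `startTime` inside the loop, as this commit does, makes each measurement cover only its own policy before it is accumulated via `addDecisionTime`.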

0 commit comments
