Skip to content

feat(metrics): dual-emit timer metrics as histograms (final batch)#8018

Open
neil-xie wants to merge 9 commits intocadence-workflow:masterfrom
neil-xie:metrics_migration_batch7
Open

feat(metrics): dual-emit timer metrics as histograms (final batch)#8018
neil-xie wants to merge 9 commits intocadence-workflow:masterfrom
neil-xie:metrics_migration_batch7

Conversation

@neil-xie
Copy link
Copy Markdown
Member

What changed?
Adds histogram variants for 26 timer metrics across five subsystems, dual-emitting both the existing RecordTimer/StartTimer and new RecordHistogramDuration/IntExponentialHistogram calls. The histogram metric names are registered
in HistogramMigrationMetrics so the system defaults to timer-only until histograms are enabled per metric via dynamic config.

Shard info (service/history/shard/context.go, queue/processor_base.go, queuev2/queue_base.go):

  • Pending task counts (6): replication, transfer active/standby, timer active/standby, cross-cluster → IntExponentialHistogram with Mid1To16k buckets
  • Lag metrics: replication lag (ack-level delta) and transfer lag (ack-level delta) → _counts / Mid1To16k; timer lag (duration) → _ns / Low1ms100s
  • Diff metrics: transfer diff (ack-level delta) → _counts; timer diff (duration) → _ns
  • Failover in-progress counts (2) → Mid1To16k; failover latency (2) → Low1ms100s

History archiver (common/archiver/gcloud/historyArchiver.go, s3store/historyArchiver.go):

  • Upload size and history size → IntExponentialHistogram with Mid8B16MB buckets; int64→int casts capped at math.MaxInt
  • Blob size (S3) → same

Domain cache (common/cache/domainCache.go):

  • Prepare callbacks latency and callbacks latency → RecordHistogramDuration with Low1ms100s

Event notification (service/history/events/notifier.go):

  • Queueing latency and fanout latency → RecordHistogramDuration

Direct query dispatch (service/history/engine/engineimpl/query_workflow.go):

  • 4 dispatch path latencies (overall, sticky, non-sticky, clear-stickiness) → RecordHistogramDuration

Why?
This is batch 7 of the ongoing timer→histogram migration (#7741)

How did you test it?
go test ./common/metrics/... ./common/task/... ./service/history/execution/... ./service/history/shard/... ./service/history/workflowcache/...

Potential risks
Low. All changes are additive (dual-emit). New histogram definitions are validated at init() time.

Release notes
N/A — internal observability improvement, no behavior change.

Documentation Changes
This is batch 6 of the ongoing timer→histogram migration (#7741)

Signed-off-by: Neil Xie <neil.xie@uber.com>
Comment thread service/matching/tasklist/matcher.go Outdated
@neil-xie neil-xie force-pushed the metrics_migration_batch7 branch from 4002825 to ff18fa1 Compare April 27, 2026 21:30
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Comment thread service/worker/diagnostics/workflow.go Outdated
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
@neil-xie neil-xie changed the title feat(metrics): dual-emit timer metrics as histograms (batch 7) feat(metrics): dual-emit timer metrics as histograms (final batch) Apr 29, 2026
Comment thread common/metrics/defs.go
@@ -3237,6 +3268,7 @@ var MetricDefs = map[ServiceIdx]map[MetricIdx]metricDefinition{
CadenceRequests: {metricName: "cadence_requests", metricType: Counter},
CadenceFailures: {metricName: "cadence_errors", metricType: Counter},
CadenceLatency: {metricName: "cadence_latency", metricType: Timer},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 100s enough?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread common/metrics/defs.go Outdated
Comment thread common/metrics/scope.go
if m.migrationConfig.Histogram.EmitHistogram(def.metricRollupName.String()) {
m.rootScope.Histogram(def.metricRollupName.String(), m.getBuckets(id)).RecordDuration(value)
}
case m.isDomainTagged:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codex detected a bug:

several of the new histogram emits still lose the aggregate domain=all series once the timer side is turned off. RecordTimer explicitly dual-emits for domain-tagged scopes, but RecordHistogramDuration only emits a
     second series when metricRollupName is set, so plain domain-tagged histogram defs do not preserve the old timer behavior. You can see the mismatch in common/metrics/scope.go:102 and common/metrics/scope.go:121. The affected
     new callsites I found are service/frontend/templates/metered.tmpl:98 for CadenceLatencyHistogram whose def has no rollup at common/metrics/defs.go:3271, service/frontend/wrappers/accesscontrolled/access_controlled.go:50 for
     CadenceAuthorizationLatencyHistogram with no rollup at common/metrics/defs.go:3359, service/history/failover/coordinator.go:324 for GracefulFailoverLatencyHistogram with no rollup at common/metrics/defs.go:3425, and service/
     worker/diagnostics/workflow.go:111 for DiagnosticsWorkflowExecutionLatencyHistogram with no rollup at common/metrics/defs.go:4159. In practice, when histogram migration disables the timers, the per-domain histogram survives
     but the aggregate series disappears.

Signed-off-by: Neil Xie <neil.xie@uber.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 29, 2026

CI failed: The test suite failed due to a panic caused by an unexpected extra call to the 'RangeCompleteHistoryTask' mock, likely resulting from the new timer cleanup logic in the metrics migration.

Overview

Tests failed during coverage collection due to a panic triggered by an unmet mock expectation. The issue is localized to the transferQueueProcessor tests.

Failures

Unexpected mock call: 'RangeCompleteHistoryTask' (confidence: high)

  • Type: test
  • Affected jobs: 73667154268
  • Related to change: yes
  • Root cause: The new metrics-related cleanup logic triggers an additional call to RangeCompleteHistoryTask that was not anticipated in the existing mock configuration.
  • Suggested fix: Update the affected test case to account for the additional mock call, either by using .Times(n) on the expectation or adding an additional Mock.On(...) call to handle the incremented invocation count.

Summary

  • Change-related failures: 1 (test panic due to unmocked persistence call)
  • Infrastructure/flaky failures: 0
  • Recommended action: Update the mock expectations in the test suite to match the new behavior of the transferQueueProcessor introduced by the PR.
Code Review ✅ Approved 2 resolved / 2 findings

Dual-emits timer metrics as histograms, correcting indentation errors and resolving potential replay duration inaccuracies by standardizing time tracking. No further issues found.

✅ 2 resolved
Quality: Histogram calls have wrong indentation inside if/case blocks

📄 service/matching/tasklist/matcher.go:349 📄 service/matching/tasklist/matcher.go:381 📄 service/matching/tasklist/matcher.go:396 📄 service/matching/tasklist/matcher.go:444 📄 service/matching/tasklist/matcher.go:473 📄 service/matching/tasklist/matcher.go:506 📄 service/matching/tasklist/matcher.go:527 📄 service/matching/tasklist/matcher.go:569 📄 service/matching/tasklist/matcher.go:596 📄 service/matching/tasklist/matcher.go:617 📄 service/matching/handler/engine.go:455 📄 service/matching/handler/engine.go:532 📄 service/matching/handler/engine.go:1301 📄 service/matching/handler/engine.go:1335
All new RecordHistogramDuration and IntExponentialHistogram calls in matcher.go and engine.go are indented one level less than the surrounding code. While Go doesn't use indentation for scoping (so this compiles and runs correctly), it makes the code misleading — it visually appears the histogram call is outside the if/case block when it's actually inside it. This affects ~15 call sites across two files.

Bug: Mixing workflow.Now() with time.Since() produces wrong durations on replay

📄 service/worker/diagnostics/workflow.go:113-117
In the defer block, diagStart is set via workflow.Now(ctx) (deterministic/replayed time), but the histogram records time.Since(diagStart) which uses time.Now() (wall clock). During workflow replay, workflow.Now(ctx) returns the original execution timestamp (potentially days/weeks ago), so time.Since(diagStart) will produce a huge, meaningless duration that pollutes the histogram.

Note: the pre-existing sw.Stop() (StartTimer) has a similar issue but uses wall-clock consistently (time.Now at start and stop), so it at least measures replay duration. This change made it worse by mixing clock sources.

The original time.Now() was actually more correct for the time.Since pattern. Either revert to time.Now() for diagStart, or guard the histogram emission with a replay check.

Rules ⚠️ 1/2 requirements met

Repository Rules

PR Description Quality Standards: The 'How did you test it?' section provides package paths instead of exact test commands, and the 'Documentation Changes' section re-states the PR purpose instead of addressing documentation needs.
GitHub Issue Linking Requirement: The PR description contains a link to the cadence-workflow issue #7741.

1 rule not applicable. Show all rules by commenting gitar display:verbose.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants