feat(metrics): dual-emit timer metrics as histograms (final batch)#8018
feat(metrics): dual-emit timer metrics as histograms (final batch)#8018neil-xie wants to merge 9 commits intocadence-workflow:masterfrom
Conversation
Signed-off-by: Neil Xie <neil.xie@uber.com>
4002825 to
ff18fa1
Compare
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
Signed-off-by: Neil Xie <neil.xie@uber.com>
| @@ -3237,6 +3268,7 @@ var MetricDefs = map[ServiceIdx]map[MetricIdx]metricDefinition{ | |||
| CadenceRequests: {metricName: "cadence_requests", metricType: Counter}, | |||
| CadenceFailures: {metricName: "cadence_errors", metricType: Counter}, | |||
| CadenceLatency: {metricName: "cadence_latency", metricType: Timer}, | |||
There was a problem hiding this comment.
http://cadence_lasg.uberinternal.com/search?q=repo:%5Ecode%5C.uber%5C.internal/uber-code/go-code%24+file:%5Esrc/code%5C.uber%5C.internal/cadence+cadence-latency&patternType=keyword&sm=0 I don't know if this is actually being used, we use the client metrics in our dashboard
There was a problem hiding this comment.
https://github.com/cadence-workflow/cadence-go-client/pull/1489/changes#diff-34cd3e548ec04053f2529b12d123197eb86c86b88e517b8fa301c7b81b3bccda I think this is per operation, it should be good, matched with client
| if m.migrationConfig.Histogram.EmitHistogram(def.metricRollupName.String()) { | ||
| m.rootScope.Histogram(def.metricRollupName.String(), m.getBuckets(id)).RecordDuration(value) | ||
| } | ||
| case m.isDomainTagged: |
There was a problem hiding this comment.
codex detected a bug:
several of the new histogram emits still lose the aggregate domain=all series once the timer side is turned off. RecordTimer explicitly dual-emits for domain-tagged scopes, but RecordHistogramDuration only emits a
second series when metricRollupName is set, so plain domain-tagged histogram defs do not preserve the old timer behavior. You can see the mismatch in common/metrics/scope.go:102 and common/metrics/scope.go:121. The affected
new callsites I found are service/frontend/templates/metered.tmpl:98 for CadenceLatencyHistogram whose def has no rollup at common/metrics/defs.go:3271, service/frontend/wrappers/accesscontrolled/access_controlled.go:50 for
CadenceAuthorizationLatencyHistogram with no rollup at common/metrics/defs.go:3359, service/history/failover/coordinator.go:324 for GracefulFailoverLatencyHistogram with no rollup at common/metrics/defs.go:3425, and service/
worker/diagnostics/workflow.go:111 for DiagnosticsWorkflowExecutionLatencyHistogram with no rollup at common/metrics/defs.go:4159. In practice, when histogram migration disables the timers, the per-domain histogram survives
but the aggregate series disappears.
Signed-off-by: Neil Xie <neil.xie@uber.com>
CI failed: The test suite failed due to a panic caused by an unexpected extra call to the 'RangeCompleteHistoryTask' mock, likely resulting from the new timer cleanup logic in the metrics migration.OverviewTests failed during coverage collection due to a panic triggered by an unmet mock expectation. The issue is localized to the FailuresUnexpected mock call: 'RangeCompleteHistoryTask' (confidence: high)
Summary
Code Review ✅ Approved 2 resolved / 2 findingsDual-emits timer metrics as histograms, correcting indentation errors and resolving potential replay duration inaccuracies by standardizing time tracking. No further issues found. ✅ 2 resolved✅ Quality: Histogram calls have wrong indentation inside if/case blocks
✅ Bug: Mixing workflow.Now() with time.Since() produces wrong durations on replay
Rules
|
| Auto-apply | Compact |
|
|
Was this helpful? React with 👍 / 👎 | Gitar
What changed?
Adds histogram variants for 26 timer metrics across five subsystems, dual-emitting both the existing RecordTimer/StartTimer and new RecordHistogramDuration/IntExponentialHistogram calls. The histogram metric names are registered
in HistogramMigrationMetrics so the system defaults to timer-only until histograms are enabled per metric via dynamic config.
Shard info (service/history/shard/context.go, queue/processor_base.go, queuev2/queue_base.go):
History archiver (common/archiver/gcloud/historyArchiver.go, s3store/historyArchiver.go):
Domain cache (common/cache/domainCache.go):
Event notification (service/history/events/notifier.go):
Direct query dispatch (service/history/engine/engineimpl/query_workflow.go):
Why?
This is batch 7 of the ongoing timer→histogram migration (#7741)
How did you test it?
go test ./common/metrics/... ./common/task/... ./service/history/execution/... ./service/history/shard/... ./service/history/workflowcache/...
Potential risks
Low. All changes are additive (dual-emit). New histogram definitions are validated at init() time.
Release notes
N/A — internal observability improvement, no behavior change.
Documentation Changes
This is batch 6 of the ongoing timer→histogram migration (#7741)