feat(metrics): add UpDownCounter metric instrument type with replay-safe semantics #1980
gibbonjj wants to merge 7 commits into temporalio:main
@Sushisource, after merging https://github.com/temporalio/sdk-core/pull/1180, can I get 👀 on this when you get a chance 🙏? Let me know if you agree with this approach and can start implementing the other lang SDKs.
@gibbonjj Could you please update the |
Add UpDownCounter support across the TypeScript SDK bridge layer:
- Rust NAPI bridge (core-bridge/src/metrics.rs): Import CoreUpDownCounter, add UpDownCounter struct, new_metric_up_down_counter and add_metric_up_down_counter_value functions, register in init, handle UpDownCounter in buffered metrics (MetricKind enum, BufferedMetric::new, and MetricUpdateVal::SignedDelta conversion)
- Native TS declarations (core-bridge/ts/native.ts): Add MetricUpDownCounter interface, newMetricUpDownCounter and addMetricUpDownCounterValue declarations, add 'up_down_counter' to BufferedMetricKind
- Common types (common/src/metrics.ts): Add 'up_down_counter' to MetricKind, add MetricUpDownCounter interface, createUpDownCounter to MetricMeter interface, NoopMetricMeter implementation, MetricUpDownCounterWithComposedTags class, and createUpDownCounter to MetricMeterWithComposedTags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire the UpDownCounter metric type through three SDK layers:
- RuntimeMetricMeter.createUpDownCounter and RuntimeMetricUpDownCounter class
- WorkflowMetricMeterImpl.createUpDownCounter, WorkflowMetricUpDownCounter class, and sink interface
- Worker-side addMetricUpDownCounterValue sink handler in initMetricSink

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds UpDownCounter metric type to the TypeScript SDK, bridging through NAPI to the Rust Core SDK's UpDownCounter (temporalio/sdk-core#1180).

UpDownCounter uses in-process state (Prometheus IntGaugeVec) that doesn't survive process restarts, but Temporal workflows do. To handle this correctly, the workflow sandbox tracks cumulative net values per tag combination and always emits the absolute net value to the worker sink (even during replay). The sink tracks per-workflow contributions and applies only the delta from what it previously knew. On eviction, contributions are undone so the metric reflects only active workflows.

This handles all cases correctly:
- Normal execution: net value changes, sink applies delta
- Cache eviction (same process): replay rebuilds same net value, delta=0, no double-count
- Worker restart (new process): replay rebuilds net value from scratch, full delta applied
- Workflow completion: eviction cleanup undoes contribution

Additional changes:
- Singleton enforcement per metric name per workflow (WeakMap on Activator), following OTel conventions
- Worker calls cleanupWorkflowMetrics on both eviction paths
- 13 unit tests covering all replay scenarios
Includes changes from temporalio/sdk-core#1180. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fbf99af to 94ee387
Done
Thanks, @gibbonjj. That's certainly a pertinent feature addition. I have to admit, however, that I'm concerned by how invasive this PR turns out to be. I'm definitely not comfortable with some of the changes I see, notably how the tracking of counters emitted by workflows is implemented. I also wonder if we might need to similarly track counters emitted by activities. To make the review process easier on all of us, I'd suggest we break this effort into 2 or 3 distinct PRs:
1 and 2 can maybe be done together if they contain nothing controversial. Step 3 will be a distinct PR and will likely need a few rounds to get the design straightened out. Does that seem reasonable to you?
Summary
- Adds UpDownCounter metric instrument type to the TypeScript SDK, enabling tracking of inflight/active counts with signed deltas
- Bridges through NAPI to the Rust Core SDK's UpDownCounter (i64); depends on feat(metrics): add UpDownCounter metric instrument type sdk-core#1180
Design
UpDownCounter metrics use in-process state (Prometheus IntGaugeVec) that doesn't survive process restarts, but Temporal workflows do: they replay on a new worker and continue executing. A naive callDuringReplay: false causes the metric to go negative (the decrement fires on a new process that never saw the increment). A naive callDuringReplay: true causes double-counting on cache eviction (the increment is still in process memory and replay re-increments).
The solution uses a net-value tracking approach:
- Workflow sandbox: WorkflowMetricUpDownCounter tracks its cumulative net value per tag combination and always emits the absolute net value to the sink, with no isReplaying guard. On replay, the sandbox starts fresh and rebuilds the correct net value from the replayed add() calls.
- Worker sink: runs with callDuringReplay: true. Tracks per-workflow contributions in a Map<key, TrackedContribution>. On each call, computes delta = receivedNetValue - previouslyTrackedNetValue and applies only the delta to the real metric. This makes it idempotent: replaying the same net value produces delta=0.
- Eviction cleanup: When a workflow is evicted (cache pressure, completion, or shutdown), cleanupWorkflow(runId) undoes all tracked contributions for that workflow. The metric only reflects workflows actively cached on this worker.
Additional design decisions:
- Singleton enforcement per metric name per workflow via WeakMap<Activator>, following OTel conventions (same name = same instrument). Prevents incorrect net-value tracking when multiple instances share a name.
- […] maxCachedWorkflows. Cleaned up on eviction.
Changes
- packages/common/src/metrics.ts: MetricUpDownCounter interface, MetricKind updated, NoopMetricMeter, MetricUpDownCounterWithComposedTags
- packages/core-bridge/src/metrics.rs: UpDownCounter struct, new_metric_up_down_counter, add_metric_up_down_counter_value (casts f64 → i64), buffered metrics MetricKind::UpDownCounter
- packages/core-bridge/ts/native.ts: TS declarations for bridge functions, BufferedMetricKind updated
- packages/worker/src/runtime-metrics.ts: RuntimeMetricUpDownCounter, with no non-negative check (unlike Counter)
- packages/workflow/src/metrics.ts: WorkflowMetricUpDownCounter with net-value tracking, singleton enforcement, always emits (no replay skip)
- packages/worker/src/workflow/metrics.ts: Per-workflow delta tracking, callDuringReplay: true, cleanupWorkflow() for eviction, MetricSinkWithCleanup return type
- packages/worker/src/worker-options.ts: Wires cleanupWorkflowMetrics through CompiledWorkerOptions
- packages/worker/src/worker.ts: Calls cleanup on both eviction paths (eviction-only activation and eviction-with-jobs)
Usage
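A minimal sketch of how such an instrument might be used to track in-flight work, assuming an interface with add(value, tags) and withTags(tags) as listed under Changes. Everything here is a hypothetical stand-in: MetricTags, InMemoryUpDownCounter, tagKey, and trackInflight are illustration-only names, and the in-memory class is not the SDK's implementation.

```typescript
// Hypothetical stand-in types, shaped after the MetricUpDownCounter interface named above.
type MetricTags = Record<string, string | number | boolean>;

interface MetricUpDownCounter {
  add(value: number, tags?: MetricTags): void;
  withTags(tags: MetricTags): MetricUpDownCounter;
}

// Canonical key so that tag order does not matter.
function tagKey(tags: MetricTags): string {
  const entries = Object.entries(tags).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return JSON.stringify(entries);
}

// In-memory implementation, only here to make the sketch runnable.
class InMemoryUpDownCounter implements MetricUpDownCounter {
  private readonly totals: Map<string, number>;
  private readonly baseTags: MetricTags;

  constructor(totals: Map<string, number> = new Map(), baseTags: MetricTags = {}) {
    this.totals = totals;
    this.baseTags = baseTags;
  }

  // Signed delta: negative values are allowed, unlike a plain Counter.
  add(value: number, tags: MetricTags = {}): void {
    const key = tagKey({ ...this.baseTags, ...tags });
    this.totals.set(key, (this.totals.get(key) ?? 0) + value);
  }

  // Returns a view that shares totals but always applies the extra tags.
  withTags(tags: MetricTags): InMemoryUpDownCounter {
    return new InMemoryUpDownCounter(this.totals, { ...this.baseTags, ...tags });
  }

  read(tags: MetricTags = {}): number {
    return this.totals.get(tagKey({ ...this.baseTags, ...tags })) ?? 0;
  }
}

// Increment on entry, decrement on exit, so the value tracks in-flight work.
function trackInflight<T>(counter: MetricUpDownCounter, work: () => T): T {
  counter.add(1);
  try {
    return work();
  } finally {
    counter.add(-1);
  }
}

const inflight = new InMemoryUpDownCounter();
const orders = inflight.withTags({ queue: 'orders' });

const seenDuringWork = trackInflight(orders, () => orders.read());
const seenAfterWork = orders.read();
console.log(seenDuringWork, seenAfterWork); // 1 0
```

The try/finally pairing is the point of the sketch: every increment has a matching decrement even when the work throws, which is what keeps an in-flight gauge from drifting.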
Depends on
Test plan
- […] add(1) + add(1) + add(-1) → Prometheus shows 1
- […] (add(1) before checkpoint, add(-1) after → net 0)
- […] withTags() composition
- […] kind: 'up_down_counter', valueType: 'int', positive and negative values
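The replay scenarios from the Design section can be traced with a small self-contained model. This is only a sketch under simplifying assumptions: ProcessMetric is a plain number standing in for the Prometheus gauge, contributions are keyed by run ID alone rather than per tag combination, and the names SandboxCounter, DeltaSink, and record are hypothetical, not the PR's actual classes.

```typescript
// Stand-in for the in-process metric (assumption: a plain number, not a real gauge).
class ProcessMetric {
  value = 0;
  add(delta: number): void {
    this.value += delta;
  }
}

type Emit = (runId: string, netValue: number) => void;

// Workflow side: tracks the cumulative net value and always emits the absolute value.
class SandboxCounter {
  private net = 0;
  private readonly runId: string;
  private readonly emit: Emit;

  constructor(runId: string, emit: Emit) {
    this.runId = runId;
    this.emit = emit;
  }

  add(value: number): void {
    this.net += value;
    this.emit(this.runId, this.net);
  }
}

// Worker side: applies only the difference from the previously tracked net value.
class DeltaSink {
  private readonly tracked = new Map<string, number>();
  private readonly metric: ProcessMetric;

  constructor(metric: ProcessMetric) {
    this.metric = metric;
  }

  record(runId: string, netValue: number): void {
    const previous = this.tracked.get(runId) ?? 0;
    this.metric.add(netValue - previous);
    this.tracked.set(runId, netValue);
  }

  // Undo whatever this workflow contributed, then forget it.
  cleanupWorkflow(runId: string): void {
    const previous = this.tracked.get(runId);
    if (previous === undefined) return;
    this.metric.add(-previous);
    this.tracked.delete(runId);
  }
}

const metric = new ProcessMetric();
const sink = new DeltaSink(metric);
const emit: Emit = (runId, net) => sink.record(runId, net);

// Normal execution: add(1) + add(1) + add(-1).
const first = new SandboxCounter('run-1', emit);
first.add(1);
first.add(1);
first.add(-1);
const afterNormal = metric.value; // 1

// Replay in the same process: a fresh sandbox re-emits the same net values.
const replayed = new SandboxCounter('run-1', emit);
replayed.add(1);
replayed.add(1);
replayed.add(-1);
const afterReplay = metric.value; // still 1, no double-count

// Eviction: the workflow's contribution is undone.
sink.cleanupWorkflow('run-1');
const afterEviction = metric.value; // 0

// Worker restart: new process state, replay applies the full net value.
const freshMetric = new ProcessMetric();
const freshSink = new DeltaSink(freshMetric);
const restarted = new SandboxCounter('run-1', (id, net) => freshSink.record(id, net));
restarted.add(1);
restarted.add(1);
restarted.add(-1);
const afterRestart = freshMetric.value; // 1
```

Note that during the same-process replay the metric passes through 2 before settling back at 1, since each replayed add() emits an intermediate net value; only the end state is guaranteed to match.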