Worker heartbeat: New in-memory metrics mechanism, plumb rest of heartbeat data #1023

Merged
yuandrew merged 23 commits into temporalio:worker-heartbeat from yuandrew:worker-heartbeat-telemetry on Oct 17, 2025

Conversation

@yuandrew
Contributor

@yuandrew yuandrew commented Sep 29, 2025

What was changed

NOTE: targeting worker-heartbeat feature branch.

New in-memory mechanism to track certain metrics for worker heartbeating.

Why?

Checklist

  1. Closes

  2. How was this tested:

  3. Any docs updates needed?

Note

Adds in‑memory metric tracking and plumbs full worker heartbeat data (pollers, slots, cpu/mem, plugins), plus new DescribeWorker/SetWorkerDeploymentManager APIs and supporting refactors.

  • Worker Heartbeat (core/worker):
    • Capture full heartbeat payload: poller counts (with last-successful-poll timestamps), slot usage (available/used, totals, last-interval deltas), status, start/heartbeat times, host CPU/mem, deployment version, plugins.
    • Introduce SlotSupplierKind (Fixed/ResourceBased/Custom) and report per-slot supplier kind.
    • Track last successful poll times in PollScaler/LongPollBuffer and feed into heartbeats.
    • Add worker_instance_key() to Worker API; export via heartbeats and tests.
    • Support skipping duplicate client worker checks (skip_client_worker_set_check).
  • Telemetry Metrics (core-api/core):
    • New in-memory metrics layer (HeartbeatMetricType, WorkerHeartbeatMetrics) + *_with_in_memory constructors for Counter/Gauge/HistogramDuration and attribute updates.
    • Wire key SDK metrics to in-memory counters/gauges/histograms for heartbeat reporting; enhance NoOpCoreMeter attributes.
  • Resource/Tuner:
    • RealSysInfo now refreshes on a background thread; expose SystemResourceInfo via TunerBuilder; report CPU/mem.
  • Client/Registry:
    • HeartbeatCallback now Arc; remove re-register callback path; register(worker, skip_check) and test updates.
    • Client fills heartbeat client fields and computes last-interval slot deltas.
  • APIs and Protos:
    • Add RPCs: DescribeWorker, SetWorkerDeploymentManager; wire through raw client and C-bridge.
    • OpenAPI/Protos updated (new fields: allowNoPollers, managerIdentity, workerHeartbeats capability, cluster failover versions, doc tweaks).
  • Config & Tests:
    • WorkerConfig gains plugins and skip_client_worker_set_check.
    • Extensive new integ tests for heartbeats/metrics; enable feature flags in test runner.
    • Minor fix: comment/typo, UUID dep added.
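
As a rough illustration of the in-memory metrics idea and the "last-interval deltas" mentioned above, the sketch below mirrors counter increments into a shared atomic and lets a heartbeat compute the change since its previous snapshot. The names `InMemoryCounter` and `IntervalTracker` are hypothetical and do not come from this PR; the real `*_with_in_memory` constructors and `WorkerHeartbeatMetrics` may be structured quite differently.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical counter that keeps its running total in memory so a
/// heartbeat can read it even when no metrics exporter is configured.
#[derive(Clone, Default)]
pub struct InMemoryCounter {
    total: Arc<AtomicU64>,
}

impl InMemoryCounter {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record `n` events. A real implementation would presumably also
    /// forward the increment to the configured metrics backend here.
    pub fn add(&self, n: u64) {
        self.total.fetch_add(n, Ordering::Relaxed);
    }

    pub fn total(&self) -> u64 {
        self.total.load(Ordering::Relaxed)
    }
}

/// Hypothetical heartbeat-side helper: remembers the total seen at the
/// previous snapshot and reports the difference for the last interval.
pub struct IntervalTracker {
    counter: InMemoryCounter,
    last_seen: u64,
}

impl IntervalTracker {
    pub fn new(counter: InMemoryCounter) -> Self {
        Self { counter, last_seen: 0 }
    }

    /// Returns `(total, delta_since_previous_snapshot)`.
    pub fn snapshot(&mut self) -> (u64, u64) {
        let total = self.counter.total();
        let delta = total.saturating_sub(self.last_seen);
        self.last_seen = total;
        (total, delta)
    }
}
```

In this sketch the worker code holds a clone of the counter and calls `add`, while whatever builds the heartbeat owns the `IntervalTracker` and calls `snapshot` once per heartbeat.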

Written by Cursor Bugbot for commit 8f1ef5a. This will update automatically on new commits.

Comment thread core-api/src/telemetry/metrics.rs Outdated
Comment thread core-api/src/telemetry/metrics.rs
Comment thread core-api/src/telemetry/metrics.rs Outdated
Comment thread core-c-bridge/src/metric.rs Outdated
Comment thread core/src/pollers/poll_buffer.rs Outdated
Comment thread core/src/telemetry/metrics.rs Outdated
Comment thread core/src/worker/client.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
@yuandrew yuandrew marked this pull request as ready for review October 5, 2025 03:34
@yuandrew yuandrew requested a review from a team as a code owner October 5, 2025 03:34
cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

Comment thread client/src/lib.rs Outdated
Comment thread core-api/src/telemetry/metrics.rs Outdated
Comment thread core-api/src/telemetry/metrics.rs Outdated
Comment thread core-api/src/telemetry/metrics.rs Outdated
impl WorkerHeartbeatMetrics {
    pub fn get_metric(&self, name: &str) -> Option<HeartbeatMetricType> {
        match name {
            "sticky_cache_size" => Some(HeartbeatMetricType::Individual(
Member

All these names are duplicative of some existing metric name in core/src/telemetry/metrics.rs, which is calling through into this anyway. Rather than having all the fields be pub, I think it'd be safer to make getters for each, individually, that do the same thing this match does.

Contributor Author

> that do the same thing this match does.

Not sure I follow this part. Happy to make getters so all fields aren't pub, but it sounds like you're suggesting a change to the match itself? We need some way to map the str name to the struct field. Are you suggesting moving this whole fn over to MetricParameters?
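
One possible reading of the suggestion in this thread, sketched under assumptions: keep the fields private, expose one getter per metric, and let the name-based lookup delegate to those getters so the string-to-field mapping still exists in exactly one place. The types below are simplified stand-ins; the payload of `HeartbeatMetricType::Individual`, the `sticky_cache_miss` field, and the `record_sticky_cache_size` and `value` helpers are illustrative and not taken from the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Simplified stand-in for the real enum; the payload type is an assumption.
#[derive(Clone, Debug)]
pub enum HeartbeatMetricType {
    Individual(Arc<AtomicU64>),
}

impl HeartbeatMetricType {
    /// Read the current value (illustrative helper).
    pub fn value(&self) -> u64 {
        match self {
            Self::Individual(v) => v.load(Ordering::Relaxed),
        }
    }
}

/// Fields stay private, so other modules can only reach them via getters.
#[derive(Default)]
pub struct WorkerHeartbeatMetrics {
    sticky_cache_size: Arc<AtomicU64>,
    sticky_cache_miss: Arc<AtomicU64>, // illustrative second metric
}

impl WorkerHeartbeatMetrics {
    pub fn sticky_cache_size(&self) -> HeartbeatMetricType {
        HeartbeatMetricType::Individual(Arc::clone(&self.sticky_cache_size))
    }

    pub fn sticky_cache_miss(&self) -> HeartbeatMetricType {
        HeartbeatMetricType::Individual(Arc::clone(&self.sticky_cache_miss))
    }

    /// Illustrative setter used only to exercise the sketch.
    pub fn record_sticky_cache_size(&self, n: u64) {
        self.sticky_cache_size.store(n, Ordering::Relaxed);
    }

    /// Name-based lookup that delegates to the getters above.
    pub fn get_metric(&self, name: &str) -> Option<HeartbeatMetricType> {
        match name {
            "sticky_cache_size" => Some(self.sticky_cache_size()),
            "sticky_cache_miss" => Some(self.sticky_cache_miss()),
            _ => None,
        }
    }
}
```

Whether this matches what the reviewer had in mind is not settled in the thread; the sketch only shows that getters and a name lookup can coexist without public fields.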

Comment thread core/src/worker/tuner/resource_based.rs
Comment thread core/src/worker/tuner/resource_based.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
Comment thread core/src/worker/mod.rs Outdated
Comment thread tests/integ_tests/worker_heartbeat_tests.rs
Comment thread core/src/worker/mod.rs Outdated
Comment thread client/src/worker_registry/mod.rs Outdated
Member

@Sushisource Sushisource left a comment

Looks like some of the tests are failing too

Comment thread core-api/src/worker.rs
Comment thread core/src/worker/mod.rs
Comment thread core/src/worker/mod.rs
@yuandrew yuandrew force-pushed the worker-heartbeat-telemetry branch from 398b5ce to 33fd78d Compare October 14, 2025 23:24
Comment thread core/src/worker/client.rs Outdated
@yuandrew yuandrew force-pushed the worker-heartbeat-telemetry branch from 33fd78d to f1a3634 Compare October 14, 2025 23:53
Comment thread core/src/worker/client.rs Outdated
Comment thread core/src/worker/client.rs Outdated
Comment thread core/src/worker/client.rs Outdated
Comment thread tests/integ_tests/worker_heartbeat_tests.rs Outdated
    ..Default::default()
}),
worker.register_wf(wf_name.to_string(), move |ctx: WfContext| async move {
    COUNT.store(COUNT.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
Member

This is just a slightly broken version of fetch_add 😅
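
To spell out the difference the comment points at: a separate `load` followed by `store` is two atomic operations, so another thread can increment in between and have its update overwritten, whereas `fetch_add` performs the read-modify-write as one step. The helper names below (`bump_load_store`, `bump_fetch_add`, `run`) are made up for this sketch, and whether the race could actually occur in the test above depends on how many tasks touch `COUNT` concurrently.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// The pattern from the test: read, add one, write back. Two separate
/// atomic operations, so concurrent increments can be lost.
pub fn bump_load_store(counter: &AtomicUsize) {
    counter.store(counter.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
}

/// A single atomic read-modify-write; no increment can be lost.
pub fn bump_fetch_add(counter: &AtomicUsize) {
    counter.fetch_add(1, Ordering::Relaxed);
}

/// Apply `bump` `per_thread` times on each of `threads` threads and
/// return the final counter value.
pub fn run(threads: usize, per_thread: usize, bump: fn(&AtomicUsize)) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    bump(&counter);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    counter.load(Ordering::Relaxed)
}
```

With `bump_fetch_add` the result always equals `threads * per_thread`. With `bump_load_store` it can come out lower, by an amount that varies from run to run.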

Member

@Sushisource Sushisource left a comment

🎉

@yuandrew yuandrew merged commit 4bc7383 into temporalio:worker-heartbeat Oct 17, 2025
17 checks passed
yuandrew added a commit that referenced this pull request Oct 20, 2025
…nt level (#1038)

* Runtime/namespace/client wide worker heartbeat (#983)

* worker heartbeat

* Address Spencer's comments

* wip use client_identity_override as part of key, added test

* Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker

* Verified client replacement works, need to update tests and cleanup

* formatting

* clean up

* forgot to remove new() now that using builder pattern

* Switch to worker_set_key

* Replace client test passes, need to write unit tests in worker_registry

* cargo test-lint

* limit nexus to 1 poller, add tests for worker_registry for heartbeat

* PR comments

* new test helper

* Return error on multi worker register for same namespace and task queue on same client

* cargo fmt

* Fix registration order, unique task queue for test worker

* Remove TEST_Q variable

* Missing quotes

* CI lint and docker test fix, rename worker_set_key to worker_grouping_key

* clippy bug

* Worker heartbeat: New in-memory metrics mechanism, plumb rest of heartbeat data (#1023)

* plumb in memory metrics

* simplify worker::new(), fix some heartbeat metrics, new test file

* CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then()

* Support in-mem metrics when metrics aren't configured

* Move sys_info refresh to dedicated thread, use tuner's existing sys info

* Format, AtomicCell

* Fix unit test

* Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added

* Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test

* recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement

* MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests

* More tests, sticky cache miss, plugins

* Formatting, fix skip_client_worker_set_check

* Cursor found a bug

* Lower sleep time, add print for debugging

* more prints

* use semaphores for worker_heartbeat_failure_metrics

* skip_client_worker_set_check for all integ workers

* Can't use tokio semaphore in workflow code

* use signal to test workflow_slots.last_interval_failure_tasks

* Use Notify instead of semaphores, fix test flake

* Use eventually() instead of a manual sleep

* max_outstanding_workflow_tasks 2

* merge

* Forgot to commit format fixes

* Fix test
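
Two of the commit notes above ("Move sys_info refresh to dedicated thread" and "recv_timeout instead of thread::sleep") describe a common pattern that can be sketched as follows. This is not the PR's code: `Refresher` and its `sample` closure are hypothetical, and the real `RealSysInfo` may differ. The point of waiting with `recv_timeout` instead of `thread::sleep` is that the thread wakes immediately when asked to stop, rather than finishing its sleep first.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical background refresher: calls `sample` every `interval`
/// on a dedicated thread and publishes the latest result.
pub struct Refresher {
    latest: Arc<AtomicU64>,
    stop_tx: Option<Sender<()>>,
    handle: Option<JoinHandle<()>>,
}

impl Refresher {
    pub fn spawn(interval: Duration, sample: impl Fn() -> u64 + Send + 'static) -> Self {
        // Take one sample up front so `latest` is never uninitialized.
        let latest = Arc::new(AtomicU64::new(sample()));
        let published = Arc::clone(&latest);
        let (stop_tx, stop_rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || loop {
            // Wait out the interval, but wake at once on a stop signal.
            match stop_rx.recv_timeout(interval) {
                Err(RecvTimeoutError::Timeout) => {
                    published.store(sample(), Ordering::Relaxed)
                }
                // Explicit stop message, or the sender was dropped.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        });
        Self { latest, stop_tx: Some(stop_tx), handle: Some(handle) }
    }

    /// Most recently published sample.
    pub fn latest(&self) -> u64 {
        self.latest.load(Ordering::Relaxed)
    }
}

impl Drop for Refresher {
    fn drop(&mut self) {
        // Dropping the sender disconnects the channel, ending the loop.
        self.stop_tx.take();
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

Readers call `latest()` without blocking on the sampling work, and dropping the `Refresher` shuts the thread down promptly even when the interval is long.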