Worker Heartbeat: Plumb metrics and migrate to runtime/namespace/client level#1038
Worker Heartbeat: Plumb metrics and migrate to runtime/namespace/client level#1038yuandrew merged 7 commits intotemporalio:masterfrom
Conversation
* worker heartbeat * Address Spencer's comments * wip use client_identity_override as part of key, added test * Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker * Verified client replacement works, need to update tests and cleanup * formating * clean up * forgot to remove new() now that using builder pattern * Switch to worker_set_key * Replace client test passes, need to write unit tests in worker_registry * cargo test-lint * limit nexus to 1 poller, add tests for worker_registry for heartbeat * PR comments * new test helper * Return error on multi worker register for same namespace and task queue on same client * cargo fmt * Fix registration order, unique task queue for test worker * Remove TEST_Q variable * Missing quotes * CI lint and docker test fix, rename worker_set_key to worker_grouping_key * clippy bug
…eat data (temporalio#1023) * plumb in memory metrics * simplify worker::new(), fix some heartbeat metrics, new test file * CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then() * Support in-mem metrics when metrics aren't configured * Move sys_info refresh to dedicated thread, use tuner's existing sys info * Format, AtomicCell * Fix unit test * Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added * Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test * recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement * MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests * More tests, sticky cache miss, plugins * Formatting, fix skip_client_worker_set_check * Cursor found a bug * Lower sleep time, add print for debugging * more prints * use semaphores for worker_heartbeat_failure_metrics * skip_client_worker_set_check for all integ workers * Can't use tokio semaphore in workflow code * use signal to test workflow_slots.last_interval_failure_tasks * Use Notify instead of semaphores, fix test flake * Use eventually() instead of a manual sleep * max_outstanding_workflow_tasks 2
# Conflicts: # client/src/raw.rs # core-c-bridge/src/client.rs # core/src/lib.rs # core/src/worker/client.rs # core/src/worker/mod.rs # tests/common/mod.rs # tests/integ_tests/polling_tests.rs
Sushisource
left a comment
There was a problem hiding this comment.
Approving since this was reviewed separately
Sushisource
left a comment
There was a problem hiding this comment.
Actually, sorry, one thing I want to double check is fixed in this branch before we merge because I just saw it while testing some metrics changes:
I'm seeing:
temporal_long_request_latency_bucket{namespace="default",operation="PollWorkflowTaskQueue",service_name="temporal-core-sdk",task_queue="integ_tester-4ae9d50f7ae94eb5be295f8086003b03",le="2500"} 1
In metrics output, which is I believe the worker heartbeating task queue name, but, it should not be making any poll workflow task calls. Just want to double-check that's fixed in this branch and we have a test for it.
Sushisource
left a comment
There was a problem hiding this comment.
Nevermind that. That's the sticky queue name and I just forgot that was the convention it followed.
| heartbeat_map, | ||
| namespace, | ||
| cancel, | ||
| }) |
There was a problem hiding this comment.
Bug: Heartbeat Failure Due to Error Suppression
The SharedNamespaceWorker::new function can fail, but its error isn't propagated. This leads to a non-functional heartbeat mechanism and an inconsistent ClientWorkerRegistrator state, where heartbeat capability is indicated without a registered callback.
What was changed
Primarily a combination of #983 and #1023
This also turns on heartbeating by default
Why?
Finish and enable new worker heartbeating feature.
Checklist
Closes
How was this tested:
Added unit and integration tests
Note
Enable runtime-configured worker heartbeating, replace SlotManager with ClientWorkerSet, plumb in‑memory metrics and poller timing, and update client/worker/C-bridge APIs with comprehensive tests.
heartbeat_interval; send periodic heartbeats and final shutdown heartbeat withWorkerStatus.RuntimeOptions{ telemetry_options, heartbeat_interval }and builders; updateCoreRuntime::new*to use it.worker_instance_key(UUID), status, and improvedreplace_client(re-registers, returns Result).max_cached_workflows; remove globalTEST_Qin tests.SlotManagerwithClientWorkerSet(registration, grouping key, shared namespace worker for heartbeats).ClientWorker/SharedNamespaceWorkerTraitand heartbeat callback wiring.WorkerClientAPI:identity()(wasget_identity),workers()returnsClientWorkerSet,worker_grouping_key(),record_worker_heartbeat(namespace, Vec<WorkerHeartbeat>),shutdown_worker(..., final_heartbeat).HeartbeatMetricType) andWorkerHeartbeatMetrics; extendCoreMeterto create instruments with in-memory mirrors.Worker::worker_instance_key(); addPluginInfo; exportuuiddep.TemporalCoreRuntimeOptionsgainsworker_heartbeat_duration_millis.temporal_core_worker_replace_clientreturns error string on failure.ClientWorker,ClientWorkerSet,HeartbeatCallback,SharedNamespaceWorkerTraitand worker-group key accessors.Written by Cursor Bugbot for commit c30e25e. This will update automatically on new commits. Configure here.