Worker heartbeat: New in-memory metrics mechanism, plumb rest of heartbeat data #1023
impl WorkerHeartbeatMetrics {
    pub fn get_metric(&self, name: &str) -> Option<HeartbeatMetricType> {
        match name {
            "sticky_cache_size" => Some(HeartbeatMetricType::Individual(
All these names are duplicative of some existing metric name in core/src/telemetry/metrics.rs, which is calling through into this anyway. Rather than having all the fields be pub, I think it'd be safer to make getters for each, individually, that do the same thing this match does.
> that do the same thing this match does.
Not sure I follow this part, happy to make getters so all fields aren't pub, but it sounds like you're suggesting a change to the match itself? We need some way to map the str name to the struct field. Are you suggesting moving this whole fn over to MetricParameters?
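One way to read the suggestion above, as a rough sketch only: keep the fields private, add one getter per field, and let the name-based `match` call those getters so the string-to-field mapping lives in a single place. The `Arc<AtomicU64>` payload, the second field, and the `"sticky_cache_hit"` name are assumptions made for illustration; the real `HeartbeatMetricType` and `WorkerHeartbeatMetrics` in the PR may look different.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Simplified stand-in for the PR's `HeartbeatMetricType` (assumed shape).
#[derive(Clone, Debug)]
pub enum HeartbeatMetricType {
    Individual(Arc<AtomicU64>),
}

/// Fields are private; callers go through the getters below.
#[derive(Default)]
pub struct WorkerHeartbeatMetrics {
    sticky_cache_size: Arc<AtomicU64>,
    // Hypothetical second field, included only to show the pattern.
    sticky_cache_hit: Arc<AtomicU64>,
}

impl WorkerHeartbeatMetrics {
    /// Getter for a single field; hands out a shared handle to the atomic.
    pub fn sticky_cache_size(&self) -> HeartbeatMetricType {
        HeartbeatMetricType::Individual(Arc::clone(&self.sticky_cache_size))
    }

    /// Getter for the hypothetical second field.
    pub fn sticky_cache_hit(&self) -> HeartbeatMetricType {
        HeartbeatMetricType::Individual(Arc::clone(&self.sticky_cache_hit))
    }

    /// The string lookup delegates to the getters, so the name-to-field
    /// mapping still exists but no field needs to be `pub`.
    pub fn get_metric(&self, name: &str) -> Option<HeartbeatMetricType> {
        match name {
            "sticky_cache_size" => Some(self.sticky_cache_size()),
            "sticky_cache_hit" => Some(self.sticky_cache_hit()),
            _ => None,
        }
    }
}

/// Small usage helper: write through the name lookup, read through the getter.
pub fn demo_lookup() -> (u64, bool) {
    let metrics = WorkerHeartbeatMetrics::default();
    if let Some(HeartbeatMetricType::Individual(v)) = metrics.get_metric("sticky_cache_size") {
        v.store(7, Ordering::Relaxed);
    }
    let HeartbeatMetricType::Individual(v) = metrics.sticky_cache_size();
    (
        v.load(Ordering::Relaxed),
        metrics.get_metric("unknown").is_none(),
    )
}
```

Whether the lookup stays on this struct or moves elsewhere is the open question in the thread; the sketch does not settle it.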
Sushisource
left a comment
Looks like some of the tests are failing too
    ..Default::default()
}),
worker.register_wf(wf_name.to_string(), move |ctx: WfContext| async move {
    COUNT.store(COUNT.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
This is just a slightly broken version of fetch_add 😅
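To illustrate the point: a separate `load` followed by `store` is two atomic operations, so another thread can increment in between and its update gets overwritten. `fetch_add` does the read-modify-write as one atomic step. The sketch below uses a local `Arc<AtomicUsize>` rather than the test's `COUNT` static, and the function names are made up for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Increments with a single atomic read-modify-write per step.
/// No increments can be lost, regardless of thread interleaving.
pub fn count_with_fetch_add(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

/// Same loop using the load-then-store pattern from the diff.
/// Two threads can read the same value and both store `value + 1`,
/// so the result may be smaller than `threads * per_thread`.
pub fn count_with_load_store(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    let seen = counter.load(Ordering::Relaxed);
                    counter.store(seen + 1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

In the test under review the fix would presumably be a one-liner along the lines of `COUNT.fetch_add(1, Ordering::Relaxed);`.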
…nt level (#1038)
* Runtime/namespace/client wide worker heartbeat (#983)
* worker heartbeat
* Address Spencer's comments
* wip use client_identity_override as part of key, added test
* Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker
* Verified client replacement works, need to update tests and cleanup
* formatting
* clean up
* forgot to remove new() now that using builder pattern
* Switch to worker_set_key
* Replace client test passes, need to write unit tests in worker_registry
* cargo test-lint
* limit nexus to 1 poller, add tests for worker_registry for heartbeat
* PR comments
* new test helper
* Return error on multi worker register for same namespace and task queue on same client
* cargo fmt
* Fix registration order, unique task queue for test worker
* Remove TEST_Q variable
* Missing quotes
* CI lint and docker test fix, rename worker_set_key to worker_grouping_key
* clippy bug
* Worker heartbeat: New in-memory metrics mechanism, plumb rest of heartbeat data (#1023)
* plumb in memory metrics
* simplify worker::new(), fix some heartbeat metrics, new test file
* CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then()
* Support in-mem metrics when metrics aren't configured
* Move sys_info refresh to dedicated thread, use tuner's existing sys info
* Format, AtomicCell
* Fix unit test
* Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added
* Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test
* recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement
* MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests
* More tests, sticky cache miss, plugins
* Formatting, fix skip_client_worker_set_check
* Cursor found a bug
* Lower sleep time, add print for debugging
* more prints
* use semaphores for worker_heartbeat_failure_metrics
* skip_client_worker_set_check for all integ workers
* Can't use tokio semaphore in workflow code
* use signal to test workflow_slots.last_interval_failure_tasks
* Use Notify instead of semaphores, fix test flake
* Use eventually() instead of a manual sleep
* max_outstanding_workflow_tasks 2
* merge
* Forgot to commit format fixes
* Fix test
What was changed
NOTE: targeting worker-heartbeat feature branch.
New in-memory mechanism to keep track of certain metrics for worker heartbeating.
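As a rough illustration of the idea, and not the PR's actual types: a counter can forward each update to an optional metrics exporter while also bumping a shared atomic that heartbeat code reads later. `CounterSink` and `CounterWithInMemory` are hypothetical names invented for this sketch; the PR's real constructors (the commit list mentions `counter_with_in_mem`) may be shaped quite differently.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical exporter interface standing in for a real metrics backend.
pub trait CounterSink: Send + Sync {
    fn add(&self, value: u64);
}

/// Counter that records to an optional exporter and always mirrors the
/// running total into an in-memory atomic.
pub struct CounterWithInMemory {
    sink: Option<Arc<dyn CounterSink>>,
    in_memory: Arc<AtomicU64>,
}

impl CounterWithInMemory {
    pub fn new(sink: Option<Arc<dyn CounterSink>>, in_memory: Arc<AtomicU64>) -> Self {
        Self { sink, in_memory }
    }

    pub fn add(&self, value: u64) {
        // Forward to the exporter only if one is configured.
        if let Some(sink) = &self.sink {
            sink.add(value);
        }
        // The in-memory total is kept either way, so a heartbeat can report
        // it even when no metrics exporter is configured.
        self.in_memory.fetch_add(value, Ordering::Relaxed);
    }
}

/// Usage helper: no exporter configured, yet the shared total is still readable.
pub fn demo_in_memory_counter() -> u64 {
    let shared = Arc::new(AtomicU64::new(0));
    let counter = CounterWithInMemory::new(None, Arc::clone(&shared));
    counter.add(2);
    counter.add(3);
    shared.load(Ordering::Relaxed)
}
```

Sharing the atomic through an `Arc` lets the heartbeat side read the value without holding a reference to the counter itself.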
Why?
Checklist
Closes
How was this tested:
Note
Adds in‑memory metric tracking and plumbs full worker heartbeat data (pollers, slots, cpu/mem, plugins), plus new DescribeWorker/SetWorkerDeploymentManager APIs and supporting refactors.
* … SlotSupplierKind (Fixed/ResourceBased/Custom) and report per-slot supplier kind.
* … PollScaler/LongPollBuffer and feed into heartbeats.
* … worker_instance_key() to Worker API; export via heartbeats and tests.
* … skip_client_worker_set_check).
* … HeartbeatMetricType, WorkerHeartbeatMetrics) + *_with_in_memory constructors for Counter/Gauge/HistogramDuration and attribute updates.
* … NoOp CoreMeter attributes.
* RealSysInfo now refreshes on a background thread; expose SystemResourceInfo via TunerBuilder; report CPU/mem.
* HeartbeatCallback now Arc; remove re-register callback path; register(worker, skip_check) and test updates.
* … DescribeWorker, SetWorkerDeploymentManager; wire through raw client and C-bridge.
* WorkerConfig gains plugins and skip_client_worker_set_check.
Written by Cursor Bugbot for commit 8f1ef5a.