[iris] Replace monolithic heartbeat with focused Ping/StartTasks/StopTasks/PollTasks RPCs#4638
Add Ping, StartTasks, StopTasks, PollTasks to WorkerService and UpdateTaskStatus to ControllerService. Worker handlers extracted from handle_heartbeat(). Existing heartbeat unchanged — zero behavior change.
When use_split_heartbeat=True, run additional Ping and Poll loops alongside the existing provider loop. Ping loop detects worker failures faster (10s interval, 10 failure threshold). Poll loop reconciles task state every 30s as a safety net. Existing provider loop unchanged.
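The failure-threshold behaviour described above can be sketched as a small counter. This is only an illustration: `PingTracker` and its method names are hypothetical, and treating the threshold as 10 *consecutive* failures (reset by any successful ping) is an assumption, since the commit message only says "10 failure threshold".

```python
from __future__ import annotations

from dataclasses import dataclass, field

PING_INTERVAL_S = 10    # interval quoted in the commit message
FAILURE_THRESHOLD = 10  # failures before a worker is treated as failed


@dataclass
class PingTracker:
    """Hypothetical helper: counts consecutive ping failures per worker."""

    threshold: int = FAILURE_THRESHOLD
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, worker_id: str, ok: bool) -> bool:
        """Record one ping result.

        Returns True exactly once, when the worker reaches the threshold.
        """
        if ok:
            # Assumption: a successful ping resets the failure count.
            self.failures.pop(worker_id, None)
            return False
        count = self.failures.get(worker_id, 0) + 1
        self.failures[worker_id] = count
        return count == self.threshold
```

With these numbers and the consecutive-failure assumption, an unresponsive worker would be flagged after roughly 10 × 10s = 100s of failed pings.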
…ase 2) When use_split_heartbeat=True, scheduling sends StartTasks RPCs directly after committing assignments (no dispatch_queue intermediary). Kills send StopTasks directly. Task updater thread drains a queue every 1s for batched state transitions. Provider loop replaced by ping + poll + updater.
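A minimal sketch of the "drain a queue every 1s" step, assuming the updater uses a standard `queue.Queue`. The function name `drain_batch` and the `max_items` cap are hypothetical and not taken from the PR.

```python
import queue


def drain_batch(q: queue.Queue, max_items: int = 1000) -> list:
    """Remove and return everything currently queued, without blocking.

    max_items is a hypothetical cap so one tick cannot grow unbounded.
    """
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch
```

An updater thread would call this once per tick and apply the returned transitions together, which is what makes the state updates batched rather than one write per event.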
Workers now push task state transitions to the controller via UpdateTaskStatus RPC in batches. TaskStateReporter thread on the worker watches TaskAttempt state changes via on_state_change callback and enqueues updates with bounded queue + exponential backoff retry. Controller handler applies updates through existing transitions path.
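The "bounded queue + exponential backoff retry" combination might look roughly like the sketch below. All names here (`BoundedUpdateQueue`, `backoff_delay`, `send_with_retry`) are hypothetical, the drop-oldest policy when the queue is full is an assumption, and the base and cap delays are illustrative values rather than the PR's.

```python
from __future__ import annotations

import collections
import threading
import time
from typing import Callable


class BoundedUpdateQueue:
    """Hypothetical bounded buffer; assumes the oldest update is dropped when full."""

    def __init__(self, maxlen: int) -> None:
        self._items: collections.deque = collections.deque(maxlen=maxlen)
        self._lock = threading.Lock()

    def put(self, update: object) -> None:
        with self._lock:
            # deque(maxlen=...) discards from the left once it is full.
            self._items.append(update)

    def take_batch(self) -> list:
        with self._lock:
            batch = list(self._items)
            self._items.clear()
        return batch


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Delay before retry number `attempt` (0-based): doubles each time, capped."""
    return min(cap_s, base_s * (2 ** attempt))


def send_with_retry(
    send: Callable[[list], None],
    batch: list,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send one batch; return True on success, False once attempts run out."""
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except Exception:  # sketch only: a real client would catch RPC errors specifically
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    return False
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.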
Split heartbeat mode is now the only path for WorkerProvider.

Removed:
- _run_provider_loop, _sync_all_execution_units, _buffer_assignments
- drain_dispatch_all, buffer_kill, fail_heartbeats_batch
- WorkerProvider.sync, _heartbeat_one
- Worker.handle_heartbeat (old monolithic handler)
- Heartbeat RPC from WorkerService
- 19 old heartbeat-specific tests

Kept dispatch_queue table (still used by K8s direct provider). -1790 lines of old heartbeat machinery.
Specification for this >500 LOC change:

PROBLEM
The controller multiplexed worker liveness, task dispatch, task state collection,

APPROACH
Five new RPCs replace the monolithic heartbeat:
Four controller threads replace the single provider loop:
The dispatch_queue table is no longer written to (kept for K8s provider).

KEY CODE
Assignment dispatch (controller.py _dispatch_assignments_direct):
Worker push (worker.py TaskStateReporter):
Ping loop (controller.py _run_ping_loop):
TESTS
1861 unit tests pass. 19 old heartbeat-specific tests removed (tested removed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d9b010b5c
    def _stop_state_reporter(self) -> None:
        """Stop the TaskStateReporter thread."""
        self._state_reporter = None
Stop reporter thread between worker lifecycle iterations
_start_state_reporter() creates a new managed thread on every register/serve cycle, but _stop_state_reporter() only nulls the reference and never stops that thread. When _serve() exits on heartbeat timeout and the worker re-registers, the previous reporter keeps running, so reconnect flaps accumulate background threads and retry loops indefinitely. This can leak resources and eventually destabilize long-lived workers.
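One possible shape for the fix is sketched below, assuming the reporter can be signalled through a `threading.Event` and joined. `StoppableReporter` and `WorkerSketch` are hypothetical stand-ins; the real worker uses its own managed-thread machinery, which is not shown in this review.

```python
from __future__ import annotations

import threading


class StoppableReporter:
    """Hypothetical stand-in for TaskStateReporter with an explicit stop()."""

    def __init__(self, interval_s: float = 1.0) -> None:
        self._stop = threading.Event()
        self._interval_s = interval_s
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns True once stop() is called, ending the loop.
        while not self._stop.wait(self._interval_s):
            pass  # a real reporter would flush queued updates here

    def stop(self, timeout_s: float = 5.0) -> None:
        self._stop.set()
        self._thread.join(timeout_s)

    def is_alive(self) -> bool:
        return self._thread.is_alive()


class WorkerSketch:
    """Hypothetical worker showing only the reporter lifecycle."""

    def __init__(self) -> None:
        self._state_reporter: StoppableReporter | None = None

    def _start_state_reporter(self) -> None:
        self._stop_state_reporter()  # never leave an old reporter running
        self._state_reporter = StoppableReporter(interval_s=0.01)
        self._state_reporter.start()

    def _stop_state_reporter(self) -> None:
        reporter, self._state_reporter = self._state_reporter, None
        if reporter is not None:
            reporter.stop()
```

Stopping the old reporter before dropping the reference means each register/serve cycle ends with at most one live reporter thread.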
    running, addresses = self._transitions.get_running_tasks_for_poll()
    if not running:
        return
Poll every active worker to reconcile unexpected tasks
This polling path only runs for workers that already have controller-known active tasks, so workers with zero expected tasks are skipped entirely. In that case, if a worker is running orphan/unexpected tasks (for example after a failed stop or controller state divergence), it never receives PollTasks(expected=[]), so its handle_poll_tasks cleanup path is never triggered and stray tasks can continue running indefinitely.
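The suggestion amounts to building the poll set from all active workers rather than from workers with known running tasks. A minimal sketch follows, with hypothetical function names and plain dictionaries in place of the controller's real data structures:

```python
from __future__ import annotations


def build_poll_targets(
    active_workers: list[str],
    running_by_worker: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Map every active worker to its expected task IDs.

    Workers with no controller-known tasks get an empty list, so they
    would still receive PollTasks(expected=[]).
    """
    return {w: list(running_by_worker.get(w, [])) for w in active_workers}


def unexpected_tasks(running_on_worker: list[str], expected: list[str]) -> list[str]:
    """Worker side: tasks that are running but that the controller does not expect."""
    return sorted(set(running_on_worker) - set(expected))
```

For example, an idle worker `w2` that is still running a stray `t9` would be polled with an empty expected list, and `unexpected_tasks(["t9"], [])` would identify `t9` for cleanup.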
Split the single HeartbeatRequest/HeartbeatResponse RPC into five focused
RPCs: Worker.Ping (liveness), Worker.StartTasks (immediate dispatch),
Worker.StopTasks (direct kill), Worker.PollTasks (reconciliation), and
Controller.UpdateTaskStatus (worker push).
Scheduling sends StartTasks immediately after committing assignments
instead of buffering in dispatch_queue.