[Iris] Remove legacy monolithic Heartbeat RPC#5092
Conversation
Iris has migrated to a split-heartbeat model (Ping + StartTasks + StopTasks + PollTasks). This removes the legacy Heartbeat RPC, the HeartbeatRequest/HeartbeatResponse messages, the use_split_heartbeat config flag, the HeartbeatAction enum, the DispatchBatch dataclass, and all associated controller/worker code paths (begin_heartbeat / complete_heartbeat / fail_heartbeat / drain_dispatch / drain_dispatch_all / record_heartbeat_failure / buffer_dispatch / buffer_kill). Tests rewritten against the new RPC surface where behaviour still applies; tests that asserted now-deleted semantics (blocking tasks_to_kill, provider.sync dry-run spy) are removed. Fixes #4692
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a71f663aea
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| import asyncio | ||
| import logging | ||
| import threading | ||
| from dataclasses import dataclass | ||
| from time import monotonic | ||
| from typing import Protocol |
There was a problem hiding this comment.
Re-map controller heartbeat chaos keys to split RPCs
Removing the legacy heartbeat implementation also removed all call sites for chaos("controller.heartbeat") and chaos("controller.heartbeat.iteration"), so enable_chaos for those keys is now a no-op. The E2E chaos suite still uses these keys (for example in lib/iris/tests/e2e/test_chaos.py at lines 123, 188, 241, and 339) to force controller-side dispatch/poll failures; after this change those scenarios stop exercising failure handling and assertions that depend on induced failures can no longer hold. Please either map these chaos keys onto the new Ping/StartTasks/StopTasks/PollTasks paths or update the keys/tests in the same change.
Useful? React with 👍 / 👎.
| import logging | ||
| import time | ||
| from typing import Protocol | ||
|
|
There was a problem hiding this comment.
Re-map worker heartbeat chaos key to active RPC handlers
The commit removes the only chaos("worker.heartbeat") injection point along with the heartbeat RPC handler, which makes that chaos key inert. Existing E2E coverage still relies on this key (for example lib/iris/tests/e2e/test_chaos.py lines 206, 215, and 233) to simulate transient and permanent worker-side communication faults, so those tests and operational fault-injection workflows no longer validate the intended failure paths. The worker chaos hook should be preserved on an equivalent split RPC path (e.g., Ping/PollTasks) or migrated together with the tests.
Useful? React with 👍 / 👎.
The previous commit removed controller.heartbeat / worker.heartbeat /
controller.heartbeat.iteration chaos keys along with the Heartbeat RPC,
silently turning the e2e chaos suite into dead assertions. Re-add hooks
on the equivalent split RPCs (controller.{ping,start_tasks,stop_tasks,
poll_tasks,poll_iteration}, worker.ping) and point the tests at the new
keys. Threshold-driven tests now target controller.ping, dispatch-fault
tests target controller.start_tasks, and the 128-task race test targets
controller.poll_iteration.
yonromai
left a comment
There was a problem hiding this comment.
Code Change Walkthrough
0. Relevant Context
Context
This PR finishes the cleanup after Iris moved to the split worker lifecycle model: Ping, StartTasks, StopTasks, PollTasks, and worker-pushed UpdateTaskStatus. The removed code is the old monolithic Heartbeat RPC path and the controller/provider plumbing that only existed to support it.
Issue: #4692
1. Problem Statement
The repo still carried a second worker-control path that was already obsolete in practice:
- legacy
Heartbeatrequest/response protos and Connect endpoint - controller
use_split_heartbeatconfig plumbing - worker-provider sync code built around
DispatchBatch - tests and docs that still described the monolithic heartbeat flow
That extra surface area made the state machine harder to reason about and left dead config and protocol paths around.
2. Assumptions And Solution Summary
Assumption: supported Iris clusters already run on the split lifecycle RPCs, so the monolithic heartbeat path is safe to remove rather than preserve behind compatibility flags.
The change deletes the legacy RPC surface, collapses worker-backed controller execution onto the split path only, rewrites the worker/provider/controller implementations around that contract, and updates the remaining docs/tests/examples to match.
3. Design & Architecture
- Worker-backed providers now use direct RPC helpers for
Ping,StartTasks,StopTasks, andPollTasks; the controller no longer maintains a separate workersync()path. - The controller’s non-K8s execution model is now explicit: scheduling loop, ping loop, and batched task-update loop.
- Worker reconciliation stays split:
StartTaskslaunches work,PollTasksreconciles expected tasks, andUpdateTaskStatuspushes state transitions back to the controller. - K8s direct-provider code is mostly unchanged; the PR mainly removes stale shared terminology from that path.
4. Major Changes
- Removed legacy heartbeat protos, generated bindings, and worker service endpoints.
- Removed the
use_split_heartbeatcontroller config flag and its call sites. - Simplified
WorkerProviderand controller transitions by deletingDispatchBatch/drain/requeue legacy machinery. - Rewrote affected tests to exercise the split RPC surface directly.
- Updated Iris docs/examples to stop referring to the monolithic heartbeat path.
5. Testing
cd lib/iris && uv run --group dev python -m pytest -n1 --tb=short tests/cluster/worker/test_worker.py tests/cluster/controller/test_service.pycd lib/iris && uv run --no-sync --group dev python -m pytest -n1 --tb=short tests/cluster/controller/test_transitions.py -k 'task_update_worker_failed_cascades_children or max_failures_kills_direct_provider_tasks'cd lib/iris && uv run --no-sync --group dev python -m pytest -n1 --tb=short tests/cluster/controller/test_reservation.py -k 'holder_task_worker_death_no_failure_record or get_running_tasks_for_poll_excludes_reservation_holders'
Code Review
Comments
No inline comments.
Review Summary
🤖 Approved.
No issues found. Checked for bugs and AGENTS.md compliance.
Validation:
- targeted worker and controller service tests passed
- targeted transition and reservation regression tests passed
Generated with Codex

Iris has migrated to a split-heartbeat model (Ping + StartTasks + StopTasks + PollTasks), so the legacy Heartbeat RPC, HeartbeatRequest/Response, use_split_heartbeat flag, HeartbeatAction enum, DispatchBatch, and all begin/complete/fail_heartbeat + drain_dispatch paths are now dead weight. This removes them and rewrites the few tests that still covered behaviour against the new RPC surface.
Fixes #4692