[iris] Remove old monolithic heartbeat path (Phase 4) by rjpower · Pull Request #4843 · marin-community/marin

rjpower · 2026-04-16T20:53:32Z

🤖 Phase 4 of the heartbeat refactor, split into its own PR for ease of review.

Stacked on top of #4638 (Phases 0–3). Merge that first.

Summary

Split heartbeat mode is now the only path for `WorkerProvider`. Removed:

- `Controller._run_provider_loop`, `_sync_all_execution_units`, `_buffer_assignments`
- `ControllerTransitions.drain_dispatch_all`, `buffer_kill`, `fail_heartbeats_batch`
- `WorkerProvider.sync`, `_heartbeat_one`, `_apply_request_from_response`, `_heartbeat_rpc_context`
- `Worker.handle_heartbeat` (old monolithic handler)
- `WorkerService.Heartbeat` RPC (proto + generated stubs)
- 19 old heartbeat-specific tests
- `use_split_heartbeat` config flag (split mode is now mandatory)

Kept the `dispatch_queue` table — still used by the K8s direct provider.

Net: -1697 lines of old heartbeat machinery.

Test plan

- `uv run pytest lib/iris/tests/cluster/controller lib/iris/tests/cluster/worker` — 836 passed
- `./infra/pre-commit.py --fix` — clean
- Canary ferry on a dev cluster

Add Ping, StartTasks, StopTasks, PollTasks to WorkerService and UpdateTaskStatus to ControllerService. Worker handlers extracted from handle_heartbeat(). Existing heartbeat unchanged — zero behavior change.

When use_split_heartbeat=True, run additional Ping and Poll loops alongside the existing provider loop. Ping loop detects worker failures faster (10s interval, 10 failure threshold). Poll loop reconciles task state every 30s as a safety net. Existing provider loop unchanged.
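A minimal sketch of how a ping failure threshold like the one described above could be tracked. It assumes the threshold counts consecutive failures and that any successful ping resets the streak; `PingTracker` and its method names are hypothetical and do not come from the Iris codebase.

```python
from dataclasses import dataclass, field

PING_INTERVAL_S = 10.0       # interval quoted in the commit message
PING_FAILURE_THRESHOLD = 10  # threshold quoted in the commit message


@dataclass
class PingTracker:
    """Counts consecutive ping failures per worker (hypothetical helper)."""

    threshold: int = PING_FAILURE_THRESHOLD
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, worker_id: str, ok: bool) -> bool:
        """Record one ping result; return True once the worker should be failed."""
        if ok:
            # Assumption: a single success clears the failure streak.
            self.failures.pop(worker_id, None)
            return False
        count = self.failures.get(worker_id, 0) + 1
        self.failures[worker_id] = count
        return count >= self.threshold


tracker = PingTracker()
results = [tracker.record("worker-a", ok=False) for _ in range(10)]
```

Under these assumptions an unreachable worker would be declared failed on the tenth missed ping, roughly `PING_INTERVAL_S * PING_FAILURE_THRESHOLD` seconds after it stops responding.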

…ase 2) When use_split_heartbeat=True, scheduling sends StartTasks RPCs directly after committing assignments (no dispatch_queue intermediary). Kills send StopTasks directly. Task updater thread drains a queue every 1s for batched state transitions. Provider loop replaced by ping + poll + updater.
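The batched updater described above can be sketched roughly as follows. This is an illustration only: `drain_batch`, `updater_loop`, and the `apply_batch` callback are hypothetical names, and the real task updater thread may drain and apply its queue differently.

```python
import queue
import threading
from typing import Any, Callable


def drain_batch(q: "queue.Queue[Any]", max_items: int = 1000) -> list:
    """Pull everything currently queued, without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def updater_loop(
    q: "queue.Queue[Any]",
    apply_batch: Callable[[list], None],
    stop_event: threading.Event,
    interval_s: float = 1.0,  # the 1s drain interval from the commit message
) -> None:
    """Drain the queue once per interval until stop_event is set."""
    # Event.wait returns True as soon as stop_event is set, so shutdown
    # does not have to wait out the full interval.
    while not stop_event.wait(timeout=interval_s):
        batch = drain_batch(q)
        if batch:
            apply_batch(batch)
```

Applying a whole batch per tick keeps the number of state-transition transactions bounded by the tick rate rather than by the number of updates.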

Workers now push task state transitions to the controller via UpdateTaskStatus RPC in batches. TaskStateReporter thread on the worker watches TaskAttempt state changes via on_state_change callback and enqueues updates with bounded queue + exponential backoff retry. Controller handler applies updates through existing transitions path.
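A rough sketch of a bounded buffer with exponential backoff retry, in the spirit of the reporter described above. `TaskStateBuffer` is a hypothetical stand-in, not the real `TaskStateReporter`; the drop-oldest overflow policy, the delay values, and the use of `ConnectionError` as the retryable failure are all assumptions.

```python
import collections
import time
from typing import Callable


class TaskStateBuffer:
    """Hypothetical bounded buffer of (task_id, state) updates."""

    def __init__(
        self,
        send: Callable[[list], None],
        max_pending: int = 1024,
        base_delay_s: float = 0.5,
        max_delay_s: float = 8.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._send = send
        # Assumption: when the buffer is full, the oldest update is dropped.
        self._pending = collections.deque(maxlen=max_pending)
        self._base_delay_s = base_delay_s
        self._max_delay_s = max_delay_s
        self._sleep = sleep

    def on_state_change(self, task_id: str, state: str) -> None:
        """Callback invoked on each state transition; enqueues one update."""
        self._pending.append((task_id, state))

    def pending(self) -> list:
        return list(self._pending)

    def flush(self, max_attempts: int = 5) -> bool:
        """Send all pending updates as one batch, doubling the delay after each failure."""
        if not self._pending:
            return True
        batch = list(self._pending)
        delay = self._base_delay_s
        for _attempt in range(max_attempts):
            try:
                self._send(batch)
            except ConnectionError:
                self._sleep(delay)
                delay = min(delay * 2, self._max_delay_s)
                continue
            # Only drop updates from the buffer once the send succeeded.
            for _item in batch:
                self._pending.popleft()
            return True
        return False
```

Keeping failed batches in the buffer means a later flush retries them, while the size bound keeps a long controller outage from growing worker memory without limit.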

Split heartbeat mode is now the only path for WorkerProvider. Removed:
- _run_provider_loop, _sync_all_execution_units, _buffer_assignments
- drain_dispatch_all, buffer_kill, fail_heartbeats_batch
- WorkerProvider.sync, _heartbeat_one
- Worker.handle_heartbeat (old monolithic handler)
- Heartbeat RPC from WorkerService
- 19 old heartbeat-specific tests

Kept dispatch_queue table (still used by K8s direct provider). -1790 lines of old heartbeat machinery.

claude · 2026-04-16T20:53:47Z

Claude finished @rjpower's task in 11m 35s

Code review

Reviewed Phase 4 of the heartbeat refactor. Left 2 inline comments on changed lines and flagged 3 additional findings below that couldn't be anchored inline (lines weren't touched by the diff).

Findings

1. Behavioral regression: K8s direct kills no longer wake _run_direct_provider_loop promptly.
Inline: controller.py:2058-2065. Pre-PR, _stop_tasks called self._heartbeat_event.set() after buffer_direct_kill. The PR removed every .set() call, but the loop still blocks on _heartbeat_event.wait(timeout=limiter.time_until_next()). With heartbeat_interval at 5 s, buffered K8s kills sit up to ~5 s before firing. Shutdown is also slower since .wait() no longer listens to stop_event. The construction comment at L1106 is now stale.

2. assert downgraded to silent early return.
Inline: controller.py:2132-2133. _stop_tasks_direct was assert isinstance(self._provider, WorkerProvider); now it silently returns. Same pattern added in _dispatch_assignments_direct at L2079-2080, where queue_assignments has already committed task-attempt rows before the silent return. Violates AGENTS.md: "Assert liberally; prefer raise ValueError over silent fallbacks."

3. Broken call to deleted method in dev benchmark script.
lib/iris/scripts/benchmark_db_queries.py:540-543 still calls transitions.drain_dispatch_all(), which this PR deleted. Running benchmark_db_queries.py --only heartbeat will crash with AttributeError. Not in CI, so not a ship-blocker, but will silently rot. Recommend deleting lines 539-543.

4. Dead code surviving from the removed monolithic path.

- TxResult.has_real_dispatch (field): written by queue_assignments, never read. Previously consumed by deleted _buffer_assignments.
- HeartbeatAction.TRANSIENT_FAILURE and HeartbeatAction.WORKER_FAILED: only OK is now emitted anywhere.
- HeartbeatApplyResult.action: always OK, no caller reads it.

Per AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments."

5. Stale comments referencing removed symbols.

- transitions.py:2615-2617: # Split Heartbeat Helpers banner (the distinction no longer exists).
- transitions.py:266: DirectProviderBatch docstring mentions deleted DispatchBatch.
- providers/k8s/tasks.py:1047: same DispatchBatch reference.
- lib/iris/docs/task-states.md:120: references deleted complete_heartbeat.
- test_transitions.py:2180: comment mentions deleted fail_heartbeat_for_worker.

Per AGENTS.md: "Delete stale comments immediately on discovery."

Also confirmed (not issues): the new is_reservation_holder = 0 JOIN filter in get_running_tasks_for_poll correctly preserves the pre-PR behavior of excluding holder tasks — and is covered by the new test_holder_tasks_excluded_from_poll_running_tasks test. The fail_workers_batch reimplementation is semantically equivalent to the old fail_heartbeats_batch(force_remove=True) path.
• Branch: rjpower/heartbeat-refactor-phase4

chatgpt-codex-connector

💡 Codex Review

marin/lib/iris/src/iris/cluster/controller/controller.py

Lines 1267 to 1269 in 71efe6f

    
        if self._heartbeat_thread:
            self._heartbeat_thread.stop()
            self._heartbeat_thread.join(timeout=join_timeout)

Signal heartbeat event before joining provider loop

In the K8s direct-provider path, _run_direct_provider_loop() waits on _heartbeat_event with a timeout of heartbeat_interval, but stop() now joins _heartbeat_thread without waking that event first. If heartbeat_interval is configured above the 5s join timeout, shutdown can continue while the provider loop is still sleeping, and it may later wake and call _sync_direct_provider() after DB/provider teardown. Please wake _heartbeat_event (or make the loop wait on stop_event) before joining.
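The ordering suggested above (wake the event, then join) can be illustrated with a self-contained sketch. `ProviderLoop` and its attribute names are hypothetical and only mimic the shape of the loop described in this comment; they are not the controller's actual code.

```python
import threading


class ProviderLoop:
    """Hypothetical loop that sleeps on a wake event between sync ticks."""

    def __init__(self, interval_s: float) -> None:
        self._interval_s = interval_s
        self.wake_event = threading.Event()
        self.stop_event = threading.Event()
        self.sync_count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while not self.stop_event.is_set():
            self.wake_event.wait(timeout=self._interval_s)
            self.wake_event.clear()
            if self.stop_event.is_set():
                # Re-check after waking so no sync runs during teardown.
                break
            self.sync_count += 1

    def stop(self, join_timeout: float = 5.0) -> bool:
        """Signal stop, wake the sleeper, then join. Returns True if the thread exited."""
        self.stop_event.set()
        self.wake_event.set()  # without this, join may time out while the loop sleeps
        self._thread.join(timeout=join_timeout)
        return not self._thread.is_alive()
```

With the wake in place, `stop()` returns promptly even when the interval is much longer than the join timeout, and the re-check after waking keeps a late sync from running after teardown.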


chatgpt-codex-connector · 2026-04-16T20:59:32Z

            for task_id in task_ids:
                if task_id not in mapping:
                    self._transitions.buffer_direct_kill(task_id.to_wire())


Wake direct-provider loop after buffering kill requests

The K8s branch in kill_tasks_on_workers() now only calls buffer_direct_kill() and never signals _heartbeat_event. Because this commit also removed all other _heartbeat_event.set() calls, buffered direct kills are no longer dispatched immediately and will wait until the next periodic provider sync tick, which can noticeably delay cancellation/preemption when heartbeat_interval is large.


claude · 2026-04-16T21:04:20Z

+        elif isinstance(self._provider, K8sTaskProvider):
+            mapping = dict(task_kill_workers or {})
+            unresolved = task_ids - set(mapping.keys())
+            if unresolved:
+                mapping.update(_task_worker_mapping(self._db, unresolved))
            for task_id in task_ids:
                if task_id not in mapping:
                    self._transitions.buffer_direct_kill(task_id.to_wire())


Behavioral regression: K8s direct kills no longer wake _run_direct_provider_loop promptly.

Before this PR, _stop_tasks called self._heartbeat_event.set() after buffer_direct_kill to flush buffered K8s kills immediately. The PR removed every _heartbeat_event.set() call, but _run_direct_provider_loop still blocks on self._heartbeat_event.wait(timeout=limiter.time_until_next()) at L1420. With heartbeat_interval defaulting to 5 s, buffered kills now sit for up to ~5 s before the next scheduled tick flushes them, where previously they fired immediately.

Two side-effects to note:

Shutdown of the direct-provider loop is also delayed, since the wait(...) no longer listens to stop_event and nothing else sets _heartbeat_event.

The construction comment is now stale — # Used by K8s direct provider loop to wake on buffered kills. at L1106 — nothing ever sets the event.

Either restore the wake after buffering direct kills:

        elif isinstance(self._provider, K8sTaskProvider):
            mapping = dict(task_kill_workers or {})
            unresolved = task_ids - set(mapping.keys())
            if unresolved:
                mapping.update(_task_worker_mapping(self._db, unresolved))
            for task_id in task_ids:
                if task_id not in mapping:
                    self._transitions.buffer_direct_kill(task_id.to_wire())
            self._heartbeat_event.set()

…or drop _heartbeat_event entirely, switch the loop to stop_event.wait(timeout=...) (which at least fixes shutdown latency), and accept up to heartbeat_interval flush latency as intentional.
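To make the first option concrete, here is a small sketch of a loop that flushes buffered kills as soon as the event is set instead of waiting for the next tick. `DirectKillLoop` is a hypothetical stand-in for the direct-provider loop; only the `buffer_direct_kill` name is borrowed from the diff, and its behavior here is assumed.

```python
import threading


class DirectKillLoop:
    """Hypothetical loop that flushes buffered kills on wake or on each tick."""

    def __init__(self, interval_s: float) -> None:
        self._interval_s = interval_s
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._buffered: list[str] = []
        self.flushed: list[str] = []
        self.flush_done = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def buffer_direct_kill(self, task_id: str, wake: bool = True) -> None:
        with self._lock:
            self._buffered.append(task_id)
        if wake:
            # Append first, then set: the loop clears the event before it
            # swaps the buffer, so a kill cannot be missed.
            self._wake.set()

    def _run(self) -> None:
        while not self._stop.is_set():
            self._wake.wait(timeout=self._interval_s)
            self._wake.clear()
            with self._lock:
                batch, self._buffered = self._buffered, []
            if batch:
                self.flushed.extend(batch)
                self.flush_done.set()

    def stop(self) -> None:
        self._stop.set()
        self._wake.set()
        self._thread.join(timeout=2.0)
```

Calling `buffer_direct_kill(..., wake=False)` in this sketch reproduces the regression described above: the kill is only flushed once the `interval_s` timeout expires.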

claude · 2026-04-16T21:04:30Z

+        if not isinstance(self._provider, WorkerProvider):
+            return


The pre-PR version asserted isinstance(self._provider, WorkerProvider); this downgrades to a silent early return. Per AGENTS.md: "Assert liberally; prefer raise ValueError over silent fallbacks." If a non-WorkerProvider ever reaches this method, an affected batch of task kills now vanishes with no signal instead of crashing loudly.

The same pattern was added in _dispatch_assignments_direct at L2079-2080 — and is more dangerous there because queue_assignments has already committed task-attempt rows before the silent return, potentially leaving orphaned attempts.

Suggested change

-        if not isinstance(self._provider, WorkerProvider):
-            return
+        assert isinstance(self._provider, WorkerProvider)
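The difference between the two behaviors can be shown with a toy example. The classes below are empty stand-ins that only reuse the provider names from this PR, and `stop_tasks_silent` / `stop_tasks_strict` are hypothetical functions. The strict variant raises `ValueError`, following the AGENTS.md guideline quoted above; unlike `assert`, an explicit raise is not stripped when Python runs with `-O`.

```python
class WorkerProvider:
    """Empty stand-in for the real provider class."""


class K8sTaskProvider:
    """Empty stand-in for the real provider class."""


def stop_tasks_silent(provider: object, task_ids: set[str]) -> list[str]:
    """Mirrors the silent early return: a wrong provider drops the kills."""
    if not isinstance(provider, WorkerProvider):
        return []
    return sorted(task_ids)


def stop_tasks_strict(provider: object, task_ids: set[str]) -> list[str]:
    """Fails loudly when called with an unexpected provider type."""
    if not isinstance(provider, WorkerProvider):
        raise ValueError(
            f"expected WorkerProvider, got {type(provider).__name__}; "
            f"refusing to drop {len(task_ids)} kill(s)"
        )
    return sorted(task_ids)
```

In the silent variant the caller cannot distinguish "nothing to kill" from "kills were discarded", which is the failure mode this comment is concerned about.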

rjpower added 5 commits April 16, 2026 19:58

[iris] Add new focused RPCs for heartbeat refactor (Phase 0)

df56b61

Add Ping, StartTasks, StopTasks, PollTasks to WorkerService and UpdateTaskStatus to ControllerService. Worker handlers extracted from handle_heartbeat(). Existing heartbeat unchanged — zero behavior change.

rjpower added the agent-generated Created by automation/agent label Apr 16, 2026

chatgpt-codex-connector bot reviewed Apr 16, 2026


claude bot reviewed Apr 16, 2026


rjpower force-pushed the rjpower/heartbeat-refactor branch from cf2be86 to 6267bdd on April 16, 2026 23:57

rjpower wants to merge 5 commits into rjpower/heartbeat-refactor from rjpower/heartbeat-refactor-phase4
