
[iris] Remove old monolithic heartbeat path (Phase 4)#4843

Open
rjpower wants to merge 5 commits into rjpower/heartbeat-refactor from rjpower/heartbeat-refactor-phase4


@rjpower rjpower commented Apr 16, 2026

🤖 Phase 4 of the heartbeat refactor, split into its own PR for ease of review.

Stacked on top of #4638 (Phases 0–3). Merge that first.

Summary

Split heartbeat mode is now the only path for `WorkerProvider`. Removed:

  • `Controller._run_provider_loop`, `_sync_all_execution_units`, `_buffer_assignments`
  • `ControllerTransitions.drain_dispatch_all`, `buffer_kill`, `fail_heartbeats_batch`
  • `WorkerProvider.sync`, `_heartbeat_one`, `_apply_request_from_response`, `_heartbeat_rpc_context`
  • `Worker.handle_heartbeat` (old monolithic handler)
  • `WorkerService.Heartbeat` RPC (proto + generated stubs)
  • 19 old heartbeat-specific tests
  • `use_split_heartbeat` config flag (split mode is now mandatory)

Kept the `dispatch_queue` table — still used by the K8s direct provider.

Net: -1697 lines of old heartbeat machinery.

Test plan

  • `uv run pytest lib/iris/tests/cluster/controller lib/iris/tests/cluster/worker` — 836 passed
  • `./infra/pre-commit.py --fix` — clean
  • Canary ferry on a dev cluster

rjpower added 5 commits April 16, 2026 19:58
Add Ping, StartTasks, StopTasks, PollTasks to WorkerService and
UpdateTaskStatus to ControllerService. Worker handlers extracted from
handle_heartbeat(). Existing heartbeat unchanged — zero behavior change.

When use_split_heartbeat=True, run additional Ping and Poll loops
alongside the existing provider loop. Ping loop detects worker failures
faster (10s interval, 10 failure threshold). Poll loop reconciles
task state every 30s as a safety net. Existing provider loop unchanged.
…ase 2)

When use_split_heartbeat=True, scheduling sends StartTasks RPCs directly
after committing assignments (no dispatch_queue intermediary). Kills send
StopTasks directly. Task updater thread drains a queue every 1s for
batched state transitions. Provider loop replaced by ping + poll + updater.

Workers now push task state transitions to the controller via
UpdateTaskStatus RPC in batches. TaskStateReporter thread on the
worker watches TaskAttempt state changes via on_state_change callback
and enqueues updates with bounded queue + exponential backoff retry.
Controller handler applies updates through existing transitions path.

Split heartbeat mode is now the only path for WorkerProvider. Removed:
- _run_provider_loop, _sync_all_execution_units, _buffer_assignments
- drain_dispatch_all, buffer_kill, fail_heartbeats_batch
- WorkerProvider.sync, _heartbeat_one
- Worker.handle_heartbeat (old monolithic handler)
- Heartbeat RPC from WorkerService
- 19 old heartbeat-specific tests

Kept dispatch_queue table (still used by K8s direct provider).
-1790 lines of old heartbeat machinery.
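
The "bounded queue + exponential backoff retry" behaviour described in the commit messages above can be illustrated with a minimal sketch. This is not the PR's `TaskStateReporter`: the class `BatchReporter`, its parameters, and the use of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import queue
import time
from typing import Callable, List


class BatchReporter:
    """Hypothetical reporter: bounded queue plus exponential-backoff retry."""

    def __init__(
        self,
        send: Callable[[List[str]], None],  # stand-in for a batched RPC call
        max_pending: int = 1000,
        base_delay: float = 0.01,
        max_delay: float = 1.0,
        max_attempts: int = 5,
    ) -> None:
        self._send = send
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)
        self._unsent: List[str] = []  # batch carried over after a failed flush
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._max_attempts = max_attempts

    def enqueue(self, update: str) -> bool:
        """Queue an update; return False instead of blocking when full."""
        try:
            self._pending.put_nowait(update)
        except queue.Full:
            return False
        return True

    def flush(self) -> bool:
        """Send everything queued as one batch, retrying with backoff."""
        batch, self._unsent = self._unsent, []
        while True:
            try:
                batch.append(self._pending.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return True
        delay = self._base_delay
        for attempt in range(self._max_attempts):
            try:
                self._send(batch)
                return True
            except ConnectionError:
                if attempt + 1 < self._max_attempts:
                    time.sleep(delay)
                    delay = min(delay * 2, self._max_delay)  # double, capped
        self._unsent = batch  # keep the batch for the next flush
        return False
```

In this sketch a full queue rejects new updates rather than blocking the caller, and a batch that exhausts its retries is kept for the next flush. Whether the real reporter drops, blocks, or retains updates in those cases is not stated in the PR text.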
rjpower added the agent-generated label (Created by automation/agent) Apr 16, 2026
claude bot commented Apr 16, 2026

Claude finished @rjpower's task in 11m 35s.


Code review

Reviewed Phase 4 of the heartbeat refactor. Left 2 inline comments on changed lines and flagged 3 additional findings below that couldn't be anchored inline (lines weren't touched by the diff).

Findings

1. Behavioral regression: K8s direct kills no longer wake _run_direct_provider_loop promptly.
Inline: controller.py:2058-2065. Pre-PR, _stop_tasks called self._heartbeat_event.set() after buffer_direct_kill. The PR removed every .set() call, but the loop still blocks on _heartbeat_event.wait(timeout=limiter.time_until_next()). With heartbeat_interval at 5 s, buffered K8s kills sit up to ~5 s before firing. Shutdown is also slower since .wait() no longer listens to stop_event. The construction comment at L1106 is now stale.

2. assert downgraded to silent early return.
Inline: controller.py:2132-2133. _stop_tasks_direct was assert isinstance(self._provider, WorkerProvider); now it silently returns. Same pattern added in _dispatch_assignments_direct at L2079-2080, where queue_assignments has already committed task-attempt rows before the silent return. Violates AGENTS.md: "Assert liberally; prefer raise ValueError over silent fallbacks."

3. Broken call to deleted method in dev benchmark script.
lib/iris/scripts/benchmark_db_queries.py:540-543 still calls transitions.drain_dispatch_all(), which this PR deleted. Running benchmark_db_queries.py --only heartbeat will crash with AttributeError. Not in CI, so not a ship-blocker, but will silently rot. Recommend deleting lines 539-543.

4. Dead code surviving from the removed monolithic path.

Per AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments."

5. Stale comments referencing removed symbols.

Per AGENTS.md: "Delete stale comments immediately on discovery."


Also confirmed (not issues): the new is_reservation_holder = 0 JOIN filter in get_running_tasks_for_poll correctly preserves the pre-PR behavior of excluding holder tasks — and is covered by the new test_holder_tasks_excluded_from_poll_running_tasks test. The fail_workers_batch reimplementation is semantically equivalent to the old fail_heartbeats_batch(force_remove=True) path.
• Branch: rjpower/heartbeat-refactor-phase4

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

if self._heartbeat_thread:
    self._heartbeat_thread.stop()
    self._heartbeat_thread.join(timeout=join_timeout)

P1: Signal heartbeat event before joining provider loop

In the K8s direct-provider path, _run_direct_provider_loop() waits on _heartbeat_event with a timeout of heartbeat_interval, but stop() now joins _heartbeat_thread without waking that event first. If heartbeat_interval is configured above the 5s join timeout, shutdown can continue while the provider loop is still sleeping, and it may later wake and call _sync_direct_provider() after DB/provider teardown. Please wake _heartbeat_event (or make the loop wait on stop_event) before joining.
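
One of the two remedies named above, making the loop wait on `stop_event` itself, can be sketched as follows. This is a simplified stand-in, not the controller's code: the class `ProviderLoop`, its `ticks` counter, and the default timeout are hypothetical.

```python
import threading


class ProviderLoop:
    """Hypothetical periodic loop whose sleep doubles as the stop signal."""

    def __init__(self, interval: float) -> None:
        self._interval = interval
        self._stop_event = threading.Event()
        self.ticks = 0  # stand-in for calls to a periodic sync
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # Event.wait returns True as soon as the event is set and False on
        # timeout, so the loop exits promptly on stop() and never runs one
        # more sync after shutdown was requested.
        while not self._stop_event.wait(timeout=self._interval):
            self.ticks += 1

    def stop(self, join_timeout: float = 5.0) -> bool:
        """Signal first, then join; return True if the thread exited."""
        self._stop_event.set()
        self._thread.join(timeout=join_timeout)
        return not self._thread.is_alive()
```

With this shape the join no longer depends on the interval being shorter than the join timeout, because setting the event interrupts the wait immediately.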


Comment on lines 2063 to 2065
for task_id in task_ids:
    if task_id not in mapping:
        self._transitions.buffer_direct_kill(task_id.to_wire())

P2: Wake direct-provider loop after buffering kill requests

The K8s branch in kill_tasks_on_workers() now only calls buffer_direct_kill() and never signals _heartbeat_event. Because this commit also removed all other _heartbeat_event.set() calls, buffered direct kills are no longer dispatched immediately and will wait until the next periodic provider sync tick, which can noticeably delay cancellation/preemption when heartbeat_interval is large.


Comment on lines +2058 to 2065
elif isinstance(self._provider, K8sTaskProvider):
    mapping = dict(task_kill_workers or {})
    unresolved = task_ids - set(mapping.keys())
    if unresolved:
        mapping.update(_task_worker_mapping(self._db, unresolved))
    for task_id in task_ids:
        if task_id not in mapping:
            self._transitions.buffer_direct_kill(task_id.to_wire())

Behavioral regression: K8s direct kills no longer wake _run_direct_provider_loop promptly.

Before this PR, _stop_tasks called self._heartbeat_event.set() after buffer_direct_kill to flush buffered K8s kills immediately. The PR removed every _heartbeat_event.set() call, but _run_direct_provider_loop still blocks on self._heartbeat_event.wait(timeout=limiter.time_until_next()) at L1420. With heartbeat_interval defaulting to 5 s, buffered kills now sit for up to ~5 s before the next scheduled tick flushes them, where previously they fired immediately.

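Two side-effects to note:

  • Shutdown is slower, since the loop's .wait() no longer listens to stop_event.
  • The construction comment at L1106 is now stale.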

Either restore the wake after buffering direct kills:

elif isinstance(self._provider, K8sTaskProvider):
    mapping = dict(task_kill_workers or {})
    unresolved = task_ids - set(mapping.keys())
    if unresolved:
        mapping.update(_task_worker_mapping(self._db, unresolved))
    for task_id in task_ids:
        if task_id not in mapping:
            self._transitions.buffer_direct_kill(task_id.to_wire())
    self._heartbeat_event.set()

…or drop _heartbeat_event entirely, switch the loop to stop_event.wait(timeout=...) (which at least fixes shutdown latency), and accept up to heartbeat_interval flush latency as intentional.
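
The latency difference between the two options can be seen in a small, self-contained sketch. It is not the controller's implementation: `DirectKillLoop`, its `wake` flag, and the `flush_done` event are hypothetical names used only to show how setting the event after buffering makes the flush immediate, while omitting it leaves the kill waiting for the next timeout.

```python
import threading
from typing import List


class DirectKillLoop:
    """Hypothetical loop that flushes buffered kills on wake or on timeout."""

    def __init__(self, interval: float) -> None:
        self._interval = interval
        self._lock = threading.Lock()
        self._buffered: List[str] = []
        self.flushed: List[str] = []
        self.flush_done = threading.Event()  # test hook, set after a flush
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def buffer_kill(self, task_id: str, wake: bool = True) -> None:
        # Append before signalling so the woken loop always sees the item.
        with self._lock:
            self._buffered.append(task_id)
        if wake:
            self._wake.set()

    def _run(self) -> None:
        while not self._stop.is_set():
            self._wake.wait(timeout=self._interval)
            self._wake.clear()
            with self._lock:
                batch, self._buffered = self._buffered, []
            if batch:
                self.flushed.extend(batch)
                self.flush_done.set()

    def close(self) -> None:
        self._stop.set()
        self._wake.set()  # interrupt the sleep so join returns quickly
        self._thread.join(timeout=2.0)
```

Calling `buffer_kill("t1")` on a loop with a 60 s interval flushes almost immediately, whereas `buffer_kill("t2", wake=False)` leaves the kill buffered until the timeout expires or something else wakes the loop.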

Comment on lines +2132 to +2133
if not isinstance(self._provider, WorkerProvider):
    return

The pre-PR version asserted isinstance(self._provider, WorkerProvider); this downgrades to a silent early return. Per AGENTS.md: "Assert liberally; prefer raise ValueError over silent fallbacks." If a non-WorkerProvider ever reaches this method, an affected batch of task kills now vanishes with no signal instead of crashing loudly.

The same pattern was added in _dispatch_assignments_direct at L2079-2080 — and is more dangerous there because queue_assignments has already committed task-attempt rows before the silent return, potentially leaving orphaned attempts.

Suggested change
- if not isinstance(self._provider, WorkerProvider):
-     return
+ assert isinstance(self._provider, WorkerProvider)
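
The ordering concern raised above (rows committed before a silent return) can be shown with a toy example. Everything here is hypothetical: the two provider classes are empty stand-ins, and `dispatch_assignments` with its `committed` list only models "fail loudly before any state is written"; it does not reproduce `_dispatch_assignments_direct`.

```python
from typing import List


class WorkerProvider:
    """Empty stand-in for the real provider type."""


class K8sTaskProvider:
    """Empty stand-in for the real provider type."""


def dispatch_assignments(
    provider: object, assignments: List[str], committed: List[str]
) -> int:
    """Validate the provider first, then commit; never the reverse."""
    if not isinstance(provider, WorkerProvider):
        # Raising before the commit means no orphaned rows are left behind,
        # and the caller gets a loud failure instead of a silent no-op.
        raise ValueError(
            f"direct dispatch requires WorkerProvider, got {type(provider).__name__}"
        )
    committed.extend(assignments)  # stand-in for writing task-attempt rows
    return len(assignments)
```

A call with a `K8sTaskProvider` raises and leaves `committed` untouched, while a call with a `WorkerProvider` commits and returns the number of assignments.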

@rjpower rjpower force-pushed the rjpower/heartbeat-refactor branch from cf2be86 to 6267bdd Compare April 16, 2026 23:57
Labels

agent-generated Created by automation/agent
