Skip to content

[Iris] Remove legacy monolithic Heartbeat RPC#5092

Merged
rjpower merged 2 commits intomainfrom
claude/keen-perlman-26e3ab
Apr 23, 2026
Merged

[Iris] Remove legacy monolithic Heartbeat RPC#5092
rjpower merged 2 commits intomainfrom
claude/keen-perlman-26e3ab

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 22, 2026

Iris has migrated to a split-heartbeat model (Ping + StartTasks + StopTasks + PollTasks), so the legacy Heartbeat RPC, HeartbeatRequest/Response, use_split_heartbeat flag, HeartbeatAction enum, DispatchBatch, and all begin/complete/fail_heartbeat + drain_dispatch paths are now dead weight. This removes them and rewrites the few tests that still covered behaviour against the new RPC surface.

Fixes #4692

Iris has migrated to a split-heartbeat model (Ping + StartTasks +
StopTasks + PollTasks). This removes the legacy Heartbeat RPC, the
HeartbeatRequest/HeartbeatResponse messages, the use_split_heartbeat
config flag, the HeartbeatAction enum, the DispatchBatch dataclass, and
all associated controller/worker code paths (begin_heartbeat /
complete_heartbeat / fail_heartbeat / drain_dispatch /
drain_dispatch_all / record_heartbeat_failure / buffer_dispatch /
buffer_kill).

Tests rewritten against the new RPC surface where behaviour still
applies; tests that asserted now-deleted semantics (blocking
tasks_to_kill, provider.sync dry-run spy) are removed.

Fixes #4692
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Code review in progress

  • Check PR status (not closed/draft/already reviewed)
  • Identify relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Run 4 parallel review agents (CLAUDE.md compliance + bug-finding)
  • Validate flagged issues
  • Post inline comments / summary

View job runBranch claude/keen-perlman-26e3ab

@rjpower rjpower requested a review from yonromai April 22, 2026 23:39
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a71f663aea

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 6 to 10
import asyncio
import logging
import threading
from dataclasses import dataclass
from time import monotonic
from typing import Protocol
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-map controller heartbeat chaos keys to split RPCs

Removing the legacy heartbeat implementation also removed all call sites for chaos("controller.heartbeat") and chaos("controller.heartbeat.iteration"), so enable_chaos for those keys is now a no-op. The E2E chaos suite still uses these keys (for example in lib/iris/tests/e2e/test_chaos.py at lines 123, 188, 241, and 339) to force controller-side dispatch/poll failures; after this change those scenarios stop exercising failure handling and assertions that depend on induced failures can no longer hold. Please either map these chaos keys onto the new Ping/StartTasks/StopTasks/PollTasks paths or update the keys/tests in the same change.

Useful? React with 👍 / 👎.

Comment on lines 6 to 8
import logging
import time
from typing import Protocol

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Re-map worker heartbeat chaos key to active RPC handlers

The commit removes the only chaos("worker.heartbeat") injection point along with the heartbeat RPC handler, which makes that chaos key inert. Existing E2E coverage still relies on this key (for example lib/iris/tests/e2e/test_chaos.py lines 206, 215, and 233) to simulate transient and permanent worker-side communication faults, so those tests and operational fault-injection workflows no longer validate the intended failure paths. The worker chaos hook should be preserved on an equivalent split RPC path (e.g., Ping/PollTasks) or migrated together with the tests.

Useful? React with 👍 / 👎.

The previous commit removed controller.heartbeat / worker.heartbeat /
controller.heartbeat.iteration chaos keys along with the Heartbeat RPC,
silently turning the e2e chaos suite into dead assertions. Re-add hooks
on the equivalent split RPCs (controller.{ping,start_tasks,stop_tasks,
poll_tasks,poll_iteration}, worker.ping) and point the tests at the new
keys. Threshold-driven tests now target controller.ping, dispatch-fault
tests target controller.start_tasks, and the 128-task race test targets
controller.poll_iteration.
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Change Walkthrough

0. Relevant Context

Context

This PR finishes the cleanup after Iris moved to the split worker lifecycle model: Ping, StartTasks, StopTasks, PollTasks, and worker-pushed UpdateTaskStatus. The removed code is the old monolithic Heartbeat RPC path and the controller/provider plumbing that only existed to support it.

Issue: #4692

1. Problem Statement

The repo still carried a second worker-control path that was already obsolete in practice:

  • legacy Heartbeat request/response protos and Connect endpoint
  • controller use_split_heartbeat config plumbing
  • worker-provider sync code built around DispatchBatch
  • tests and docs that still described the monolithic heartbeat flow

That extra surface area made the state machine harder to reason about and left dead config and protocol paths around.

2. Assumptions And Solution Summary

Assumption: supported Iris clusters already run on the split lifecycle RPCs, so the monolithic heartbeat path is safe to remove rather than preserve behind compatibility flags.

The change deletes the legacy RPC surface, collapses worker-backed controller execution onto the split path only, rewrites the worker/provider/controller implementations around that contract, and updates the remaining docs/tests/examples to match.

3. Design & Architecture

  • Worker-backed providers now use direct RPC helpers for Ping, StartTasks, StopTasks, and PollTasks; the controller no longer maintains a separate worker sync() path.
  • The controller’s non-K8s execution model is now explicit: scheduling loop, ping loop, and batched task-update loop.
  • Worker reconciliation stays split: StartTasks launches work, PollTasks reconciles expected tasks, and UpdateTaskStatus pushes state transitions back to the controller.
  • K8s direct-provider code is mostly unchanged; the PR mainly removes stale shared terminology from that path.

4. Major Changes

  • Removed legacy heartbeat protos, generated bindings, and worker service endpoints.
  • Removed the use_split_heartbeat controller config flag and its call sites.
  • Simplified WorkerProvider and controller transitions by deleting DispatchBatch/drain/requeue legacy machinery.
  • Rewrote affected tests to exercise the split RPC surface directly.
  • Updated Iris docs/examples to stop referring to the monolithic heartbeat path.

5. Testing

  • cd lib/iris && uv run --group dev python -m pytest -n1 --tb=short tests/cluster/worker/test_worker.py tests/cluster/controller/test_service.py
  • cd lib/iris && uv run --no-sync --group dev python -m pytest -n1 --tb=short tests/cluster/controller/test_transitions.py -k 'task_update_worker_failed_cascades_children or max_failures_kills_direct_provider_tasks'
  • cd lib/iris && uv run --no-sync --group dev python -m pytest -n1 --tb=short tests/cluster/controller/test_reservation.py -k 'holder_task_worker_death_no_failure_record or get_running_tasks_for_poll_excludes_reservation_holders'

Code Review

Comments

No inline comments.

Review Summary

🤖 Approved.

No issues found. Checked for bugs and AGENTS.md compliance.

Validation:

  • targeted worker and controller service tests passed
  • targeted transition and reservation regression tests passed

Generated with Codex

@rjpower rjpower merged commit 4ba02bd into main Apr 23, 2026
42 of 43 checks passed
@rjpower rjpower deleted the claude/keen-perlman-26e3ab branch April 23, 2026 01:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[iris] Remove heartbeat log forwarding now that workers push directly to log service

2 participants