[iris] Raise worker heartbeat RPC timeout by dlwh · Pull Request #4793 · marin-community/marin

dlwh · 2026-04-15T21:37:35Z

Raise the controller-to-worker heartbeat RPC timeout and log failure counts, stale ages, and worker addresses when heartbeats fail or stale workers are reaped. This covers the registered-but-never-advanced path from base issue #4697.

Part of #4746

Large TPU workers can take longer than five seconds to finish a heartbeat because the handler is synchronous. Raise the controller RPC budget and log enough worker state to explain transient heartbeat failures and stale reaping.
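As a rough illustration of the diagnostics described above, the sketch below formats one heartbeat-failure log line carrying the failure count, the age of the last success, and the worker address. It is not the code from this PR: `HeartbeatFailure`, `describe_failure`, the `retry`/`reap` action names, and the `HEARTBEAT_RPC_TIMEOUT_MS` constant are hypothetical names chosen for the example, and the 30s value is taken from the default mentioned in the review on this PR.

```python
import logging
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed constant name; the review on this PR mentions a 30s default RPC timeout,
# up from the five-second budget that large TPU workers could exceed.
HEARTBEAT_RPC_TIMEOUT_MS = 30_000


@dataclass(frozen=True)
class HeartbeatFailure:
    """Hypothetical record of one failed heartbeat (not the PR's actual type)."""

    worker_id: str
    address: str
    consecutive_failures: int
    failure_threshold: int
    last_success_age_ms: Optional[int]  # None if the worker never heartbeated
    error: str


def describe_failure(f: HeartbeatFailure) -> Tuple[int, str]:
    """Return (log level, message) for a failed heartbeat."""
    # Assumption: a worker is reaped once failures reach the threshold.
    reaped = f.consecutive_failures >= f.failure_threshold
    action = "reap" if reaped else "retry"
    if f.last_success_age_ms is None:
        age = "unknown"
    else:
        age = f"{f.last_success_age_ms / 1000:.1f}"
    message = (
        f"heartbeat failed worker={f.worker_id} address={f.address} "
        f"action={action} "
        f"failures={f.consecutive_failures}/{f.failure_threshold} "
        f"last_success_age_s={age} error={f.error}"
    )
    # Reaping is logged at ERROR, a transient failure at WARNING.
    return (logging.ERROR if reaped else logging.WARNING), message
```

Keeping the failure count next to its threshold (`failures=3/10`) makes it possible to tell from a single log line how close a worker is to being reaped.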

claude · 2026-04-15T21:37:55Z

Claude finished @dlwh's task in 2m 58s

Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

Notes from the review pass:

controller.py log includes worker=, address=, action=, failures=N/M, last_success_age_s=, plus expected/run/kill counts and the underlying error — fields are wired through HeartbeatFailureResult correctly and logging.ERROR/WARNING matches the action.
transitions.py correctly hoists now_ms = now_ms or Timestamp.now().epoch_ms() so it's available before computing last_heartbeat_age_ms; the early-return path (worker row missing) sets failure_threshold so the controller log doesn't show 0/0.
worker_provider.py wraps the RPC in try/except, prepends RPC context to both ProviderError and unexpected exceptions, and emits a Slow heartbeat RPC succeeded warning past 10s. getattr(self.stub_factory, "timeout_ms", None) is reasonable since the WorkerStubFactory Protocol doesn't declare it (test fakes work without it).
New tests cover the diagnostic log content, the 30s default RPC timeout, and the error-string context propagation.
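The worker_provider.py behavior described in the review notes can be sketched roughly as follows: read an optional `timeout_ms` with `getattr`, wrap the RPC so every error carries the RPC context, and warn when a successful call takes longer than 10s. This is a minimal sketch under stated assumptions, not the code from the PR. `ProviderError` and `WorkerStubFactory` here are local stand-ins for the names in the review, and `get_stub`, `heartbeat`, and the choice to convert unexpected exceptions into `ProviderError` are assumptions made for the example.

```python
import logging
import time
from typing import Callable, Protocol

logger = logging.getLogger("worker_provider_sketch")

SLOW_HEARTBEAT_WARN_S = 10.0  # the review mentions a warning past 10s


class ProviderError(Exception):
    """Local stand-in for the ProviderError named in the review."""


class WorkerStubFactory(Protocol):
    """Stand-in Protocol; it deliberately does not declare timeout_ms."""

    def get_stub(self, address: str) -> Callable[[bytes], bytes]: ...


def heartbeat(
    factory: WorkerStubFactory,
    worker_id: str,
    address: str,
    payload: bytes,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    # The Protocol has no timeout_ms attribute, so test fakes may omit it;
    # getattr with a default keeps the context string usable either way.
    timeout_ms = getattr(factory, "timeout_ms", None)
    context = (
        f"heartbeat RPC worker={worker_id} address={address} "
        f"timeout_ms={timeout_ms}"
    )
    start = clock()
    try:
        response = factory.get_stub(address)(payload)
    except ProviderError as e:
        # Prepend the RPC context to provider errors.
        raise ProviderError(f"{context}: {e}") from e
    except Exception as e:
        # Assumption: unexpected errors are re-raised as ProviderError.
        raise ProviderError(
            f"{context}: unexpected {type(e).__name__}: {e}"
        ) from e
    elapsed = clock() - start
    if elapsed > SLOW_HEARTBEAT_WARN_S:
        logger.warning(
            "Slow heartbeat RPC succeeded in %.1fs (%s)", elapsed, context
        )
    return response
```

Passing the clock as a parameter is only a convenience for testing the slow-call warning without sleeping.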

rjpower · 2026-04-15T21:57:07Z

+                health_error = response.health_error or "worker reported unhealthy"
+                raise ProviderError(f"worker {batch.worker_id} reported unhealthy: {health_error}")
+
+            # Forward log entries from old workers that still piggyback logs on


we can delete this, all workers send logs directly now

🤖 Removed the legacy heartbeat-response log forwarding path in 63ae1a5d8. That also let me drop WorkerProvider.log_pusher, remove the controller-side wiring for WorkerProvider, and delete the old-worker fallback tests from test_heartbeat.py.

Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. The worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices. Complements #4792 and #4793.
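The nested timers described above could look roughly like the following. This is a hypothetical sketch: the real `slow_log` helper in the repository may have a different signature and log format, and `handle_heartbeat_sketch` and its `submit` callback are invented for the example. Only the thresholds (500ms per `submit_task`, 1000ms for the outer heartbeat) come from the description.

```python
import logging
import time
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator

logger = logging.getLogger("slow_log_sketch")


@contextmanager
def slow_log(label: str, threshold_ms: float) -> Iterator[None]:
    """Log a warning if the wrapped block takes longer than threshold_ms."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs even if the block raises, so slow failures are also reported.
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > threshold_ms:
            logger.warning(
                "slow operation %s took %.0fms (threshold %.0fms)",
                label,
                elapsed_ms,
                threshold_ms,
            )


def handle_heartbeat_sketch(
    task_ids: Iterable[str], submit: Callable[[str], None]
) -> None:
    """Hypothetical handler: an outer timer plus one timer per task."""
    with slow_log("heartbeat", 1000):
        for task_id in task_ids:
            # The per-task timer names the task that stalled the heartbeat.
            with slow_log(f"submit_task {task_id}", 500):
                submit(task_id)
```

With both levels in place, a slow heartbeat produces one outer warning plus a warning for each task that exceeded its own threshold, which narrows the stall down to a specific task.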

[iris] Raise worker heartbeat RPC timeout

5a43c8b

Large TPU workers can take longer than five seconds to finish a heartbeat because the handler is synchronous. Raise the controller RPC budget and log enough worker state to explain transient heartbeat failures and stale reaping.

dlwh added the agent-generated label (Created by automation/agent) on Apr 15, 2026

rjpower approved these changes Apr 15, 2026

iris: drop legacy heartbeat log forwarding

63ae1a5

rjpower mentioned this pull request Apr 15, 2026

[iris] Add heartbeat and slice-lifecycle debug logging #4796

Merged

dlwh merged commit 089335f into main Apr 15, 2026
44 of 45 checks passed

dlwh deleted the codex/iris-heartbeat-rpc-timeout branch April 15, 2026 22:55

dlwh merged 2 commits into main from codex/iris-heartbeat-rpc-timeout


2 participants
