Skip to content

[iris] Raise worker heartbeat RPC timeout#4793

Merged
dlwh merged 2 commits intomainfrom
codex/iris-heartbeat-rpc-timeout
Apr 15, 2026
Merged

[iris] Raise worker heartbeat RPC timeout#4793
dlwh merged 2 commits intomainfrom
codex/iris-heartbeat-rpc-timeout

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Apr 15, 2026

Raise the controller-to-worker heartbeat RPC timeout and log failure counts, stale ages, and worker addresses when heartbeats fail or stale workers are reaped. This covers the registered-but-never-advanced path from base issue #4697.

Part of #4746

Large TPU workers can take longer than five seconds to finish a heartbeat because the handler is synchronous. Raise the controller RPC budget and log enough worker state to explain transient heartbeat failures and stale reaping.
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 15, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @dlwh's task in 2m 58s —— View job


Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

Notes from the review pass:

  • controller.py log includes worker=, address=, action=, failures=N/M, last_success_age_s=, plus expected/run/kill counts and the underlying error — fields are wired through HeartbeatFailureResult correctly and logging.ERROR/WARNING matches the action.

  • transitions.py correctly hoists now_ms = now_ms or Timestamp.now().epoch_ms() so it's available before computing last_heartbeat_age_ms; the early-return path (worker row missing) sets failure_threshold so the controller log doesn't show 0/0.

  • worker_provider.py wraps the RPC in try/except, prepends RPC context to both ProviderError and unexpected exceptions, and emits a Slow heartbeat RPC succeeded warning past 10s. getattr(self.stub_factory, "timeout_ms", None) is reasonable since the WorkerStubFactory Protocol doesn't declare it (test fakes work without it).

  • New tests cover the diagnostic log content, the 30s default RPC timeout, and the error-string context propagation.

  • Check PR status / prior reviews

  • Identify relevant CLAUDE.md/AGENTS.md files

  • Summarize PR changes

  • Parallel review (CLAUDE.md compliance + bugs)

  • Validate flagged issues

  • Post review
    | Branch

health_error = response.health_error or "worker reported unhealthy"
raise ProviderError(f"worker {batch.worker_id} reported unhealthy: {health_error}")

# Forward log entries from old workers that still piggyback logs on
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can delete this, all workers send logs directly now

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Removed the legacy heartbeat-response log forwarding path in 63ae1a5d8. That also let me drop WorkerProvider.log_pusher, remove the controller-side wiring for WorkerProvider, and delete the old-worker fallback tests from test_heartbeat.py.

rjpower added a commit that referenced this pull request Apr 15, 2026
Per-task slow_log timers on submit_task (500ms) and synchronous kill
(2000ms) inside handle_heartbeat identify which task stalls a heartbeat.
Worker service heartbeat entrypoint gets an outer slow_log (1000ms) and
a DEBUG payload-size line to correlate with controller-side sync timing.
Slice ready/failed transitions log registered worker counts and ids to
expose partial bootstrap on large slices.

Complements #4792 and #4793.
@dlwh dlwh merged commit 089335f into main Apr 15, 2026
44 of 45 checks passed
@dlwh dlwh deleted the codex/iris-heartbeat-rpc-timeout branch April 15, 2026 22:55
rjpower added a commit that referenced this pull request Apr 15, 2026
Per-task slow_log timers on submit_task (500ms) and synchronous kill
(2000ms) inside handle_heartbeat identify which task stalls a heartbeat.
The worker service heartbeat entrypoint gets an outer slow_log (1000ms)
and a DEBUG payload-size line to correlate with controller-side sync
timing. Slice ready/failed transitions log registered worker counts and
ids to expose partial bootstrap on large slices.

Complements #4792 and #4793.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants