Large TPU workers can take longer than five seconds to finish a heartbeat because the handler is synchronous. Raise the controller RPC budget and log enough worker state to explain transient heartbeat failures and stale reaping.
Claude finished @dlwh's task in 2m 58s. Code review: no issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance. Notes from the review pass:
health_error = response.health_error or "worker reported unhealthy"
raise ProviderError(f"worker {batch.worker_id} reported unhealthy: {health_error}")

# Forward log entries from old workers that still piggyback logs on
we can delete this, all workers send logs directly now
🤖 Removed the legacy heartbeat-response log forwarding path in 63ae1a5d8. That also let me drop WorkerProvider.log_pusher, remove the controller-side wiring for WorkerProvider, and delete the old-worker fallback tests from test_heartbeat.py.
Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. The worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices. Complements #4792 and #4793.
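For readers unfamiliar with the pattern, the nested timers described above can be sketched roughly as follows. This is an illustrative stand-in, not the project's actual `slow_log` helper: the context-manager signature, the logger name, and the `handle_heartbeat` arguments are all assumptions. Only the thresholds (500ms, 2000ms, 1000ms) come from the commit message.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("heartbeat_sketch")


@contextmanager
def slow_log(label, threshold_ms, log=logger):
    """Emit a warning if the wrapped block runs for at least threshold_ms."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        if elapsed_ms >= threshold_ms:
            log.warning(
                "%s took %.0fms (threshold %.0fms)", label, elapsed_ms, threshold_ms
            )


def handle_heartbeat(tasks_to_run, tasks_to_kill, submit_task, kill_task):
    """Assumed handler shape: an outer timer plus one timer per task operation."""
    with slow_log("heartbeat", 1000):
        for task_id in tasks_to_run:
            with slow_log(f"submit_task {task_id}", 500):
                submit_task(task_id)
        for task_id in tasks_to_kill:
            with slow_log(f"kill_task {task_id}", 2000):
                kill_task(task_id)
```

Because each task gets its own timer inside the outer one, a slow heartbeat produces both the overall warning and the label of the specific task that caused it.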
Raise the controller-to-worker heartbeat RPC timeout and log failure counts, stale ages, and worker addresses when heartbeats fail or stale workers are reaped. This covers the registered-but-never-advanced path from base issue #4697.
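A minimal sketch of the kind of logging this commit describes, assuming a simple in-memory worker table. The `WorkerState` fields, function names, and the staleness threshold are hypothetical and chosen for illustration; the real controller's data model and the raised timeout value are not shown in this thread.

```python
import logging
from dataclasses import dataclass

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("reaper_sketch")


@dataclass
class WorkerState:
    worker_id: str
    address: str
    # Assumed to be set at registration, so a worker that registers but
    # never advances its heartbeat still ages out.
    last_heartbeat_s: float
    consecutive_failures: int = 0


def record_heartbeat_failure(worker, error, now_s):
    """Count the failure and log enough state to explain it later."""
    worker.consecutive_failures += 1
    logger.warning(
        "heartbeat to worker %s at %s failed (%d consecutive, last success %.1fs ago): %s",
        worker.worker_id,
        worker.address,
        worker.consecutive_failures,
        now_s - worker.last_heartbeat_s,
        error,
    )


def reap_stale_workers(workers, now_s, stale_after_s):
    """Remove workers whose last heartbeat is older than stale_after_s."""
    reaped = []
    for worker_id, worker in list(workers.items()):
        age_s = now_s - worker.last_heartbeat_s
        if age_s > stale_after_s:
            logger.warning(
                "reaping stale worker %s at %s: last heartbeat %.1fs ago, %d consecutive failures",
                worker_id,
                worker.address,
                age_s,
                worker.consecutive_failures,
            )
            del workers[worker_id]
            reaped.append(worker_id)
    return reaped
```

Logging the address and stale age at reap time means a later reader of the logs can tell a slow-but-alive worker apart from one that never sent a heartbeat after registering.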
Part of #4746