Conversation
Large TPU pods were timing out while queued resources became ready and while the last workers finished bootstrap. Scale the wait budgets by pod size and log the missing workers and probe errors that remain when bootstrap stalls.
|
Claude finished @dlwh's task in 4m 57s —— View job PR Review complete
Summary: No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance. Two candidate issues were raised by the compliance agents and validated:
Bug agents found no syntax, logic, timeout-semantic, or error-handling issues; tests exercise real boundary values (1 / 64 / 256). |
Code review🤖 No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance. |
| return DEFAULT_TPU_BOOTSTRAP_TIMEOUT | ||
|
|
||
|
|
||
| def _format_probe_error(error: BaseException) -> str: |
There was a problem hiding this comment.
this is just str(error) ?
| # ============================================================================= | ||
|
|
||
|
|
||
| def test_recommended_tpu_timeouts_scale_with_pod_size(): |
|
@claude recommended.*timeout -> default.*timeout, fix up the slop as per PR comments |
|
Claude finished @rjpower's task in 3m 43s —— View job Code changesAlready on the branch — PR author pushed
I had staged an equivalent local commit and reset it in favor of the author's version (which picks a slightly more defensive fallback for empty exception messages). No push from me was needed. Validation
|
Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. Worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices. Complements #4792 and #4793.
Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. The worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices. Complements #4792 and #4793.
Scale TPU queued-resource and bootstrap waits by pod size and log the missing workers and probe errors when bootstrap stalls. This covers the 255/256 healthy provisioning path from base issue #4697.
Part of #4746