Conversation
Coscheduled retries could briefly leave old coordinator endpoints visible, letting tasks bootstrap against different JAX coordinators. Prefer the newest endpoint and clear a task's stale endpoints before assigning a retry so distributed init converges on a single coordinator.
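The two behaviors described above can be sketched with a toy in-memory registry. All names here (`EndpointRegistry`, `assign_retry`) are illustrative assumptions, not the project's real API: each registration carries a timestamp so resolution prefers the newest endpoint, and a task's entries are cleared before a retry is assigned.

```python
import time

class EndpointRegistry:
    """Toy endpoint registry (hypothetical; for illustration only)."""

    def __init__(self):
        self._endpoints = {}  # task_id -> list of (timestamp, address)

    def register(self, task_id, address):
        # Record when each endpoint was registered so we can order them.
        self._endpoints.setdefault(task_id, []).append(
            (time.monotonic(), address)
        )

    def resolve(self, task_id):
        """Prefer the newest registration so a stale coordinator never wins."""
        entries = self._endpoints.get(task_id)
        if not entries:
            return None
        return max(entries)[1]  # max by timestamp -> latest address

    def clear(self, task_id):
        """Drop all registrations for a task."""
        self._endpoints.pop(task_id, None)


def assign_retry(registry, task_id):
    # Clear stale endpoints before the retry attempt is handed out, so
    # distributed init on the new attempt cannot resolve an old coordinator.
    registry.clear(task_id)
```

With this sketch, a task that registered `host-a` and then `host-b` resolves to `host-b`, and after `assign_retry` it resolves to nothing until the new attempt registers itself.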
Claude finished @Calvin-Xu's task in 4m 17s

Code review: No high-signal issues found. Checked for bugs and CLAUDE.md / AGENTS.md compliance.

Summary of what the PR does: Belt-and-suspenders fix for stale JAX coordinator resolution across retries:
Minor observation (non-blocking): The two new
This seems reasonable, but @claude, don't we already have a post-condition from the TASK_PREEMPTED/TASK_FAILED path which deletes registry entries from tasks? We should be deleting the entry long before we try to reschedule.
Claude finished @rjpower's task in 3m 36s
Answer

Good question. Yes — both existing failure paths already delete endpoints before the task enters PENDING for retry. The cleanup happens well before rescheduling:
So Russell is right — under normal flow, endpoints are gone before the task is even eligible for rescheduling. What the PR guards against is re-registration in the window between cleanup and dispatch: since

Whether this race is worth defending against in practice is a judgment call — it requires the old worker to register an endpoint after the controller has already processed the failure and set the task to PENDING, which is a narrow window. But for JAX coordinator resolution specifically (where
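The narrow window discussed here can be made concrete with a toy timeline. Everything below is hypothetical (a plain dict stands in for the registry; the function names are invented): failure cleanup deletes the entry, a stale worker then re-registers the old coordinator, and a second clear at dispatch time closes the window.

```python
registry = {}  # task_id -> coordinator address (toy stand-in for the registry)

def cleanup_on_failure(task_id):
    # Existing TASK_FAILED / TASK_PREEMPTED path: delete the entry.
    registry.pop(task_id, None)

def late_worker_register(task_id, addr):
    # The race: an old worker registers after cleanup, before dispatch.
    registry[task_id] = addr

def dispatch_retry(task_id):
    # Belt-and-suspenders: clear again when the retry is dispatched.
    registry.pop(task_id, None)

# Timeline: failure cleanup, stale worker sneaks in, retry dispatches.
cleanup_on_failure("t1")
late_worker_register("t1", "old-host:1234")
dispatch_retry("t1")
assert registry.get("t1") is None  # retry starts with no stale endpoint
```

Without the clear in `dispatch_retry`, the retry would resolve `old-host:1234` and bootstrap against the dead coordinator.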
Delete stale endpoint registrations before retried task attempts so workers cannot resolve an old coordinator. Add regression coverage for the retry path and Iris TPU distributed initialization so failed attempts do not split bootstrap state across retries.
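The regression coverage mentioned above might look roughly like the following. This is a hedged sketch only: `RegistryStub` and `schedule_retry` are invented stand-ins, not the project's real test helpers.

```python
class RegistryStub:
    """Minimal stand-in for the endpoint registry (hypothetical)."""

    def __init__(self):
        self.endpoints = {}  # task_id -> coordinator address

    def register(self, task_id, addr):
        self.endpoints[task_id] = addr

    def clear(self, task_id):
        self.endpoints.pop(task_id, None)


def schedule_retry(registry, task_id):
    # Behavior under test: stale endpoints are cleared before the retry.
    registry.clear(task_id)
    return task_id


def test_retry_clears_stale_endpoint():
    reg = RegistryStub()
    reg.register("t1", "old-coordinator:1234")
    schedule_retry(reg, "t1")
    # The retried attempt must not be able to resolve the old coordinator.
    assert "t1" not in reg.endpoints


test_retry_clears_stale_endpoint()
```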