[iris] Skip decommit for reservation-holder tasks in _kill_non_terminal_tasks#4878
Reservation-holder tasks never commit resources on assignment, so they must not decommit on termination. When a reservation holder finalized on a worker co-tenanted with a real task, the bogus subtraction floored committed_* to zero via MAX(0, ...), letting the scheduler double-book the VM. Observed in prod: two v5p-8 jobs on the same 4-chip VM, second crashing on /dev/vfio/0 busy with up to 6 retries before escaping to another slice.
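The accounting failure described above can be reproduced with a small toy model. This is only an illustrative sketch, not the iris implementation: `Worker`, `Task`, `finalize_buggy`, `finalize_fixed`, and `simulate` are hypothetical names, and a single `committed_chips` counter stands in for the real `committed_*` columns.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Toy worker; committed_chips stands in for the committed_* columns."""
    total_chips: int
    committed_chips: int = 0

    def commit(self, chips: int) -> None:
        self.committed_chips += chips

    def decommit(self, chips: int) -> None:
        # Mirrors the MAX(0, ...) floor mentioned in the PR description.
        self.committed_chips = max(0, self.committed_chips - chips)

    def free_chips(self) -> int:
        return self.total_chips - self.committed_chips


@dataclass
class Task:
    chips: int
    is_reservation_holder: bool = False


def finalize_buggy(worker: Worker, task: Task) -> None:
    # Hypothetical pre-fix behavior: every task decommits on termination.
    worker.decommit(task.chips)


def finalize_fixed(worker: Worker, task: Task) -> None:
    # Hypothetical post-fix behavior: holders never committed, so skip them.
    if not task.is_reservation_holder:
        worker.decommit(task.chips)


def simulate(finalize) -> int:
    """Return the chips the scheduler believes are free after the holder ends."""
    worker = Worker(total_chips=4)
    real_task = Task(chips=4)
    holder = Task(chips=4, is_reservation_holder=True)

    worker.commit(real_task.chips)  # the real task commits on assignment
    # The holder is assigned to the same worker but commits nothing.
    finalize(worker, holder)        # the holder terminates first
    return worker.free_chips()


if __name__ == "__main__":
    print("buggy free chips:", simulate(finalize_buggy))
    print("fixed free chips:", simulate(finalize_fixed))
```

In this model the buggy path reports all 4 chips free while the real task is still running, which is the condition that lets a second 4-chip job land on the same VM. The fixed path leaves the real task's commitment intact.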
Claude finished @rjpower's task in 1m 10s. Code review
Summary: The fix is logically sound. Issues (1):
No bugs or logic issues found.
letting the scheduler double-book the VM (seen in prod: two v5p-8 jobs on
the same 4-chip VM, with the second crashing on ``/dev/vfio/0 busy``).
"""
from iris.cluster.controller.transitions import _kill_non_terminal_tasks
Local import inside the test body. Per AGENTS.md — Code Style: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps." There is no circular-dep or optional-dep justification here — iris.cluster.controller.transitions is already imported at the top of the file.
Suggested fix: remove this line and add _kill_non_terminal_tasks to the existing top-level import block. (Noting that the file already has several pre-existing local imports that violate this rule; this comment is scoped to the new one.)
N.B. Smoke test is a known bug, fix incoming.
Part of #4878