Skip to content

[iris] Skip decommit for reservation-holder tasks in _kill_non_terminal_tasks#4878

Merged
rjpower merged 1 commit intomainfrom
fix-reservation-holder-decommit-leak
Apr 17, 2026
Merged

[iris] Skip decommit for reservation-holder tasks in _kill_non_terminal_tasks#4878
rjpower merged 1 commit intomainfrom
fix-reservation-holder-decommit-leak

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 17, 2026

Reservation-holder tasks never commit resources on assignment, so they must not decommit on termination. When a reservation holder finalized on a worker co-tenanted with a real task, the bogus subtraction floored committed_* to zero via MAX(0, ...), letting the scheduler double-book the VM. Observed in prod: two v5p-8 jobs on the same 4-chip VM, second crashing on /dev/vfio/0 busy with up to 6 retries before escaping to another slice.

Part of #4878

…al_tasks

Reservation-holder tasks never commit resources on assignment, so they must
not decommit on termination. When a reservation holder finalized on a worker
co-tenanted with a real task, the bogus subtraction floored committed_* to
zero via MAX(0, ...), letting the scheduler double-book the VM. Observed in
prod: two v5p-8 jobs on the same 4-chip VM, second crashing on /dev/vfio/0
busy with up to 6 retries before escaping to another slice.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 17, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 17, 2026

Claude finished @rjpower's task in 1m 10s —— View job


Code review

  • Check if PR should be reviewed
  • Find relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Run parallel review agents (2 CLAUDE.md, 2 bug/logic)
  • Validate flagged issues
  • Post review

Summary: The fix is logically sound. _kill_non_terminal_tasks now skips decommit for reservation-holder tasks, preventing the committed_* underflow that caused v5p-8 double-booking in prod. task_kill_workers is still populated so callers can still send StopTasks RPCs. The new regression test exercises the exact finalization path and asserts co-tenant counters are preserved.

Issues (1):

  • Local import in test_kill_non_terminal_tasks is imported inside the test body, but the transitions module is already imported at the top. Minor AGENTS.md violation.

No bugs or logic issues found.
· Branch

letting the scheduler double-book the VM (seen in prod: two v5p-8 jobs on
the same 4-chip VM, with the second crashing on ``/dev/vfio/0 busy``).
"""
from iris.cluster.controller.transitions import _kill_non_terminal_tasks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local import inside the test body. Per AGENTS.md — Code Style: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps." There is no circular-dep or optional-dep justification here — iris.cluster.controller.transitions is already imported at the top of the file.

Suggested fix: remove this line and add _kill_non_terminal_tasks to the existing top-level import block. (Noting that the file already has several pre-existing local imports that violate this rule; this comment is scoped to the new one.)

@rjpower rjpower requested a review from yonromai April 17, 2026 22:38
@rjpower rjpower merged commit 33074ce into main Apr 17, 2026
48 of 50 checks passed
@rjpower rjpower deleted the fix-reservation-holder-decommit-leak branch April 17, 2026 23:03
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 17, 2026

N.B. Smoke test is a known bug, fix incoming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants