Skip to content

[iris] Low-risk simplification: decompose controller, fix leaks, consolidate mutations#4412

Merged
rjpower merged 5 commits intomainfrom
claude/clever-agnesi
Apr 5, 2026
Merged

[iris] Low-risk simplification: decompose controller, fix leaks, consolidate mutations#4412
rjpower merged 5 commits intomainfrom
claude/clever-agnesi

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 4, 2026

Replace _checkpoint_in_progress bool with threading.Event to fix a
cross-thread visibility race on the checkpoint pause flag. Fix 12+ RPC
client resource leaks across 6 CLI files by introducing an rpc_client()
context manager. Document the orphaned TASK_STATE_PREEMPTED in
task-states.md and add 3 preemption tests.

Decompose _run_scheduling() (190 lines) into 6 phase methods and
_sync_all_execution_units() (108 lines) into 5 helpers. Extract
_terminate_task(), _assign_task(), _remove_worker() as top-level helpers
in transitions.py to centralize multi-table invariants (worker column
clearing, endpoint cleanup, resource decommit) previously copy-pasted
across 6-9 call sites. Move 4 SQL boilerplate helpers into queries.py.
Rename test_kind_gpu_canary.py to test_iris_kind.py.

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 4, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 4, 2026

Claude finished @rjpower's task in 5m 36s —— View job


Review of PR #4412: [iris] Low-risk simplification

  • Read all changed files and understand the changes
  • Review controller decomposition
  • Review transitions.py mutation consolidation
  • Review queries.py extraction
  • Review CLI resource leak fixes
  • Review checkpoint race fix
  • Review preemption tests
  • Review docs changes
  • Post final review

Bug: _terminate_task sets finished_at_ms on PENDING retries (regression)

Severity: High — behavioral regression that corrupts task metadata.

_terminate_task at transitions.py:393 computes:

finished_at_ms = None if state in ACTIVE_TASK_STATES else now_ms

ACTIVE_TASK_STATES is {ASSIGNED, BUILDING, RUNNING} — it does not include PENDING. So when preempt_task (line 2058) or _on_worker_failed (line 1823) calls _terminate_task with state=TASK_STATE_PENDING (retry path), the helper sets finished_at_ms = now_ms instead of NULL.

The original code at lines 1723 and 1989 of main explicitly handled this:

finished_ms = None if new_state == cluster_pb2.TASK_STATE_PENDING else now_ms

Fix: Change line 393 to also exclude PENDING:

finished_at_ms = None if (state in ACTIVE_TASK_STATES or state == cluster_pb2.TASK_STATE_PENDING) else now_ms

The test test_preempted_task_retries should also assert finished_at_ms is None to prevent future regressions.

Fix this →


Other findings

Checkpoint Event — Good improvement

The threading.Event for _checkpoint_paused (controller.py:1073) is a clear improvement over a plain bool. The set()/clear() in begin_checkpoint (line 2211/2231) with try/finally is correct, and the is_set() checks in scheduling/autoscaler loops are properly placed.

RPC client context manager — Clean fix

The rpc_client() context manager in cli/main.py:77-89 is well-structured. All 12+ call sites across the CLI files now use with rpc_client(...) correctly — verified no remaining leaks.

Controller decomposition — Well-structured

The new dataclasses (_SchedulingStateRead, _GatedCandidates, _SchedulingOrder, _SyncFailureAccumulator) provide clear phase boundaries. The frozen variants enforce immutability at phase transitions.

queries.py — Clean extraction

The four SQL helpers are straightforward, parameterized, and appropriately scoped. No SQL injection concerns — all values are bound via ? placeholders.

Dynamic SQL in _terminate_task (transitions.py:428)

f"UPDATE tasks SET {', '.join(set_clauses)} WHERE task_id = ?"

The set_clauses are built from hardcoded string literals, so this is safe. However, a comment noting that set_clauses must only contain literals would help future maintainers. Minor nit.

_remove_worker — Good consolidation

The four-statement invariant (attempts → tasks → dispatch_queue → workers) correctly prevents dangling FK references. This was previously scattered and easy to get wrong.

Task-states.md diagram

The new PREEMPTED diagram block (lines 58-68) has trailing | pipe characters that appear to be carried over from the adjacent retry diagram but aren't connected to anything — just a minor formatting artifact.

Tests — Solid coverage

The 3 new preemption integration tests (test_preempted_task_retries, test_preempted_task_exhausted_retries, test_preemption_cascades_coscheduled_siblings) cover the important state transitions. The unit tests for _run_preemption_pass are thorough (same-band rejection, coscheduled immunity, cheapest-victim selection, budget downgrade).


Summary

One blocking bug: _terminate_task incorrectly stamps finished_at_ms on PENDING retries, which is a behavioral regression from the pre-refactor code. The rest of the PR is solid — the decomposition, leak fixes, and test additions are all good improvements.

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 4, 2026

Specification

Problem

  1. _checkpoint_in_progress is a plain bool written by the checkpoint thread and read by 5 background loops with no synchronization. A stale read lets scheduling proceed mid-checkpoint, producing an inconsistent snapshot.

  2. RPC clients in cli/cluster.py, process_status.py, and other CLI files are created without .close(), leaking file descriptors and connection state.

  3. TASK_STATE_PREEMPTED (proto value 10) exists in the state machine but is absent from task-states.md and has zero test coverage.

  4. _run_scheduling() (190 lines) and _sync_all_execution_units() (108 lines) concatenate 5-6 distinct phases into single methods, making per-phase testing impossible.

  5. Multi-table invariants (clear current_worker_id + current_worker_address, delete endpoints, decommit resources) are copy-pasted across 6-9 call sites in transitions.py. Forgetting any column at one site creates a dangling reference.

Approach

Checkpoint fix (controller.py): Swap self._checkpoint_in_progress bool for self._checkpoint_paused = threading.Event(). Update 2 set/clear sites and 5 is_set() check sites.

RPC leaks (cli/): Add rpc_client() context manager in cli/main.py. Replace all bare client construction across cluster.py (7 sites), process_status.py (3 sites), task.py, query.py, rpc.py with with rpc_client(...) as client:.

PREEMPTED docs+tests (docs/task-states.md, test_preemption.py): Add PREEMPTED to state table, transition diagram, retry semantics, and terminal state summary. Add test_preempt_task_retries_when_budget_remains, test_preempt_task_terminal_when_budget_exhausted, test_preempt_task_cascades_coscheduled_siblings.

Controller decomposition (controller.py): Extract _run_scheduling into _refresh_reservation_claims, _read_scheduling_state, _apply_scheduling_gates, _compute_scheduling_order, _run_scheduler_pass, _apply_preemptions. Extract _sync_all_execution_units into _separate_sync_results, _apply_successful_heartbeats, _handle_failed_heartbeats, _handle_sibling_worker_failures, _log_sync_health_summary. Frozen dataclasses (_SchedulingStateRead, _GatedCandidates, _SchedulingOrder) and a mutable _SyncFailureAccumulator carry data between phases.

Mutation helpers (transitions.py): _terminate_task(cur, task_id, attempt_id, state, error, now_ms, *, attempt_state, worker_id, resources, failure_count, preemption_count) centralizes attempt finalization + task state update + worker column clearing + endpoint deletion + resource decommit. _assign_task(cur, task_id, worker_id, worker_address, attempt_id, now_ms) centralizes attempt creation + denormalized column sync. _remove_worker(cur, worker_id) centralizes the 4-statement worker teardown. SQL boilerplate (delete_task_endpoints, enqueue_run_dispatch, enqueue_kill_dispatch, insert_task_attempt) moved to queries.py.

Tests

3 new preemption tests exercise the PREEMPTED state through the existing ControllerTestHarness. All 1776 existing tests pass. No mocks added; tests use the real DB and transition engine.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1149afa299

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1608 to +1611
preemptions = self._apply_preemptions(
order.ordered_task_ids,
gated.jobs,
order.task_band_map,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pass taint-injected requirements to preemption pass

_run_scheduler_pass() now applies reservation taint constraints into a local modified_jobs, but _apply_preemptions() is still called with gated.jobs (the unmodified requirements). In clusters using direct reservations, this allows preemption candidates to be evaluated without the required reservation EQ constraint, so a reserved task can evict lower-priority work on unclaimed workers even though it can never run there. This introduces avoidable task churn and incorrect preemption decisions compared to the previous behavior where preemption used taint-injected requirements.

Useful? React with 👍 / 👎.

@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 4, 2026

Code review

Found 2 issues:

  1. _terminate_task sets finished_at_ms = now_ms when a task retries to PENDING (behavioral regression). The condition finished_at_ms = None if state in ACTIVE_TASK_STATES else now_ms does not account for PENDING, which is not in ACTIVE_TASK_STATES. The old code in _remove_failed_worker and preempt_task explicitly computed finished_ms = None if new_state == PENDING else now_ms. Tasks retrying after worker failure or preemption will now have a spurious finished_at_ms timestamp.

"""
finished_at_ms = None if state in ACTIVE_TASK_STATES else now_ms
effective_attempt_state = attempt_state if attempt_state is not None else state

  1. _apply_preemptions() receives gated.jobs (un-tainted) instead of the taint-constraint-injected modified_jobs. In the old monolithic _run_scheduling, jobs was reassigned to the tainted version via jobs = _inject_taint_constraints(...), so all downstream phases (scheduler, preemption, diagnostics) used the same tainted dict. After decomposition, _run_scheduler_pass creates modified_jobs internally but does not return it. The preemption pass then evaluates candidates without reservation NOT_EXISTS constraints, potentially allowing non-reservation tasks to incorrectly evict work on unclaimed workers. Same issue affects _cache_scheduling_diagnostics.

preemptions = self._apply_preemptions(
order.ordered_task_ids,
gated.jobs,
order.task_band_map,
all_assignments,
claims,
order.user_spend,
order.user_budget_limits,
context,
)
self._cache_scheduling_diagnostics(context, gated.jobs, all_assignments, order.ordered_task_ids)

Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@rjpower rjpower requested a review from yonromai April 4, 2026 19:09
rjpower added 4 commits April 4, 2026 12:10
…tests

- Replace _checkpoint_in_progress bool with threading.Event to eliminate
  cross-thread visibility race on checkpoint pause flag.
- Fix RPC client resource leaks across 6 CLI files (12+ sites). Add
  rpc_client() context manager in cli/main.py that guarantees cleanup.
- Document TASK_STATE_PREEMPTED in task-states.md (was orphaned from docs)
  and add 3 tests covering retry-with-budget, terminal-at-exhaustion, and
  coscheduled sibling cascade.
- Decompose _run_scheduling() (190 lines) into 6 named phase methods with
  typed intermediate dataclasses for inter-phase data flow.
- Decompose _sync_all_execution_units() (108 lines) into 5 helpers with a
  mutable accumulator for failure tracking across phases.
- Extract 4 repeated SQL mutation helpers (delete_task_endpoints,
  enqueue_run_dispatch, enqueue_kill_dispatch, insert_task_attempt) into
  new queries.py module, replacing 20 inline call sites in transitions.py.
- Add kind-based K8s local test fixtures (session-scoped kind cluster) and
  3 pod lifecycle smoke tests behind @pytest.mark.kind marker.
…, rename test file

- Add _terminate_task(), _assign_task(), _remove_worker() as top-level
  helpers in transitions.py. These enforce multi-table invariants (worker
  column clearing, endpoint cleanup, resource decommit) that were
  previously copy-pasted across 6-9 call sites each.
- Remove lib/iris/tests/e2e/kind_fixtures.py and test_k8s_local.py which
  duplicated the existing tests/integration/iris/test_kind_gpu_canary.py
  with weaker coverage.
- Revert kind marker from pyproject.toml and TESTING.md (not needed).
- Rename test_kind_gpu_canary.py to test_iris_kind.py.
Fix two bugs introduced by the decomposition:

1. _terminate_task set finished_at_ms = now_ms for tasks retrying to
   PENDING (not in ACTIVE_TASK_STATES). The old code explicitly checked
   for PENDING and set finished_at_ms = None. Add PENDING to the
   condition so retried tasks do not get a spurious finished timestamp.

2. _run_scheduler_pass created modified_jobs with taint constraints but
   did not return it. _apply_preemptions and _cache_scheduling_diagnostics
   received the un-tainted gated.jobs, dropping reservation NOT_EXISTS
   constraints from preemption evaluation. Return modified_jobs from
   _run_scheduler_pass and pass it downstream.
The separate queries.py module added indirection without meaningful
reuse outside transitions.py. Move the 4 helpers (delete_task_endpoints,
enqueue_run_dispatch, enqueue_kill_dispatch, insert_task_attempt)
directly into transitions.py alongside the compound mutation helpers
that call them.
@rjpower rjpower force-pushed the claude/clever-agnesi branch from f5d7b11 to a1fbcba Compare April 4, 2026 19:10
- rpc_client(): drop redundant @contextmanager wrapper since
  ConnectClientSync already implements __enter__/__exit__
- _run_scheduler_pass / _apply_preemptions: accept dataclass objects
  instead of destructured fields (8 params → 5 each)
- Add thread-safety comment on _checkpoint_paused Event
@rjpower rjpower force-pushed the claude/clever-agnesi branch from a1fbcba to 106a290 Compare April 4, 2026 19:17
@rjpower rjpower enabled auto-merge (squash) April 4, 2026 20:22
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the current head.

I re-reviewed the latest revision after the earlier automated findings. The two prior blocking issues around retry finished_at_ms handling and reservation taint propagation are fixed on 106a29058d9a099f3d40f493bba8abd883fb4419, and I did not find a new blocker on top of that. I also ran the targeted controller test slice locally and it passed (144 passed).

Generated with Codex.

@rjpower rjpower merged commit 49d880c into main Apr 5, 2026
38 of 39 checks passed
@rjpower rjpower deleted the claude/clever-agnesi branch April 5, 2026 01:02
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…olidate mutations (#4412)

Replace _checkpoint_in_progress bool with threading.Event to fix a
cross-thread visibility race on the checkpoint pause flag. Fix 12+ RPC
client resource leaks across 6 CLI files by introducing an rpc_client()
context manager. Document the orphaned TASK_STATE_PREEMPTED in
task-states.md and add 3 preemption tests.

Decompose _run_scheduling() (190 lines) into 6 phase methods and
_sync_all_execution_units() (108 lines) into 5 helpers. Extract
_terminate_task(), _assign_task(), _remove_worker() as top-level helpers
in transitions.py to centralize multi-table invariants (worker column
clearing, endpoint cleanup, resource decommit) previously copy-pasted
across 6-9 call sites. Move 4 SQL boilerplate helpers into queries.py.
Rename test_kind_gpu_canary.py to test_iris_kind.py.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants