Skip to content

[iris] Chunk terminal-task history eviction across transactions#4851

Open
rjpower wants to merge 2 commits intomainfrom
iris-prune-chunk-txn
Open

[iris] Chunk terminal-task history eviction across transactions#4851
rjpower wants to merge 2 commits intomainfrom
iris-prune-chunk-txn

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 16, 2026

Move the SELECT of expired task_ids to read_snapshot (no writer held) and give each DELETE chunk its own write transaction, mirroring fail_heartbeats_batch. Drop chunk size from 5000 to 1000: sweep on a 1M-row prod checkpoint shows chunk=1000 holds the writer for p95 ~400ms / max 1.3s per chunk vs 5000's 1.6s / 1.7s. Total wall is ~15% longer but the prune loop runs every 10 min — bounding worst-case writer hold is what matters for concurrent RPCs.

Also split provider-sync failure logs into transient vs terminal buckets so terminal failures surface at WARNING while transient blips stay at DEBUG.

Move the SELECT of expired task_ids to read_snapshot (no writer held) and
give each DELETE chunk its own write transaction, mirroring the pattern in
fail_heartbeats_batch. Drop chunk size from 5000 to 1000: sweep on a 1M-row
prod checkpoint shows chunk=1000 holds the writer for p95 ~400ms / max 1.3s
per chunk vs 5000's 1.6s / 1.7s. Total wall is ~15% longer but the prune
loop runs every 10 min — bounding worst-case writer hold is what matters
for concurrent RPCs.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 16, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 16, 2026

Claude finished @rjpower's task in 1m 46s —— View job


Code Review

No issues found. Checked for bugs and CLAUDE.md / AGENTS.md compliance.

Summary of what was reviewed (lib/iris/src/iris/cluster/controller/transitions.py, +16/-10):

  • SELECT of expired terminal task_ids moved from a write transaction to self._db.read_snapshot(), releasing the writer during the scan.
  • Each DELETE chunk now runs in its own self._db.transaction(), mirroring fail_heartbeats_batch (transitions.py L2240-L2260).
  • TASK_RESOURCE_HISTORY_DELETE_CHUNK reduced 5000 → 1000; docstring documents the p95/max writer-hold tradeoff from the 1M-row prod sweep.

Checks performed

  • Correctness of the snapshot/write split: the SELECT filters on state IN TERMINAL_TASK_STATES AND finished_at_ms < cutoff. Terminal states are final, so there is no risk of rows being added to task_resource_history for these ids between snapshot and delete, and DELETE ... WHERE task_id IN (...) is idempotent — partial progress on crash is safe (same invariant fail_heartbeats_batch relies on).
  • Chunk iteration bounds: range(0, len(terminal_ids), CHUNK) + slice guarantees non-empty chunks, so the IN ({ph}) placeholder string is never empty.
  • Second-pass log-downsampling block (L2832+) is untouched — intentional and still wrapped in a single write transaction, which is fine given its ~160-row upper bound.
  • AGENTS.md / iris AGENTS.md: change is consistent with the file-level guidance that "Read-only queries do NOT belong [in write transactions] — callers use db.read_snapshot() directly" (transitions.py L6). No new imports, abstractions, or compat shims introduced.
    • Branch: iris-prune-chunk-txn

@rjpower rjpower requested a review from yonromai April 17, 2026 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant