[iris] Chunk terminal-task history eviction across transactions by rjpower · Pull Request #4851 · marin-community/marin

rjpower · 2026-04-16T23:36:56Z

Move the SELECT of expired task_ids to read_snapshot (no writer held) and give each DELETE chunk its own write transaction, mirroring fail_heartbeats_batch. Drop chunk size from 5000 to 1000: sweep on a 1M-row prod checkpoint shows chunk=1000 holds the writer for p95 ~400ms / max 1.3s per chunk vs 5000's 1.6s / 1.7s. Total wall is ~15% longer but the prune loop runs every 10 min — bounding worst-case writer hold is what matters for concurrent RPCs.

Also split provider-sync failure logs into transient vs terminal buckets so terminal failures surface at WARNING while transient blips stay at DEBUG.

Move the SELECT of expired task_ids to read_snapshot (no writer held) and give each DELETE chunk its own write transaction, mirroring the pattern in fail_heartbeats_batch. Drop chunk size from 5000 to 1000: sweep on a 1M-row prod checkpoint shows chunk=1000 holds the writer for p95 ~400ms / max 1.3s per chunk vs 5000's 1.6s / 1.7s. Total wall is ~15% longer but the prune loop runs every 10 min — bounding worst-case writer hold is what matters for concurrent RPCs.

claude · 2026-04-16T23:37:19Z

Claude finished @rjpower's task in 1m 46s —— View job

Code Review

No issues found. Checked for bugs and CLAUDE.md / AGENTS.md compliance.

Summary of what was reviewed (lib/iris/src/iris/cluster/controller/transitions.py, +16/-10):

SELECT of expired terminal task_ids moved from a write transaction to self._db.read_snapshot(), releasing the writer during the scan.
Each DELETE chunk now runs in its own self._db.transaction(), mirroring fail_heartbeats_batch (transitions.py L2240-L2260).
TASK_RESOURCE_HISTORY_DELETE_CHUNK reduced 5000 → 1000; docstring documents the p95/max writer-hold tradeoff from the 1M-row prod sweep.

Checks performed

Correctness of the snapshot/write split: the SELECT filters on state IN TERMINAL_TASK_STATES AND finished_at_ms < cutoff. Terminal states are final, so there is no risk of rows being added to task_resource_history for these ids between snapshot and delete, and DELETE ... WHERE task_id IN (...) is idempotent — partial progress on crash is safe (same invariant fail_heartbeats_batch relies on).
Chunk iteration bounds: range(0, len(terminal_ids), CHUNK) + slice guarantees non-empty chunks, so the IN ({ph}) placeholder string is never empty.
Second-pass log-downsampling block (L2832+) is untouched — intentional and still wrapped in a single write transaction, which is fine given its ~160-row upper bound.
AGENTS.md / iris AGENTS.md: change is consistent with the file-level guidance that "Read-only queries do NOT belong [in write transactions] — callers use db.read_snapshot() directly" (transitions.py L6). No new imports, abstractions, or compat shims introduced.
• Branch: iris-prune-chunk-txn

rjpower added the agent-generated Created by automation/agent label Apr 16, 2026

Less confusing log messages.

223d1c4

rjpower requested a review from yonromai April 17, 2026 01:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] Chunk terminal-task history eviction across transactions#4851

[iris] Chunk terminal-task history eviction across transactions#4851
rjpower wants to merge 2 commits intomainfrom
iris-prune-chunk-txn

rjpower commented Apr 16, 2026 •

edited

Loading

Uh oh!

claude bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rjpower commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rjpower commented Apr 16, 2026 •

edited

Loading

claude bot commented Apr 16, 2026 •

edited

Loading