[iris] Evict terminal-task resource history past 1h TTL #4850
task_resource_history accumulated ~1M rows on marin prod, ~85% for tasks already in terminal states; the existing log-downsample prune only thinned, never evicted. Extends prune_task_resource_history with a TTL pass that drops history for tasks finished more than 1h ago. On the cached marin checkpoint this cut apply_heartbeats_batch baseline p95 from 5.6s to 158ms (~35x). Adds a compound-contention benchmark (benchmark_apply_contention) and fixes clone_db to preserve UNIQUE constraints so register_worker runs.
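The TTL pass described above can be sketched as a single delete keyed on task state and finish time. This is a minimal illustration, not the code from this PR: the `tasks` table, its `state` and `finished_at` columns, the state names, and the helper `evict_terminal_history` are all assumptions made for the example.

```python
import sqlite3

# Hypothetical state names and schema; the real iris tables are not shown in this PR text.
TERMINAL_TASK_STATES = ("SUCCEEDED", "FAILED", "KILLED")
TTL_SECONDS = 3600  # the 1h TTL from the PR description


def evict_terminal_history(conn: sqlite3.Connection, now: float) -> int:
    """Delete history rows for tasks that finished more than TTL_SECONDS ago."""
    placeholders = ",".join("?" for _ in TERMINAL_TASK_STATES)
    cutoff = now - TTL_SECONDS
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            f"""
            DELETE FROM task_resource_history
            WHERE task_id IN (
                SELECT task_id FROM tasks
                WHERE state IN ({placeholders}) AND finished_at < ?
            )
            """,
            (*TERMINAL_TASK_STATES, cutoff),
        )
        return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (task_id TEXT PRIMARY KEY, state TEXT, finished_at REAL);
    CREATE TABLE task_resource_history (task_id TEXT, ts REAL, cpu REAL);
    """
)
now = 10_000.0
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        ("old-done", "SUCCEEDED", now - 7200),  # terminal, past TTL: evicted
        ("recent-done", "FAILED", now - 60),    # terminal, inside TTL: kept
        ("running", "RUNNING", None),           # not terminal: kept
    ],
)
conn.executemany(
    "INSERT INTO task_resource_history VALUES (?, ?, ?)",
    [(t, float(i), 0.5) for t in ("old-done", "recent-done", "running") for i in range(3)],
)

evicted = evict_terminal_history(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM task_resource_history").fetchone()[0]
```

Only the task that finished two hours before `now` loses its history; recently finished and still-running tasks keep theirs, which is the behavior the downsample-only prune could not provide.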
Claude finished @rjpower's task in 5m 21s. Code review: no issues found. Checked for bugs and CLAUDE.md compliance.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 45bf35fd59
    terminal_placeholders = ",".join("?" for _ in TERMINAL_TASK_STATES)

    evicted_terminal = 0
    with self._db.transaction() as cur:
Commit each TTL delete chunk in separate transaction
prune_task_resource_history wraps the entire terminal-TTL eviction in one self._db.transaction() block, so all chunked DELETE ... IN (...) statements run under a single BEGIN IMMEDIATE write lock. Because ControllerDB.transaction() holds that lock until the context exits, the new chunking does not actually let other RPC writes interleave; with large terminal task sets this can still block heartbeat/scheduling writes for the full eviction duration and recreate multi-second contention spikes.
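One way to apply this suggestion is to open and commit a transaction per chunk, so the write lock is released between chunks. The sketch below uses plain `sqlite3` rather than `ControllerDB`, whose API is not shown here; `evict_in_chunks`, `chunked`, and `CHUNK_SIZE` are hypothetical names, and the single-column schema is an assumption.

```python
import sqlite3
from typing import Iterator, List, Sequence

CHUNK_SIZE = 500  # hypothetical; kept below SQLite's bound-parameter limit


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def evict_in_chunks(
    conn: sqlite3.Connection, task_ids: Sequence[str], chunk_size: int = CHUNK_SIZE
) -> int:
    """Delete history for task_ids, committing after every chunk."""
    evicted = 0
    for chunk in chunked(task_ids, chunk_size):
        placeholders = ",".join("?" for _ in chunk)
        conn.execute("BEGIN IMMEDIATE")  # take the write lock for this chunk only
        try:
            cur = conn.execute(
                f"DELETE FROM task_resource_history WHERE task_id IN ({placeholders})",
                chunk,
            )
            evicted += cur.rowcount
            conn.execute("COMMIT")  # release the lock so other writers can interleave
        except Exception:
            conn.execute("ROLLBACK")
            raise
    return evicted


# isolation_level=None puts the connection in autocommit mode so the explicit
# BEGIN IMMEDIATE / COMMIT statements above control the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE task_resource_history (task_id TEXT, ts REAL)")
conn.executemany(
    "INSERT INTO task_resource_history VALUES (?, ?)",
    [(f"task-{n}", float(i)) for n in range(5) for i in range(2)],
)

evicted = evict_in_chunks(conn, [f"task-{n}" for n in range(4)], chunk_size=2)
remaining = conn.execute("SELECT COUNT(*) FROM task_resource_history").fetchone()[0]
```

The trade-off is that eviction is no longer atomic: a failure partway through leaves earlier chunks deleted. For a TTL prune that is re-run periodically this is usually acceptable, since the next pass picks up whatever remains.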
task_resource_history accumulated ~1M rows on marin prod, ~85% for tasks already in terminal states; the existing log-downsample prune only thinned, never evicted. Extends prune_task_resource_history with a TTL pass that drops history for tasks finished more than 1h ago. On the cached marin checkpoint this cut apply_heartbeats_batch baseline p95 from 5.6s to 158ms (~35x). Adds a compound-contention benchmark (benchmark_apply_contention) and fixes clone_db to preserve UNIQUE constraints so register_worker exercises the same UPSERT path as prod.
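To illustrate why the UNIQUE constraints matter for the UPSERT path: SQLite rejects `ON CONFLICT(col) DO UPDATE` when `col` has no PRIMARY KEY or UNIQUE constraint. The PR text does not show how clone_db lost the constraints, so the sketch below is only an assumption-laden illustration; the `workers` table, its columns, and both clone helpers are hypothetical.

```python
import sqlite3

# Hypothetical UPSERT of the kind a worker registration might issue.
# UPSERT syntax needs SQLite 3.24 or newer.
UPSERT = (
    "INSERT INTO workers (worker_id, address) VALUES (?, ?) "
    "ON CONFLICT(worker_id) DO UPDATE SET address = excluded.address"
)


def clone_columns_only(src: sqlite3.Connection, dst: sqlite3.Connection) -> None:
    """Copy rows into a table that has the same columns but no constraints."""
    dst.execute("CREATE TABLE workers (worker_id TEXT, address TEXT)")
    rows = src.execute("SELECT worker_id, address FROM workers").fetchall()
    dst.executemany("INSERT INTO workers VALUES (?, ?)", rows)


def clone_with_schema(src: sqlite3.Connection, dst: sqlite3.Connection) -> None:
    """Replay the original CREATE TABLE statement so constraints survive."""
    (ddl,) = src.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'workers'"
    ).fetchone()
    dst.execute(ddl)
    rows = src.execute("SELECT worker_id, address FROM workers").fetchall()
    dst.executemany("INSERT INTO workers VALUES (?, ?)", rows)


def try_upsert(conn: sqlite3.Connection) -> bool:
    """Return True if the UPSERT is accepted, False if SQLite rejects it."""
    try:
        conn.execute(UPSERT, ("w1", "host-b"))
        return True
    except sqlite3.OperationalError:
        # Raised when the conflict target matches no PRIMARY KEY or UNIQUE constraint.
        return False


src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE workers (worker_id TEXT UNIQUE, address TEXT)")
src.execute("INSERT INTO workers VALUES ('w1', 'host-a')")

lossy = sqlite3.connect(":memory:")
clone_columns_only(src, lossy)
faithful = sqlite3.connect(":memory:")
clone_with_schema(src, faithful)

lossy_ok = try_upsert(lossy)
faithful_ok = try_upsert(faithful)
faithful_rows = faithful.execute("SELECT worker_id, address FROM workers").fetchall()
```

In the constraint-free clone the statement fails before touching any rows, so a benchmark built on such a clone would never exercise the update branch that production runs.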