Skip to content

[iris] Evict terminal-task resource history past 1h TTL#4850

Merged
rjpower merged 1 commit intomainfrom
iris-task-history-ttl-prune
Apr 16, 2026
Merged

[iris] Evict terminal-task resource history past 1h TTL#4850
rjpower merged 1 commit intomainfrom
iris-task-history-ttl-prune

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 16, 2026

task_resource_history accumulated ~1M rows on marin prod, ~85% for tasks already in terminal states; the existing log-downsample prune only thinned, never evicted. Extends prune_task_resource_history with a TTL pass that drops history for tasks finished more than 1h ago. On the cached marin checkpoint this cut apply_heartbeats_batch baseline p95 from 5.6s to 158ms (~35x). Adds a compound-contention benchmark (benchmark_apply_contention) and fixes clone_db to preserve UNIQUE constraints so register_worker exercises the same UPSERT path as prod.

task_resource_history accumulated ~1M rows on marin prod, ~85% for tasks
already in terminal states; the existing log-downsample prune only thinned,
never evicted. Extends prune_task_resource_history with a TTL pass that
drops history for tasks finished more than 1h ago. On the cached marin
checkpoint this cut apply_heartbeats_batch baseline p95 from 5.6s to 158ms
(~35x). Adds a compound-contention benchmark (benchmark_apply_contention)
and fixes clone_db to preserve UNIQUE constraints so register_worker runs.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 16, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 16, 2026

Claude finished @rjpower's task in 5m 21s —— View job


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Scope reviewed

  • lib/iris/src/iris/cluster/controller/transitions.py — new TTL eviction pass in prune_task_resource_history
  • lib/iris/tests/cluster/controller/test_task_resource_history.py — three new tests
  • lib/iris/scripts/benchmark_db_queries.pyclone_db DDL fix + benchmark_apply_contention

Checks performed

  • All referenced symbols resolve (TERMINAL_TASK_STATES, Timestamp, Duration, tasks.finished_at_ms column, HeartbeatApplyRequest, DispatchBatch, _build_sample_worker_metadata, etc.).
  • SQL parameter binding order matches placeholders: state IN (?,?,?...) then finished_at_ms < ? with (*TERMINAL_TASK_STATES, ttl_cutoff_ms).
  • DELETE ... WHERE task_id IN (?,...) chunk size of 5000 is within SQLite's 32,766 bound-parameter limit.
  • clone_db fix correctly reuses source CREATE TABLE DDL (preserving UNIQUE constraints) before copying rows, with explicit commit() before DETACH.
  • New tests cover the three meaningful cases: evict past TTL, keep within TTL, and don't evict a RUNNING task with stale finished_at_ms.
  • Compliant with /AGENTS.md and lib/iris/AGENTS.md: uses rigging.timing (Timestamp, Duration), context managers for transactions, top-level imports, top-level constants for the TTL and chunk size.

--- · Branch: iris-task-history-ttl-prune

@rjpower rjpower requested review from ravwojdyla and yonromai April 16, 2026 23:22
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45bf35fd59

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

terminal_placeholders = ",".join("?" for _ in TERMINAL_TASK_STATES)

evicted_terminal = 0
with self._db.transaction() as cur:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Commit each TTL delete chunk in separate transaction

prune_task_resource_history wraps the entire terminal-TTL eviction in one self._db.transaction() block, so all chunked DELETE ... IN (...) statements run under a single BEGIN IMMEDIATE write lock. Because ControllerDB.transaction() holds that lock until the context exits, the new chunking does not actually let other RPC writes interleave; with large terminal task sets this can still block heartbeat/scheduling writes for the full eviction duration and recreate multi-second contention spikes.

Useful? React with 👍 / 👎.

@rjpower rjpower merged commit e78e1dc into main Apr 16, 2026
47 of 48 checks passed
@rjpower rjpower deleted the iris-task-history-ttl-prune branch April 16, 2026 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants