[iris] in-memory worker liveness, slim ListJobs, drop SnapshotView by rjpower · Pull Request #5559 · marin-community/marin

rjpower · 2026-05-08T01:58:50Z

Move worker last_heartbeat_ms/healthy/active/consecutive_failures/committed_* out of the workers SQLite columns into in-memory WorkerHealthTracker / WorkerCommitTracker, eliminating a per-heartbeat writer transaction that was bloating the WAL and starving dashboard reads. Migration 0042 drops the now dormant columns; ListJobs no longer materializes a 26k-row snapshot, slims JobRow 22 to 13 fields, and serves directly from indexed SQL. Fixes a 212s ListJobs symptom seen on prod.

Move worker last_heartbeat_ms/healthy/active/consecutive_failures/committed_* out of the workers SQLite columns into in-memory WorkerHealthTracker / WorkerCommitTracker, eliminating a per-heartbeat writer transaction that was bloating the WAL and starving dashboard reads. Migration 0042 drops the now dormant columns; ListJobs no longer materializes a 26k-row snapshot, slims JobRow 22->13 fields, and serves directly from indexed SQL. Fixes a 212s ListJobs symptom seen on prod.

claude · 2026-05-08T01:59:04Z

Code Review

Skip-check (closed/draft/already-reviewed)
Locate relevant CLAUDE.md / AGENTS.md files
Summarize PR changes
Run 4 parallel reviewers (2x AGENTS compliance, 2x bug-hunters)
Validate each flagged issue
Post inline comments for confirmed issues + summary

View job run

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2acba3ab02

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Worker liveness state (healthy/active/consecutive_failures/last_heartbeat, committed_*) now lives natively on a single public WorkerLiveness dataclass inside WorkerHealthTracker. Readers query the tracker via tracker.liveness() or tracker.liveness_many() instead of relying on the _hydrate_worker_detail overlay seam. WorkerCommitTracker is folded into WorkerHealthTracker so a single mutex covers every per-worker transient signal. Stale ExtraField defaults for healthy / active / consecutive_failures / last_heartbeat / committed_* / available_* are removed from WORKER_ROW_PROJECTION and WORKER_DETAIL_PROJECTION; WorkerRow / WorkerDetailRow now carry only durable identity columns plus the one real ExtraField (attributes, sourced from worker_attributes). A new SchedulableWorker dataclass (in db.py) bundles the durable WorkerRow columns with the live committed totals for the scheduler hot path, replacing the hydrate-then-decorate dance in healthy_active_workers_with_attributes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…istTasks timing The previous tune-up moved both transient liveness and committed_* resource totals out of the workers SQLite row into WorkerHealthTracker. Only the former belongs there: liveness is heartbeat-driven and rewriting it on every ping would serialize through the writer connection. committed_* is durable scheduling state — only mutated by the scheduler under a write transaction — and must survive controller restart so worker capacity stays correct. Fix 1: restore the four committed_* columns on the workers table. Migration 0042 now drops only the four transient liveness columns. WorkerStore's add_committed_resources / decommit_resources take a transaction cursor and issue SQL UPDATEs (the same MAX(0, …) idiom migration 0039 already uses). SchedulableWorker reads committed_* straight from WorkerRow and drops the unused healthy field; the scheduler's "if w.healthy" filter goes with it since healthy_active_workers_with_attributes already filters. Fix 2: a fresh in-memory tracker would otherwise hide every persisted worker from the scheduler until each one pinged back. ControllerStore now seeds liveness from the workers table at boot (and on checkpoint reopen) by issuing a heartbeat for every persisted worker_id. Added a regression test that constructs a store against a DB with rows and asserts both that liveness is populated and that healthy_active_workers_with_attributes returns those workers immediately. Fix 3: _tasks_for_listing left task.attempts empty, so task_to_proto's started_at / finished_at and retry status text never made it into the ListTasks response. The fixed query JOINs only the current attempt via "(ta.task_id, ta.attempt_id) IN (SELECT t.task_id, t.current_attempt_id …)" — at most one row per task, so no IN-clause blowup on long histories. A new test asserts the proto's started_at is populated and proto.attempts length is exactly 1. Goldens regenerated to include the four committed_* columns now visible in worker dumps.

…re/tracker WorkerStore.find_prunable's second branch (DB rows missing a tracker entry, guarded by a 5-minute boot-time grace window) existed only to defend against "checkpoint restore that never re-registered" — already handled by ControllerStore._seed_liveness_from_workers running both at construction and on db.replace_from(). Drop the branch, the ORPHAN_PRUNE_GRACE_MS constant, and _boot_ms; find_prunable now just scans the tracker for stale unhealthy/inactive entries. Every workers row has a tracker entry by construction (seeded at boot, registered on commit of upsert, removed on commit of remove). Also trim change-narration that crept back in: - worker_health.py module docstring: drop the "durable state lives in the SQLite workers row" non-sequitur and the "crash recovery: ControllerStore seeds…" paragraph (relocated to _seed_liveness_from_workers). - WorkerStore class/upsert docstrings: corrected the stale claim that committed_* lives in the tracker. - WorkerRow/WorkerDetailRow/SchedulableWorker docstrings: tightened. - schema.py committed_* column header and pre-WORKER_ROW_PROJECTION block comment: trimmed. - migration 0042 docstring: trimmed. - transitions.prune_old_data: dropped the redundant "tracker is source of truth" comment now that find_prunable's shape makes that obvious. - service._tasks_for_listing: dropped "loading every attempt was waste" retrospective; kept the function's actual contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

connectrpc has no client-side fallback when a server returns UNIMPLEMENTED on an unrecognized Content-Encoding, so a new client (send=zstd) talking to an older server (gzip-only) would fail every request during a rolling deployment. Iris RPC traffic is response-dominated (FetchLogs, list RPCs); request bodies are small in practice, so the compression on the way out wasn't buying much. Set send_compression=None at every client construction site (its default is gzip — None gives IdentityCompression). Accept-Encoding still advertises zstd then gzip so the response path keeps the allocation win and stays compatible with older peers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

) PR #5559 dropped the four committed_* columns from the worker upsert INSERT, relying on DEFAULT 0. Fresh DBs declare those columns NOT NULL DEFAULT 0, but DBs created before #4280 (e.g. marin-dev) still carry the original NOT-NULL-only DDL, so the INSERT fails and no worker can register. List the columns explicitly with literal 0s; ON CONFLICT DO UPDATE is unchanged so the scheduler's running totals are never clobbered on re-register.

…ion (#5569) PR #5559 dropped healthy/active columns from the workers SQLite table, breaking the status page's ExecuteRawQuery aggregation (ExecuteRawQuery 500: no such column: healthy). Switch the workers source to the ListWorkers Connect RPC, paginate at 1000/page, and aggregate client-side. CPU/memory totals now reflect healthy worker capacity rather than committed-subtracted "available", since the proto does not carry committed_*.

rjpower added the agent-generated Created by automation/agent label May 8, 2026

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread lib/iris/src/iris/cluster/controller/db.py

Comment thread lib/iris/src/iris/cluster/controller/service.py Outdated

rjpower and others added 5 commits May 8, 2026 02:26

[iris] strip change-narration comments from perf tune-up

10aa760

rjpower requested a review from yonromai May 8, 2026 17:17

rjpower enabled auto-merge (squash) May 8, 2026 17:17

Cleanup Makefile.

db5da8e

rjpower disabled auto-merge May 8, 2026 17:41

rjpower merged commit 040f975 into main May 8, 2026
35 checks passed

rjpower deleted the rjpower/iris-perf-tuneup branch May 8, 2026 17:41

rjpower mentioned this pull request May 8, 2026

[iris] Restore committed_* in worker upsert INSERT for legacy DBs #5568

Merged

rjpower mentioned this pull request May 8, 2026

[infra] Fix status page workers panel after in-memory liveness migration #5569

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] in-memory worker liveness, slim ListJobs, drop SnapshotView#5559

[iris] in-memory worker liveness, slim ListJobs, drop SnapshotView#5559
rjpower merged 7 commits into
mainfrom
rjpower/iris-perf-tuneup

rjpower commented May 8, 2026

Uh oh!

claude Bot commented May 8, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rjpower commented May 8, 2026

Uh oh!

claude Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 8, 2026 •

edited

Loading