[iris] in-memory worker liveness, slim ListJobs, drop SnapshotView#5559
Merged
Conversation
Move worker last_heartbeat_ms/healthy/active/consecutive_failures/committed_* out of the workers SQLite columns into in-memory WorkerHealthTracker / WorkerCommitTracker, eliminating a per-heartbeat writer transaction that was bloating the WAL and starving dashboard reads. Migration 0042 drops the now dormant columns; ListJobs no longer materializes a 26k-row snapshot, slims JobRow 22->13 fields, and serves directly from indexed SQL. Fixes a 212s ListJobs symptom seen on prod.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2acba3ab02
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Worker liveness state (healthy/active/consecutive_failures/last_heartbeat, committed_*) now lives natively on a single public WorkerLiveness dataclass inside WorkerHealthTracker. Readers query the tracker via tracker.liveness() or tracker.liveness_many() instead of relying on the _hydrate_worker_detail overlay seam. WorkerCommitTracker is folded into WorkerHealthTracker so a single mutex covers every per-worker transient signal. Stale ExtraField defaults for healthy / active / consecutive_failures / last_heartbeat / committed_* / available_* are removed from WORKER_ROW_PROJECTION and WORKER_DETAIL_PROJECTION; WorkerRow / WorkerDetailRow now carry only durable identity columns plus the one real ExtraField (attributes, sourced from worker_attributes). A new SchedulableWorker dataclass (in db.py) bundles the durable WorkerRow columns with the live committed totals for the scheduler hot path, replacing the hydrate-then-decorate dance in healthy_active_workers_with_attributes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…istTasks timing The previous tune-up moved both transient liveness and committed_* resource totals out of the workers SQLite row into WorkerHealthTracker. Only the former belongs there: liveness is heartbeat-driven and rewriting it on every ping would serialize through the writer connection. committed_* is durable scheduling state — only mutated by the scheduler under a write transaction — and must survive controller restart so worker capacity stays correct. Fix 1: restore the four committed_* columns on the workers table. Migration 0042 now drops only the four transient liveness columns. WorkerStore's add_committed_resources / decommit_resources take a transaction cursor and issue SQL UPDATEs (the same MAX(0, …) idiom migration 0039 already uses). SchedulableWorker reads committed_* straight from WorkerRow and drops the unused healthy field; the scheduler's "if w.healthy" filter goes with it since healthy_active_workers_with_attributes already filters. Fix 2: a fresh in-memory tracker would otherwise hide every persisted worker from the scheduler until each one pinged back. ControllerStore now seeds liveness from the workers table at boot (and on checkpoint reopen) by issuing a heartbeat for every persisted worker_id. Added a regression test that constructs a store against a DB with rows and asserts both that liveness is populated and that healthy_active_workers_with_attributes returns those workers immediately. Fix 3: _tasks_for_listing left task.attempts empty, so task_to_proto's started_at / finished_at and retry status text never made it into the ListTasks response. The fixed query JOINs only the current attempt via "(ta.task_id, ta.attempt_id) IN (SELECT t.task_id, t.current_attempt_id …)" — at most one row per task, so no IN-clause blowup on long histories. A new test asserts the proto's started_at is populated and proto.attempts length is exactly 1. Goldens regenerated to include the four committed_* columns now visible in worker dumps.
…re/tracker WorkerStore.find_prunable's second branch (DB rows missing a tracker entry, guarded by a 5-minute boot-time grace window) existed only to defend against "checkpoint restore that never re-registered" — already handled by ControllerStore._seed_liveness_from_workers running both at construction and on db.replace_from(). Drop the branch, the ORPHAN_PRUNE_GRACE_MS constant, and _boot_ms; find_prunable now just scans the tracker for stale unhealthy/inactive entries. Every workers row has a tracker entry by construction (seeded at boot, registered on commit of upsert, removed on commit of remove). Also trim change-narration that crept back in: - worker_health.py module docstring: drop the "durable state lives in the SQLite workers row" non-sequitur and the "crash recovery: ControllerStore seeds…" paragraph (relocated to _seed_liveness_from_workers). - WorkerStore class/upsert docstrings: corrected the stale claim that committed_* lives in the tracker. - WorkerRow/WorkerDetailRow/SchedulableWorker docstrings: tightened. - schema.py committed_* column header and pre-WORKER_ROW_PROJECTION block comment: trimmed. - migration 0042 docstring: trimmed. - transitions.prune_old_data: dropped the redundant "tracker is source of truth" comment now that find_prunable's shape makes that obvious. - service._tasks_for_listing: dropped "loading every attempt was waste" retrospective; kept the function's actual contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
connectrpc has no client-side fallback when a server returns UNIMPLEMENTED on an unrecognized Content-Encoding, so a new client (send=zstd) talking to an older server (gzip-only) would fail every request during a rolling deployment. Iris RPC traffic is response-dominated (FetchLogs, list RPCs); request bodies are small in practice, so the compression on the way out wasn't buying much. Set send_compression=None at every client construction site (its default is gzip — None gives IdentityCompression). Accept-Encoding still advertises zstd then gzip so the response path keeps the allocation win and stays compatible with older peers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rjpower
added a commit
that referenced
this pull request
May 8, 2026
) PR #5559 dropped the four committed_* columns from the worker upsert INSERT, relying on DEFAULT 0. Fresh DBs declare those columns NOT NULL DEFAULT 0, but DBs created before #4280 (e.g. marin-dev) still carry the original NOT-NULL-only DDL, so the INSERT fails and no worker can register. List the columns explicitly with literal 0s; ON CONFLICT DO UPDATE is unchanged so the scheduler's running totals are never clobbered on re-register.
rjpower
added a commit
that referenced
this pull request
May 8, 2026
…ion (#5569) PR #5559 dropped healthy/active columns from the workers SQLite table, breaking the status page's ExecuteRawQuery aggregation (ExecuteRawQuery 500: no such column: healthy). Switch the workers source to the ListWorkers Connect RPC, paginate at 1000/page, and aggregate client-side. CPU/memory totals now reflect healthy worker capacity rather than committed-subtracted "available", since the proto does not carry committed_*.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Move worker last_heartbeat_ms/healthy/active/consecutive_failures/committed_* out of the workers SQLite columns into in-memory WorkerHealthTracker / WorkerCommitTracker, eliminating a per-heartbeat writer transaction that was bloating the WAL and starving dashboard reads. Migration 0042 drops the now dormant columns; ListJobs no longer materializes a 26k-row snapshot, slims JobRow 22 to 13 fields, and serves directly from indexed SQL. Fixes a 212s ListJobs symptom seen on prod.