Skip to content

[iris] in-memory worker liveness, slim ListJobs, drop SnapshotView#5559

Merged
rjpower merged 7 commits into
mainfrom
rjpower/iris-perf-tuneup
May 8, 2026
Merged

[iris] in-memory worker liveness, slim ListJobs, drop SnapshotView#5559
rjpower merged 7 commits into
mainfrom
rjpower/iris-perf-tuneup

Conversation

@rjpower

@rjpower rjpower commented May 8, 2026

Copy link
Copy Markdown
Collaborator

Move worker last_heartbeat_ms/healthy/active/consecutive_failures/committed_* out of the workers SQLite columns into in-memory WorkerHealthTracker / WorkerCommitTracker, eliminating a per-heartbeat writer transaction that was bloating the WAL and starving dashboard reads. Migration 0042 drops the now dormant columns; ListJobs no longer materializes a 26k-row snapshot, slims JobRow 22 to 13 fields, and serves directly from indexed SQL. Fixes a 212s ListJobs symptom seen on prod.

Move worker last_heartbeat_ms/healthy/active/consecutive_failures/committed_*
out of the workers SQLite columns into in-memory WorkerHealthTracker /
WorkerCommitTracker, eliminating a per-heartbeat writer transaction that was
bloating the WAL and starving dashboard reads. Migration 0042 drops the now
dormant columns; ListJobs no longer materializes a 26k-row snapshot, slims
JobRow 22->13 fields, and serves directly from indexed SQL. Fixes a 212s
ListJobs symptom seen on prod.
@rjpower rjpower added the agent-generated Created by automation/agent label May 8, 2026
@claude

claude Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Code Review

  • Skip-check (closed/draft/already-reviewed)
  • Locate relevant CLAUDE.md / AGENTS.md files
  • Summarize PR changes
  • Run 4 parallel reviewers (2x AGENTS compliance, 2x bug-hunters)
  • Validate each flagged issue
  • Post inline comments for confirmed issues + summary

View job run

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2acba3ab02

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/controller/db.py
Comment thread lib/iris/src/iris/cluster/controller/service.py Outdated
rjpower and others added 5 commits May 8, 2026 02:26
Worker liveness state (healthy/active/consecutive_failures/last_heartbeat,
committed_*) now lives natively on a single public WorkerLiveness dataclass
inside WorkerHealthTracker. Readers query the tracker via tracker.liveness()
or tracker.liveness_many() instead of relying on the _hydrate_worker_detail
overlay seam.

WorkerCommitTracker is folded into WorkerHealthTracker so a single mutex
covers every per-worker transient signal. Stale ExtraField defaults for
healthy / active / consecutive_failures / last_heartbeat / committed_* /
available_* are removed from WORKER_ROW_PROJECTION and WORKER_DETAIL_PROJECTION;
WorkerRow / WorkerDetailRow now carry only durable identity columns plus the
one real ExtraField (attributes, sourced from worker_attributes).

A new SchedulableWorker dataclass (in db.py) bundles the durable WorkerRow
columns with the live committed totals for the scheduler hot path, replacing
the hydrate-then-decorate dance in healthy_active_workers_with_attributes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…istTasks timing

The previous tune-up moved both transient liveness and committed_* resource
totals out of the workers SQLite row into WorkerHealthTracker. Only the
former belongs there: liveness is heartbeat-driven and rewriting it on every
ping would serialize through the writer connection. committed_* is durable
scheduling state — only mutated by the scheduler under a write transaction
— and must survive controller restart so worker capacity stays correct.

Fix 1: restore the four committed_* columns on the workers table. Migration
0042 now drops only the four transient liveness columns. WorkerStore's
add_committed_resources / decommit_resources take a transaction cursor and
issue SQL UPDATEs (the same MAX(0, …) idiom migration 0039 already uses).
SchedulableWorker reads committed_* straight from WorkerRow and drops the
unused healthy field; the scheduler's "if w.healthy" filter goes with it
since healthy_active_workers_with_attributes already filters.

Fix 2: a fresh in-memory tracker would otherwise hide every persisted
worker from the scheduler until each one pinged back. ControllerStore now
seeds liveness from the workers table at boot (and on checkpoint reopen)
by issuing a heartbeat for every persisted worker_id. Added a regression
test that constructs a store against a DB with rows and asserts both that
liveness is populated and that healthy_active_workers_with_attributes
returns those workers immediately.

Fix 3: _tasks_for_listing left task.attempts empty, so task_to_proto's
started_at / finished_at and retry status text never made it into the
ListTasks response. The fixed query JOINs only the current attempt via
"(ta.task_id, ta.attempt_id) IN (SELECT t.task_id, t.current_attempt_id
…)" — at most one row per task, so no IN-clause blowup on long histories.
A new test asserts the proto's started_at is populated and proto.attempts
length is exactly 1.

Goldens regenerated to include the four committed_* columns now visible in
worker dumps.
…re/tracker

WorkerStore.find_prunable's second branch (DB rows missing a tracker entry,
guarded by a 5-minute boot-time grace window) existed only to defend against
"checkpoint restore that never re-registered" — already handled by
ControllerStore._seed_liveness_from_workers running both at construction and
on db.replace_from(). Drop the branch, the ORPHAN_PRUNE_GRACE_MS constant,
and _boot_ms; find_prunable now just scans the tracker for stale
unhealthy/inactive entries. Every workers row has a tracker entry by
construction (seeded at boot, registered on commit of upsert, removed on
commit of remove).

Also trim change-narration that crept back in:

- worker_health.py module docstring: drop the "durable state lives in the
  SQLite workers row" non-sequitur and the "crash recovery: ControllerStore
  seeds…" paragraph (relocated to _seed_liveness_from_workers).
- WorkerStore class/upsert docstrings: corrected the stale claim that
  committed_* lives in the tracker.
- WorkerRow/WorkerDetailRow/SchedulableWorker docstrings: tightened.
- schema.py committed_* column header and pre-WORKER_ROW_PROJECTION block
  comment: trimmed.
- migration 0042 docstring: trimmed.
- transitions.prune_old_data: dropped the redundant "tracker is source of
  truth" comment now that find_prunable's shape makes that obvious.
- service._tasks_for_listing: dropped "loading every attempt was waste"
  retrospective; kept the function's actual contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
connectrpc has no client-side fallback when a server returns UNIMPLEMENTED
on an unrecognized Content-Encoding, so a new client (send=zstd) talking to
an older server (gzip-only) would fail every request during a rolling
deployment. Iris RPC traffic is response-dominated (FetchLogs, list RPCs);
request bodies are small in practice, so the compression on the way out
wasn't buying much.

Set send_compression=None at every client construction site (its default is
gzip — None gives IdentityCompression). Accept-Encoding still advertises
zstd then gzip so the response path keeps the allocation win and stays
compatible with older peers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rjpower rjpower requested a review from yonromai May 8, 2026 17:17
@rjpower rjpower enabled auto-merge (squash) May 8, 2026 17:17
@rjpower rjpower disabled auto-merge May 8, 2026 17:41
@rjpower rjpower merged commit 040f975 into main May 8, 2026
35 checks passed
@rjpower rjpower deleted the rjpower/iris-perf-tuneup branch May 8, 2026 17:41
rjpower added a commit that referenced this pull request May 8, 2026
)

PR #5559 dropped the four committed_* columns from the worker upsert
INSERT, relying on DEFAULT 0. Fresh DBs declare those columns NOT NULL
DEFAULT 0, but DBs created before #4280 (e.g. marin-dev) still carry the
original NOT-NULL-only DDL, so the INSERT fails and no worker can
register. List the columns explicitly with literal 0s; ON CONFLICT DO
UPDATE is unchanged so the scheduler's running totals are never
clobbered on re-register.
rjpower added a commit that referenced this pull request May 8, 2026
…ion (#5569)

PR #5559 dropped healthy/active columns from the workers SQLite table,
breaking the status page's ExecuteRawQuery aggregation (ExecuteRawQuery
500: no such column: healthy). Switch the workers source to the
ListWorkers Connect RPC, paginate at 1000/page, and aggregate
client-side. CPU/memory totals now reflect healthy worker capacity
rather than committed-subtracted "available", since the proto does not
carry committed_*.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant