[iris] Chunk fail_heartbeats_batch to release SQLite writer between groups #4845

Merged
rjpower merged 1 commit into main from iris-fail-heartbeats-chunk on Apr 16, 2026

@rjpower (Collaborator) commented Apr 16, 2026

Previously one transaction covered the entire failure batch, holding the SQLite writer for ~30s when a zone-wide failure reaped hundreds of workers and starving RegisterEndpoint, Register, LaunchJob, and apply_heartbeats_batch behind BEGIN IMMEDIATE. Commit per chunk (default 10) so concurrent writers can interleave; heartbeat failures are idempotent so partial progress on crash is safe. Also bumps the worker's controller-client timeout from 5s to 60s so workers ride out residual slow RPCs rather than abandoning them. Benchmark adds tail-latency A/B probes; x50 run shows max add_endpoint latency during a failure batch drop from 5.3s to 2.4s, drain_dispatch_all from 5.4s to 2.9s, batch wall time unchanged.
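
A minimal sketch of the commit-per-chunk pattern described above, assuming a hypothetical `workers` table and Python's standard `sqlite3` module. The names `fail_workers_chunked`, `chunked`, and `CHUNK_SIZE` are illustrative and are not the actual `fail_heartbeats_batch` code in iris.

```python
import sqlite3
from typing import Iterator, Sequence

CHUNK_SIZE = 10  # illustrative; mirrors the default chunk size named in this PR


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items; each item appears once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fail_workers_chunked(
    conn: sqlite3.Connection,
    worker_ids: Sequence[str],
    chunk_size: int = CHUNK_SIZE,
) -> int:
    """Mark workers inactive, one transaction per chunk.

    Expects a connection opened with isolation_level=None so that
    BEGIN/COMMIT are issued explicitly. Returns the number of commits.
    """
    commits = 0
    for chunk in chunked(worker_ids, chunk_size):
        # BEGIN IMMEDIATE takes the write lock up front; it is released at
        # COMMIT, which gives other writers a window between chunks.
        conn.execute("BEGIN IMMEDIATE")
        try:
            for worker_id in chunk:
                conn.execute(
                    "UPDATE workers SET active = 0 WHERE worker_id = ?",
                    (worker_id,),
                )
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
        commits += 1
    return commits
```

With 25 worker ids and a chunk size of 10, this sketch issues three transactions (10, 10, and 5 workers) instead of one long one, so the writer lock is never held for longer than a single chunk takes.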

Previously one transaction covered the entire failure batch, holding the
SQLite writer for ~30s when a zone-wide failure reaped hundreds of workers
and starving RegisterEndpoint, Register, LaunchJob, and apply_heartbeats_batch
behind BEGIN IMMEDIATE. Commit per FAIL_HEARTBEATS_CHUNK_SIZE=10 workers so
concurrent writers can interleave; heartbeat failures are idempotent so
partial progress on crash is safe.

Bumps worker's controller-client timeout from 5s to 60s so workers ride out
the residual slow RPCs rather than abandoning them.

Benchmark script adds _measure_tail_latency / _measure_drain_tail_latency
probes with A/B (1-txn vs chunk=10) against a cloned snapshot. x50 run
shows max add_endpoint latency during a failure batch drop from 5.3s to
2.4s; drain_dispatch_all drops from 5.4s to 2.9s; batch wall time
unchanged.
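
A rough sketch of how a tail-latency probe of this kind could be structured, using only the standard library. This is not the benchmark script from the PR; `measure_tail_latency`, `batch`, and `probe` are hypothetical names, and the two callables stand in for the failure batch and a concurrent operation such as add_endpoint.

```python
import threading
import time
from typing import Callable, List, Tuple


def measure_tail_latency(
    batch: Callable[[], None],
    probe: Callable[[], None],
    interval_s: float = 0.001,
) -> Tuple[float, float]:
    """Run `batch` once while timing repeated `probe` calls.

    Returns (max probe latency, batch wall time), both in seconds.
    """
    stop = threading.Event()
    latencies: List[float] = []  # written only by the probe thread
    wall: List[float] = []       # written only by the batch thread

    def probe_loop() -> None:
        while not stop.is_set():
            t0 = time.perf_counter()
            probe()
            latencies.append(time.perf_counter() - t0)
            stop.wait(interval_s)  # short pause that wakes early on stop

    def batch_loop() -> None:
        t0 = time.perf_counter()
        try:
            batch()
        finally:
            wall.append(time.perf_counter() - t0)
            stop.set()  # tell the probe thread to finish

    threads = [
        threading.Thread(target=probe_loop),
        threading.Thread(target=batch_loop),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Both lists are read only after both threads have been joined.
    return max(latencies, default=0.0), wall[0]
```

An A/B comparison would call this twice against identical copies of the data, once with a single-transaction batch and once with the chunked batch, and compare the two maxima.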
@rjpower added the agent-generated label (Created by automation/agent) on Apr 16, 2026
@claude (bot, Contributor) commented Apr 16, 2026

Claude finished @rjpower's task in 4m 3s.


Code review complete

  • Check PR eligibility for review
  • Find relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Run parallel review agents (CLAUDE.md + bugs)
  • Validate flagged issues
  • Post review as summary comment

No issues found. Checked for bugs and CLAUDE.md compliance. Full review posted as a separate comment on this PR.

Highlights of what was verified:

  • Chunked loop in fail_heartbeats_batch covers each failure exactly once (no drop / double-count).
  • New imports resolve: TERMINAL_JOB_STATES (iris.cluster.types), EndpointRow (iris.cluster.controller.schema), DispatchBatch / FAIL_HEARTBEATS_CHUNK_SIZE (iris.cluster.controller.transitions).
  • Benchmark threading is race-free: each list is written by a single thread, read only after both join()s; stop is a threading.Event.
  • Idempotency claim holds: _record_heartbeat_failure guards on active = 1, so chunked commits + replays are safe.

Branch: iris-fail-heartbeats-chunk

@rjpower requested a review from ravwojdyla on April 16, 2026, 22:27
@claude (bot, Contributor) commented Apr 16, 2026

### Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notes on what was verified:

  • Chunk slicing in fail_heartbeats_batch covers each failure exactly once (no drop / double-count).
  • New imports resolve: TERMINAL_JOB_STATES at lib/iris/src/iris/cluster/types.py:590, EndpointRow at lib/iris/src/iris/cluster/controller/schema.py:1636, DispatchBatch / FAIL_HEARTBEATS_CHUNK_SIZE at lib/iris/src/iris/cluster/controller/transitions.py.
  • Benchmark threading is race-free: each list is written by a single thread and only read after both join()s; stop is a threading.Event.
  • Idempotency claim holds: _record_heartbeat_failure guards on active = 1, so replays are safe.
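
The idempotency point above can be illustrated with a small sketch. It assumes a hypothetical `workers` table with `active` and `failures` columns; `record_failure` is an illustrative stand-in, not the real `_record_heartbeat_failure`.

```python
import sqlite3


def record_failure(conn: sqlite3.Connection, worker_id: str) -> bool:
    """Deactivate a worker and count the failure, but only if it is still active.

    The `AND active = 1` guard makes a replay a no-op: a second call for the
    same worker matches zero rows, so nothing is counted twice.
    Returns True if this call changed the row.
    """
    cur = conn.execute(
        "UPDATE workers SET active = 0, failures = failures + 1 "
        "WHERE worker_id = ? AND active = 1",
        (worker_id,),
    )
    return cur.rowcount == 1
```

If a crash lands between two chunk commits, rerunning the whole batch under this guard only affects workers that were not yet processed.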

@rjpower merged commit 26c5b61 into main on Apr 16, 2026
43 checks passed
@rjpower deleted the iris-fail-heartbeats-chunk branch on April 16, 2026, 22:38