[iris] Add heartbeat and slice-lifecycle debug logging by rjpower · Pull Request #4796 · marin-community/marin

rjpower · 2026-04-15T22:12:08Z

Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. The worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices.

Complements #4792 and #4793.

Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. Worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices. Complements #4792 and #4793.

claude · 2026-04-15T22:12:43Z

Claude finished @rjpower's task in 1m 14s —— View job

PR Review

Read changed files and PR diff
Analyze debug logging additions
Post review feedback

Small, well-scoped debug/observability change. The additions are consistent with the existing slow_log/logger.info patterns already in handle_heartbeat and scaling_group. Nothing blocking — a few optional notes below.

Nits / optional

1. worker_ids logged as a raw Python list — lib/iris/src/iris/cluster/controller/autoscaler/scaling_group.py:469-475 and :489-496

On large slices (hundreds of workers) the log line becomes a very long ['w-abc', 'w-def', ...] string. That's the stated intent (partial bootstrap diagnosis), but on the hot logger-info path it may be noisy. Two lightweight options if you want to keep it readable:

log just the count at INFO and the full list at DEBUG; or
truncate like worker_ids[:16] + "(+%d more)" when len > 16.

Either way, fine to defer — just flagging since "n_workers" is already present and the list is what will blow up the line length.

2. slice ready repeats on every call — scaling_group.py:469

mark_slice_ready is called when transitioning to READY. If it gets called redundantly (already READY), the log will re-emit. Worth a quick check that this is only called on the transition edge, not every reconcile tick. A one-line guard on state.lifecycle != SliceLifecycleState.READY would make it strictly an edge log.

3. f-string in tight loop — lib/iris/src/iris/cluster/worker/worker.py:737, 751

with slow_log(logger, f"heartbeat submit_task[{run_req.task_id}]", threshold_ms=500):

The f-string is evaluated on every iteration regardless of whether the threshold trips. Overhead is trivial compared to submit_task, so this is purely stylistic — but if you want to match the lazy-eval pattern used elsewhere, slow_log could accept *args for %-style lazy formatting. Not worth changing just for this PR.

4. DEBUG payload log placement — lib/iris/src/iris/cluster/worker/service.py:115-121

The debug log sits after chaos injection. If chaos raises, you lose the payload size info for that call. For correlation purposes you'd probably want it before chaos so every received request gets sized. Minor — the outer slow_log still fires via the context manager, so you'll at least see elapsed time.

LGTM

Moving list(state.worker_ids) capture inside the _slices_lock in mark_slice_failed is the right call — avoids racing the read against a later writer. Good catch.
Per-task slow_log thresholds (500ms submit, 2000ms sync kill) look well-calibrated relative to the outer handle_heartbeat 2000ms and heartbeat kill_tasks 5000ms budgets.
request.ByteSize() on the proto is the right API for correlating with controller-side sync timing.
· branch: heartbeat-logging-debug

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25099356b1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

rjpower added the agent-generated Created by automation/agent label Apr 15, 2026

rjpower enabled auto-merge (squash) April 15, 2026 22:12

rjpower requested a review from dlwh April 15, 2026 22:12

chatgpt-codex-connector bot reviewed Apr 15, 2026

View reviewed changes

Comment thread lib/iris/src/iris/cluster/worker/service.py

rjpower disabled auto-merge April 15, 2026 22:31

rjpower enabled auto-merge (squash) April 15, 2026 22:36

dlwh approved these changes Apr 15, 2026

View reviewed changes

rjpower merged commit 46ecb2e into main Apr 15, 2026
45 of 46 checks passed

rjpower deleted the heartbeat-logging-debug branch April 15, 2026 22:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] Add heartbeat and slice-lifecycle debug logging#4796

[iris] Add heartbeat and slice-lifecycle debug logging#4796
rjpower merged 1 commit intomainfrom
heartbeat-logging-debug

rjpower commented Apr 15, 2026

Uh oh!

claude bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rjpower commented Apr 15, 2026

Uh oh!

claude bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Nits / optional

LGTM

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Apr 15, 2026 •

edited

Loading