Skip to content

[iris] Add heartbeat and slice-lifecycle debug logging#4796

Merged
rjpower merged 1 commit intomainfrom
heartbeat-logging-debug
Apr 15, 2026
Merged

[iris] Add heartbeat and slice-lifecycle debug logging#4796
rjpower merged 1 commit intomainfrom
heartbeat-logging-debug

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 15, 2026

Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. The worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices.

Complements #4792 and #4793.

Per-task slow_log timers on submit_task (500ms) and synchronous kill
(2000ms) inside handle_heartbeat identify which task stalls a heartbeat.
Worker service heartbeat entrypoint gets an outer slow_log (1000ms) and
a DEBUG payload-size line to correlate with controller-side sync timing.
Slice ready/failed transitions log registered worker counts and ids to
expose partial bootstrap on large slices.

Complements #4792 and #4793.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 15, 2026
@rjpower rjpower enabled auto-merge (squash) April 15, 2026 22:12
@rjpower rjpower requested a review from dlwh April 15, 2026 22:12
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @rjpower's task in 1m 14s —— View job


PR Review

  • Read changed files and PR diff
  • Analyze debug logging additions
  • Post review feedback

Small, well-scoped debug/observability change. The additions are consistent with the existing slow_log/logger.info patterns already in handle_heartbeat and scaling_group. Nothing blocking — a few optional notes below.

Nits / optional

1. worker_ids logged as a raw Python listlib/iris/src/iris/cluster/controller/autoscaler/scaling_group.py:469-475 and :489-496

On large slices (hundreds of workers) the log line becomes a very long ['w-abc', 'w-def', ...] string. That's the stated intent (partial bootstrap diagnosis), but on the hot logger-info path it may be noisy. Two lightweight options if you want to keep it readable:

  • log just the count at INFO and the full list at DEBUG; or
  • truncate like worker_ids[:16] + "(+%d more)" when len > 16.

Either way, fine to defer — just flagging since "n_workers" is already present and the list is what will blow up the line length.

2. slice ready repeats on every callscaling_group.py:469

mark_slice_ready is called when transitioning to READY. If it gets called redundantly (already READY), the log will re-emit. Worth a quick check that this is only called on the transition edge, not every reconcile tick. A one-line guard on state.lifecycle != SliceLifecycleState.READY would make it strictly an edge log.

3. f-string in tight looplib/iris/src/iris/cluster/worker/worker.py:737, 751

with slow_log(logger, f"heartbeat submit_task[{run_req.task_id}]", threshold_ms=500):

The f-string is evaluated on every iteration regardless of whether the threshold trips. Overhead is trivial compared to submit_task, so this is purely stylistic — but if you want to match the lazy-eval pattern used elsewhere, slow_log could accept *args for %-style lazy formatting. Not worth changing just for this PR.

4. DEBUG payload log placementlib/iris/src/iris/cluster/worker/service.py:115-121

The debug log sits after chaos injection. If chaos raises, you lose the payload size info for that call. For correlation purposes you'd probably want it before chaos so every received request gets sized. Minor — the outer slow_log still fires via the context manager, so you'll at least see elapsed time.

LGTM

  • Moving list(state.worker_ids) capture inside the _slices_lock in mark_slice_failed is the right call — avoids racing the read against a later writer. Good catch.
  • Per-task slow_log thresholds (500ms submit, 2000ms sync kill) look well-calibrated relative to the outer handle_heartbeat 2000ms and heartbeat kill_tasks 5000ms budgets.
  • request.ByteSize() on the proto is the right API for correlating with controller-side sync timing.
    · branch: heartbeat-logging-debug

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25099356b1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/worker/service.py
@rjpower rjpower disabled auto-merge April 15, 2026 22:31
@rjpower rjpower enabled auto-merge (squash) April 15, 2026 22:36
@rjpower rjpower merged commit 46ecb2e into main Apr 15, 2026
45 of 46 checks passed
@rjpower rjpower deleted the heartbeat-logging-debug branch April 15, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants