[iris] Add heartbeat and slice-lifecycle debug logging #4796
Per-task slow_log timers on submit_task (500ms) and synchronous kill (2000ms) inside handle_heartbeat identify which task stalls a heartbeat. Worker service heartbeat entrypoint gets an outer slow_log (1000ms) and a DEBUG payload-size line to correlate with controller-side sync timing. Slice ready/failed transitions log registered worker counts and ids to expose partial bootstrap on large slices. Complements #4792 and #4793.
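For readers unfamiliar with the pattern, here is a minimal sketch of a threshold timer in the style described above. It assumes slow_log is a context manager taking (logger, label, threshold_ms); the body, the WARNING level, and the message format are illustrative assumptions, not the repository's actual helper.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def slow_log(logger, label, threshold_ms=1000):
    """Hypothetical stand-in: warn when the wrapped block exceeds threshold_ms."""
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs even if the block raises, so a failing slow call is still reported.
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms >= threshold_ms:
            logger.warning(
                "slow operation: %s took %.0fms (threshold %dms)",
                label, elapsed_ms, threshold_ms,
            )


def submit_all(logger, task_ids, submit):
    """Wrap each task separately so the log line names the task that stalled.

    `submit` is a hypothetical callable standing in for the real submit_task.
    """
    for task_id in task_ids:
        with slow_log(logger, f"heartbeat submit_task[{task_id}]", threshold_ms=500):
            submit(task_id)
```

Wrapping each task rather than the whole loop is what lets the log identify the stalling task; the outer 1000ms timer then covers the heartbeat as a whole.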
Claude finished @rjpower's task in 1m 14s
PR Review
Small, well-scoped debug/observability change. The additions are consistent with the existing
Nits / optional
1. On large slices (hundreds of workers) the log line becomes a very long
Either way, fine to defer; just flagging since "n_workers" is already present and the list is what will blow up the line length.
2.
3. f-string in tight loop: with slow_log(logger, f"heartbeat submit_task[{run_req.task_id}]", threshold_ms=500):
The f-string is evaluated on every iteration regardless of whether the threshold trips. Overhead is trivial compared to
4. DEBUG payload log placement: The debug log sits after chaos injection. If chaos raises, you lose the payload size info for that call. For correlation purposes you'd probably want it before chaos so every received request gets sized. Minor; the outer
LGTM
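On nit 1, one way to keep the worker-id list from dominating the log line is to cap how many ids are printed while still reporting the full count. This is only a sketch of that idea; the helper name summarize_ids and the max_shown default are hypothetical, and it may not match the change the review had in mind.

```python
def summarize_ids(ids, max_shown=10):
    """Render at most max_shown ids, noting how many were left out."""
    ids = list(ids)
    shown = ", ".join(str(i) for i in ids[:max_shown])
    hidden = len(ids) - max_shown
    if hidden > 0:
        return f"[{shown}, ... +{hidden} more]"
    return f"[{shown}]"
```

The full count would still come from the existing "n_workers" field, so truncating the list loses no information about partial bootstrap beyond the individual ids past the cap.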
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25099356b1