Commit 11d9a0a

zephyr: log shard retries as histogram of attempts
Instead of logging every shard's attempt count, log a histogram that maps attempt count to number of shards. This keeps retry visibility without the per-shard noise on large jobs with many preemptions. The log level of the message also changes from warning to info.
1 parent f25986a commit 11d9a0a

1 file changed, +3 -2 lines changed

lib/zephyr/src/zephyr/execution.py

Lines changed: 3 additions & 2 deletions
@@ -26,7 +26,7 @@
 import time
 import traceback
 import uuid
-from collections import defaultdict, deque
+from collections import Counter, defaultdict, deque
 from concurrent.futures import ThreadPoolExecutor
 from collections.abc import Callable, Iterable, Iterator
 from contextlib import suppress
@@ -513,7 +513,8 @@ def _log_status(self) -> None:
             dead,
         )
         if retried:
-            logger.warning("[%s] Shards retried (shard: attempts): %s", self._execution_id, retried)
+            attempts_histogram = dict(sorted(Counter(retried.values()).items()))
+            logger.info("[%s] Shards retried (attempts: shard count): %s", self._execution_id, attempts_histogram)
 
     def _record_shard_failure(self, worker_id: str, error_info: str | None = None) -> bool:
         """Record a failure for the worker's in-flight shard. Must be called with lock held.
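
To illustrate what the new log line reports, here is a minimal standalone sketch of the histogram expression from the diff. The helper name `attempts_histogram` as a function and the sample `retried` mapping are hypothetical; the commit only shows that `retried` is a mapping whose values are attempt counts, so the integer shard keys below are an assumption.

```python
from collections import Counter


def attempts_histogram(retried: dict[int, int]) -> dict[int, int]:
    """Collapse a per-shard attempts mapping into attempts -> shard count.

    Same expression as in the diff: count how often each attempt value
    occurs, then sort by attempt count so the log output is stable.
    """
    return dict(sorted(Counter(retried.values()).items()))


# Hypothetical sample data: shard index -> number of attempts.
retried = {0: 2, 3: 2, 7: 3, 12: 2, 15: 5}

histogram = attempts_histogram(retried)
print(histogram)  # {2: 3, 3: 1, 5: 1}
```

With five retried shards the saving is small, but the histogram has at most one entry per distinct attempt count, so its size stays bounded by the retry limit rather than growing with the number of shards.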
