
Log shard retries as debug #4718

Merged
ravwojdyla merged 1 commit into main from rav-shards-debug
Apr 14, 2026

@ravwojdyla ravwojdyla requested a review from rjpower April 14, 2026 02:14
@rjpower
Collaborator

rjpower commented Apr 14, 2026

I'm torn on this; it feels important to surface. I guess you're seeing this because of lots of preemptions?

We could rate limit to N per minute with the token bucket rate limiter if you wanted perhaps?
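For reference, the rate-limiting idea could look roughly like the sketch below. It is a minimal token bucket written from scratch, not the project's existing limiter (whose API is not shown in this thread); `TokenBucket`, `try_acquire`, and `log_shard_retry` are hypothetical names used only for illustration.

```python
import logging
import time

logger = logging.getLogger(__name__)


class TokenBucket:
    """Hypothetical limiter: allows up to `capacity` events per `period` seconds."""

    def __init__(self, capacity: int, period: float = 60.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = capacity / period  # tokens regained per second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def log_shard_retry(bucket: TokenBucket, shard: int, attempts: int) -> bool:
    """Log a retry only while the bucket has tokens; report whether it was logged."""
    if bucket.try_acquire():
        logger.warning("Shard %d retried (attempt %d)", shard, attempts)
        return True
    return False
```

With `TokenBucket(capacity=N)`, at most N retry messages go out in a burst, and further messages are dropped until tokens refill at N per minute.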

@ravwojdyla
Contributor Author

ravwojdyla commented Apr 14, 2026

@rjpower loads of preemptions and a large number of shards.

@ravwojdyla
Contributor Author

@rjpower what part of that log is important?

  • The fact that there are any retries?
  • Retry count per specific shard?

How about we change it to a count of shards per retry count? So a log like:

Shards retried (shard: attempts): {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2, 28: 1, 29: 2, 30: 2, 31: 2, 32: 2, 33: 2, 34: 2, 35: 2, 36: 2, 37: 2, 38: 2, 39: 2, 40: 2, 41: 2, 42: 1}

Would change into:

Shards retried (retry: count of shards): 1: 2, 2: 41

@rjpower
Collaborator

rjpower commented Apr 14, 2026

Just a notification that we're having failures seems useful. I'm fine if we compress them, since in theory the raw info is on Iris.

Instead of logging every shard's attempt count, log the histogram of
attempts -> shard count. Keeps retry visibility without the per-shard
noise on large jobs with many preemptions.
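
The histogram described above can be sketched as follows. This is an illustration only, not the merged change: `retry_histogram` and `attempts_by_shard` are hypothetical names, and the input mirrors the sample log quoted earlier in the thread (shards 0 through 42, with shards 28 and 42 at one attempt).

```python
from collections import Counter


def retry_histogram(attempts_by_shard: dict[int, int]) -> dict[int, int]:
    """Collapse {shard: attempts} into {attempts: number of shards}, sorted by attempts."""
    return dict(sorted(Counter(attempts_by_shard.values()).items()))


# Rebuild the per-shard sample from the thread: every shard at 2 attempts,
# except shards 28 and 42, which are at 1.
attempts_by_shard = {shard: 2 for shard in range(43)}
attempts_by_shard[28] = 1
attempts_by_shard[42] = 1

summary = retry_histogram(attempts_by_shard)
print(f"Shards retried (attempts: count of shards): {summary}")
```

The summary grows with the number of distinct attempt counts rather than with the number of shards, which is what keeps the line short on large jobs.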
@ravwojdyla ravwojdyla enabled auto-merge (squash) April 14, 2026 22:38
@ravwojdyla ravwojdyla merged commit 8305dc8 into main Apr 14, 2026
37 checks passed
@ravwojdyla ravwojdyla deleted the rav-shards-debug branch April 14, 2026 22:43