Add metrics and error logging if the MEL FSM is stuck in same state for a long time #3610
Codecov Report
@@ Coverage Diff @@
## raul/mel-inbox-reading #3610 +/- ##
==========================================================
+ Coverage 24.31% 24.34% +0.03%
==========================================================
Files 395 395
Lines 59767 59773 +6
==========================================================
+ Hits 14532 14553 +21
+ Misses 43003 42976 -27
- Partials 2232 2244 +12
arbnode/mel/runner/mel.go
f.Int(prefix+".delayed-message-backlog-capacity", DefaultMessageExtractionConfig.DelayedMessageBacklogCapacity, "target capacity of the delayed message backlog")
f.Uint64(prefix+".blocks-to-prefetch", DefaultMessageExtractionConfig.BlocksToPrefetch, "the number of blocks to prefetch relevant logs from")
f.String(prefix+".read-mode", DefaultMessageExtractionConfig.ReadMode, "mode to only read latest or safe or finalized L1 blocks. Enabling safe or finalized disables feed input and output. Defaults to latest. Takes string input, valid strings- latest, safe, finalized")
f.Uint64(prefix+".stall-tolerance", DefaultMessageExtractionConfig.StallTolerance, "max number of times the MEL fsm is allowed to be stuck in the same state before logging an error and firing the ")
Shorten the description: "max times the MEL fsm is allowed to be stuck without logging error" is plenty.
arbnode/mel/runner/mel.go
)

var (
stuckFSMIndicatingGauge = metrics.NewRegisteredGauge("arb/mel/stuck", nil) // 1-stuck, 0-not_stuck. TODO: once this is merged into master notify SRE to create an alert
This TODO has nothing to do with the source code and should be removed.
LGTM
This PR adds a way to detect if MEL has been stuck in the same state for a long time. It introduces a new config option that represents the max number of times MEL is allowed to be stuck before the metric arb/mel/stuck (gauge) is set to 1 and an error log is emitted. This PR also cleans up the mel struct by removing its retryInterval field and using the RetryInterval field from MessageExtractionConfig instead.

Resolves NIT-3392
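
The stall detection described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: the stallDetector type, its observe method, and the plain integer standing in for the arb/mel/stuck gauge are all hypothetical names. It also assumes that "stuck" means the FSM reported the same state on consecutive steps, and that the error fires once the repeat count exceeds the tolerance.

```go
package main

import (
	"fmt"
	"log"
)

// stallDetector is a hypothetical helper (not the type used in the PR).
// It counts how many consecutive steps left the FSM in the same state.
type stallDetector struct {
	tolerance uint64 // assumed meaning of stall-tolerance: max allowed repeats
	lastState string // state seen on the previous step
	repeats   uint64 // consecutive steps that did not change the state
	stuck     int64  // stand-in for the arb/mel/stuck gauge: 1 = stuck, 0 = not stuck
}

// observe records the state after one FSM step and reports whether the
// FSM is currently considered stuck.
func (d *stallDetector) observe(state string) bool {
	if state != d.lastState {
		// Progress was made: reset the counter and clear the gauge.
		d.lastState = state
		d.repeats = 0
		d.stuck = 0
		return false
	}
	d.repeats++
	if d.repeats > d.tolerance {
		d.stuck = 1
		log.Printf("ERROR: FSM stuck in state %q for %d consecutive steps", state, d.repeats)
		return true
	}
	return false
}

func main() {
	d := &stallDetector{tolerance: 2}
	// The state names here are placeholders, not the real MEL FSM states.
	for _, s := range []string{"A", "A", "A", "A", "B"} {
		fmt.Printf("state=%s stuck=%v gauge=%d\n", s, d.observe(s), d.stuck)
	}
}
```

With a tolerance of 2, the gauge stand-in flips to 1 on the third consecutive repeat of a state and drops back to 0 as soon as the state changes, which is the behaviour an alert on the gauge would rely on.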