ganeshvanahalli
Contributor

This PR adds a way to detect when MEL has been stuck in the same state for a long time. It introduces a new config option

--node.message-extraction.stall-tolerance

which sets the maximum number of times MEL is allowed to be stuck before the gauge metric arb/mel/stuck is set to 1 and an error is logged. This PR also cleans up the mel struct by removing its retryInterval field and using the RetryInterval field from MessageExtractionConfig instead.
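The stall check described above can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the `stallDetector` type, its `observe` method, and the string-valued state are hypothetical names chosen for illustration, and the sketch assumes that "stuck" means the tolerance has been exceeded by consecutive observations of the same state. The `stuck` return value stands in for setting the arb/mel/stuck gauge to 1 or 0.

```go
package main

import "fmt"

// stallDetector is a hypothetical helper (not from the PR) that counts how
// many consecutive times the same state has been observed.
type stallDetector struct {
	tolerance uint64 // assumed meaning: repeats allowed before reporting stuck
	lastState string
	seen      bool
	repeats   uint64
}

func newStallDetector(tolerance uint64) *stallDetector {
	return &stallDetector{tolerance: tolerance}
}

// observe records the current state and reports whether the detector
// considers the FSM stuck. A state change resets the repeat counter.
func (d *stallDetector) observe(state string) (stuck bool) {
	if !d.seen || state != d.lastState {
		d.seen = true
		d.lastState = state
		d.repeats = 0
		return false
	}
	d.repeats++
	return d.repeats > d.tolerance
}

func main() {
	d := newStallDetector(2)
	for _, s := range []string{"A", "A", "A", "A", "B"} {
		gauge := 0
		if d.observe(s) {
			gauge = 1 // stands in for updating the arb/mel/stuck gauge
			fmt.Println("error: FSM appears stuck in state", s)
		}
		fmt.Println("state:", s, "gauge:", gauge)
	}
}
```

In this sketch the gauge value drops back to 0 as soon as the state changes, which matches the 1-stuck, 0-not_stuck reading of the metric; whether the real implementation resets in the same way is an assumption.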

Resolves NIT-3392

codecov bot commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 19.44444% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 24.34%. Comparing base (0815853) to head (5b48725).

Additional details and impacted files
@@                    Coverage Diff                     @@
##           raul/mel-inbox-reading    #3610      +/-   ##
==========================================================
+ Coverage                   24.31%   24.34%   +0.03%     
==========================================================
  Files                         395      395              
  Lines                       59767    59773       +6     
==========================================================
+ Hits                        14532    14553      +21     
+ Misses                      43003    42976      -27     
- Partials                     2232     2244      +12     

f.Int(prefix+".delayed-message-backlog-capacity", DefaultMessageExtractionConfig.DelayedMessageBacklogCapacity, "target capacity of the delayed message backlog")
f.Uint64(prefix+".blocks-to-prefetch", DefaultMessageExtractionConfig.BlocksToPrefetch, "the number of blocks to prefetch relevant logs from")
f.String(prefix+".read-mode", DefaultMessageExtractionConfig.ReadMode, "mode to only read latest or safe or finalized L1 blocks. Enabling safe or finalized disables feed input and output. Defaults to latest. Takes string input, valid strings- latest, safe, finalized")
f.Uint64(prefix+".stall-tolerance", DefaultMessageExtractionConfig.StallTolerance, "max number of times the MEL fsm is allowed to be stuck in the same state before logging an error and firing the ")
Member

Shorten the description: "max times the MEL fsm is allowed to be stuck without logging error" is plenty.

)

var (
stuckFSMIndicatingGauge = metrics.NewRegisteredGauge("arb/mel/stuck", nil) // 1-stuck, 0-not_stuck. TODO: once this is merged into master notify SRE to create an alert
Member

This TODO has nothing to do with the source code and should be removed.

Member

@joshuacolvin0 joshuacolvin0 left a comment

LGTM

@joshuacolvin0 joshuacolvin0 merged commit 06ab4f1 into raul/mel-inbox-reading Sep 22, 2025
13 of 14 checks passed
@joshuacolvin0 joshuacolvin0 deleted the add-alerting-stallingmel branch September 22, 2025 13:49