fix: health endpoint now detects stopped block production #2800
Conversation
This commit fixes issue #2643, where the health endpoint still reports OK when a node has stopped producing blocks.

Changes:
- Updated HealthServer to accept store, config, and logger dependencies
- Implemented block production monitoring in the Livez endpoint:
  * For aggregator nodes, checks if LastBlockTime is recent
  * Returns WARN if block production is slow (> 3x block time)
  * Returns FAIL if block production has stopped (> 5x block time)
  * Uses LazyBlockInterval for lazy mode aggregators
  * Non-aggregator nodes continue to return PASS
- Added constants for health check thresholds:
  * healthCheckWarnMultiplier = 3
  * healthCheckFailMultiplier = 5
- Added comprehensive unit tests covering all scenarios:
  * Server tests (pkg/rpc/server/server_test.go): non-aggregator nodes; aggregator with no blocks; aggregator with recent blocks (PASS); aggregator with slow production (WARN); aggregator with stopped production (FAIL); lazy aggregator with correct thresholds; error handling
  * Client tests (pkg/rpc/client/client_test.go): non-aggregator returns PASS; aggregator with recent blocks returns PASS; aggregator with slow block production returns WARN; aggregator with stopped block production returns FAIL
- Updated setupTestServer to pass new dependencies
- Added createCustomTestServer helper for testing with custom configs

The thresholds are derived from the node's BlockTime or LazyBlockInterval settings, making the health check adaptive to different node configurations.

Fixes #2643
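A minimal sketch of how the Livez logic described above can fit together. The type and field names here (State, Store, Config, fakeStore in later sketches) are assumptions for illustration; the actual code in pkg/rpc/server/server.go may differ:

```go
package server

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// Hypothetical dependency types, standing in for the PR's real ones.
type State struct {
	LastBlockTime time.Time
}

type Store interface {
	GetState(ctx context.Context) (State, error)
}

type Config struct {
	Aggregator        bool
	LazyMode          bool
	BlockTime         time.Duration
	LazyBlockInterval time.Duration
}

type HealthServer struct {
	store  Store
	config Config
}

const (
	healthCheckWarnMultiplier = 3 // WARN when no block for > 3x block time
	healthCheckFailMultiplier = 5 // FAIL when no block for > 5x block time
)

// Livez reports liveness; only aggregators are judged on block production.
func (s *HealthServer) Livez(w http.ResponseWriter, r *http.Request) {
	if !s.config.Aggregator {
		fmt.Fprint(w, "PASS") // non-aggregators always pass
		return
	}

	// Lazy aggregators intentionally produce blocks on a slower cadence.
	interval := s.config.BlockTime
	if s.config.LazyMode {
		interval = s.config.LazyBlockInterval
	}

	state, err := s.store.GetState(r.Context())
	if err != nil {
		http.Error(w, "FAIL", http.StatusServiceUnavailable)
		return
	}

	elapsed := time.Since(state.LastBlockTime)
	switch {
	case elapsed > interval*healthCheckFailMultiplier:
		http.Error(w, "FAIL", http.StatusServiceUnavailable)
	case elapsed > interval*healthCheckWarnMultiplier:
		fmt.Fprint(w, "WARN")
	default:
		fmt.Fprint(w, "PASS")
	}
}
```

Deriving both thresholds from the configured interval is what makes the check adapt to normal and lazy aggregators alike.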
Claude finished @randygrok's task

Comprehensive Code Review

I've completed a thorough review of the health endpoint implementation. Here's my analysis:

Executive Summary

This PR successfully implements health monitoring for aggregator nodes by detecting stopped block production. The implementation is sound, but there are important considerations around threshold configuration and scope that should be addressed.

1. Code Quality & Architecture

✅ Strengths

Clean separation of concerns: The health check logic is well-organized, with HTTP endpoint handlers in
Appropriate use of Go idioms:
Interface design: The
Summary of Changes

Hello @randygrok, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves issue #2643 by significantly improving the accuracy of the health endpoint for aggregator nodes. Previously, the endpoint might incorrectly report a healthy status even when a node had ceased block production. The updated logic now checks the time since the last block was produced against dynamically calculated warning and failure thresholds, derived from the node's configured block time. This ensures that operational issues related to block generation are promptly detected and reflected in the health status, providing better visibility into node performance and stability.

Highlights
Code Review
This pull request effectively addresses the issue of the health endpoint not detecting stopped block production for aggregator nodes. The core logic is sound, and the changes are accompanied by a comprehensive set of unit tests and a new integration test. My feedback includes suggestions to improve the robustness of the integration test by replacing fixed-duration sleeps with polling, refactoring duplicated test setup code for better maintainability, and adjusting a log level to better reflect the severity of a health check failure.
Force-pushed from 39dc439 to 819b015
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## main #2800 +/- ##
==========================================
+ Coverage 62.61% 62.80% +0.18%
==========================================
Files 82 82
Lines 7334 7371 +37
==========================================
+ Hits 4592 4629 +37
Misses 2197 2197
Partials 545 545
Remove createCustomTestServer function, which was redundant after making setupTestServer accept an optional config parameter. This eliminates 36 lines of duplicated server setup code.

Changes:
- Make setupTestServer accept a variadic config parameter
- Update all test cases to use setupTestServer directly
- Remove createCustomTestServer function entirely

Result: -21 net lines of code, improved maintainability, cleaner API.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
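The variadic pattern from this commit can be sketched as follows, reusing the hypothetical HealthServer, Config, and State types from the earlier sketch (fakeStore and the default config are assumptions, not the PR's actual test code):

```go
// fakeStore always reports a fixed last-block time; a stand-in for the real store.
type fakeStore struct{ last time.Time }

func (f fakeStore) GetState(context.Context) (State, error) {
	return State{LastBlockTime: f.last}, nil
}

// setupTestServer builds a test server, falling back to a default config when
// the caller passes none. The variadic parameter keeps existing call sites
// (setupTestServer(t)) compiling while replacing createCustomTestServer.
func setupTestServer(t *testing.T, cfgs ...Config) *httptest.Server {
	t.Helper()

	cfg := Config{} // assumed non-aggregator default
	if len(cfgs) > 0 {
		cfg = cfgs[0]
	}

	srv := &HealthServer{config: cfg, store: fakeStore{last: time.Now()}}
	ts := httptest.NewServer(http.HandlerFunc(srv.Livez))
	t.Cleanup(ts.Close)
	return ts
}
```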
…n test

Replace fixed-duration time.Sleep calls with require.Eventually polling to make the health endpoint integration test more robust and less flaky.

Changes:
- Use require.Eventually to poll for health state transitions
- Poll every 100ms instead of fixed sleeps
- Generous timeouts (5s and 10s) that terminate early on success
- Better error handling during polling

Benefits:
- More resilient to timing variations in CI/CD environments
- Faster test execution (completes as soon as conditions are met)
- Eliminates magic numbers (1700ms compensation)
- Expresses intent clearly (wait until condition is met)
- Non-flaky (tested 3x consecutively)
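testify's require.Eventually retries a condition at a fixed tick until a timeout expires, which is what replaces the fixed sleeps here. A sketch of such a polling test, reusing the hypothetical helpers above (the route and config values are assumptions):

```go
func TestLivezEventuallyFails(t *testing.T) {
	ts := setupTestServer(t, Config{
		Aggregator: true,
		BlockTime:  100 * time.Millisecond,
	})

	// Poll every 100ms, up to 10s, returning as soon as the endpoint
	// reports FAIL; no fixed sleep or magic compensation is needed.
	require.Eventually(t, func() bool {
		resp, err := http.Get(ts.URL + "/health/live")
		if err != nil {
			return false
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		return err == nil && strings.Contains(string(body), "FAIL")
	}, 10*time.Second, 100*time.Millisecond)
}
```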
pkg/rpc/server/server.go (Outdated)

// healthCheckFailMultiplier is the multiplier for block time to determine FAIL threshold
// If no block has been produced in (blockTime * healthCheckFailMultiplier), return FAIL
healthCheckFailMultiplier = 5
we should double check with @auricom if this is good or too fast
This would mean for chains with a block time of 250ms, a node can only tolerate approximately 1 second of random hang time before the sequencer would be flagged as an eligible candidate for hard process termination.
We might improve safety by using an absolute timeout threshold (in seconds) rather than a configuration-relative value. This would provide more predictable failure detection across different chain configurations.
I'd say 3-5 seconds would be ok, BUT I believe we should hear @alpe's perspective on what constitutes an appropriate timeout threshold before a sequencer in an HA cluster is considered stale and leadership is transferred to another.
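To make the suggestion concrete, a fragment combining the PR's relative rule with an absolute floor might look like this (minFailThreshold and the function name are illustrative, not part of the PR):

```go
// failThresholdFor returns the FAIL cutoff, combining the PR's relative rule
// with an absolute floor so fast chains get a minimum grace period.
func failThresholdFor(blockTime time.Duration) time.Duration {
	const minFailThreshold = 5 * time.Second // within the suggested 3-5s range

	// With blockTime = 250ms and a 5x multiplier, the relative rule alone
	// would flag FAIL after only 1.25s of hang time.
	threshold := blockTime * healthCheckFailMultiplier
	if threshold < minFailThreshold {
		threshold = minFailThreshold
	}
	return threshold
}
```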
pkg/rpc/server/server.go (Outdated)

	}
}

// For non-aggregator nodes or if checks pass, return healthy
we should have something for syncing on non-aggregator nodes as well; maybe a follow-up.
utACK, we should make sure this covers @auricom's needs before merging
A healthcheck endpoint is used by infrastructure automation to determine if a process can respond to client requests. AFAIK based on this PR, the existing endpoint did not perform substantive validation.

For a node to be considered healthy, the following conditions should be met:

1. The RPC server is operational and responds correctly to basic client queries per specifications

I cannot verify whether this change effectively validates items 1 and 4. And I may be overly cautious, so I would appreciate your expertise on these points 🧑‍💻
dismissing so we can cover all paths that claude mentioned
thanks @auricom, let me check those points. By your definition only point 2 is covered
… HTTP endpoint

- Removed the HealthService and its related proto definitions.
- Implemented a new HTTP health check endpoint at `/health/live`.
- Updated the Client to use the new HTTP health check instead of the gRPC call.
- Enhanced health check logic to return PASS, WARN, or FAIL based on block production status.
- Modified tests to validate the new health check endpoint and its responses.
- Updated server to register custom HTTP endpoints including the new health check.
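On the client side, the switch from gRPC to HTTP reduces the health check to a plain GET. A minimal sketch (the plain-text PASS/WARN/FAIL body and the function name are assumptions; the PR's actual client code may differ):

```go
// CheckLiveness queries the HTTP health endpoint and returns the reported
// status string.
func CheckLiveness(ctx context.Context, baseURL string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health/live", nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil // "PASS", "WARN", or "FAIL"
}
```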