Conversation

@randygrok
Contributor

Closes #2643

Overview

This commit fixes issue #2643 where the health endpoint still reports
OK when a node has stopped producing blocks.

Changes:
- Updated HealthServer to accept store, config, and logger dependencies
- Implemented block production monitoring in the Livez endpoint:
  * For aggregator nodes, checks if LastBlockTime is recent
  * Returns WARN if block production is slow (> 3x block time)
  * Returns FAIL if block production has stopped (> 5x block time)
  * Uses LazyBlockInterval for lazy mode aggregators
  * Non-aggregator nodes continue to return PASS
- Added constants for health check thresholds:
  * healthCheckWarnMultiplier = 3
  * healthCheckFailMultiplier = 5
- Added comprehensive unit tests covering all scenarios:
  Server tests (pkg/rpc/server/server_test.go):
  * Non-aggregator nodes
  * Aggregator with no blocks
  * Aggregator with recent blocks (PASS)
  * Aggregator with slow production (WARN)
  * Aggregator with stopped production (FAIL)
  * Lazy aggregator with correct thresholds
  * Error handling
  Client tests (pkg/rpc/client/client_test.go):
  * Non-aggregator returns PASS
  * Aggregator with recent blocks returns PASS
  * Aggregator with slow block production returns WARN
  * Aggregator with stopped block production returns FAIL
- Updated setupTestServer to pass new dependencies
- Added createCustomTestServer helper for testing with custom configs

The thresholds scale with the node's BlockTime or LazyBlockInterval
settings, making the health check adaptive to different node
configurations.
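
A minimal sketch of the liveness decision described above, with illustrative parameter names and a lazy-mode flag as assumptions (the real implementation reads these values from the node's config and store in pkg/rpc/server/http.go):

const (
    healthCheckWarnMultiplier = 3
    healthCheckFailMultiplier = 5
)

func livenessStatus(isAggregator, lazyMode bool, blockTime, lazyInterval time.Duration, lastBlockTime time.Time, hasBlocks bool) string {
    if !isAggregator || !hasBlocks {
        return "PASS" // non-aggregators and freshly started nodes stay healthy
    }
    // Lazy aggregators are judged against LazyBlockInterval instead of BlockTime.
    interval := blockTime
    if lazyMode {
        interval = lazyInterval
    }
    elapsed := time.Since(lastBlockTime)
    switch {
    case elapsed > healthCheckFailMultiplier*interval:
        return "FAIL" // block production appears stopped
    case elapsed > healthCheckWarnMultiplier*interval:
        return "WARN" // block production is slow
    default:
        return "PASS"
    }
}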

Fixes #2643
@github-actions
Contributor

github-actions bot commented Nov 2, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ❌ failed (1) · Updated (UTC): Nov 5, 2025, 9:34 AM

@claude
Contributor

claude bot commented Nov 2, 2025

Claude finished @randygrok's task.


Comprehensive Code Review

I've completed a thorough review of the health endpoint implementation. Here's my analysis:


Executive Summary

This PR successfully implements health monitoring for aggregator nodes by detecting stopped block production. The implementation is sound, but there are important considerations around threshold configuration and scope that should be addressed.


1. Code Quality & Architecture

✅ Strengths

Clean separation of concerns: The health check logic is well-organized with HTTP endpoint handlers in http.go (lines 32-85) and proper service wiring in server.go.

Appropriate use of Go idioms:

  • Proper use of http.Handler patterns
  • Clean error handling with structured logging
  • Good use of constants for multipliers (pkg/rpc/server/http.go:14-21)

Interface design: The BestKnownHeightProvider function type (pkg/rpc/server/http.go:26) provides good abstraction for the readiness endpoint.

⚠️ Areas for Improvement

Hardcoded multipliers lack flexibility (pkg/rpc/server/http.go:17-21):

const (
    healthCheckWarnMultiplier = 3
    healthCheckFailMultiplier = 5
)

Issue: For chains with very fast block times (e.g., 250ms), this means:

  • WARN threshold: 750ms (3 × 250ms)
  • FAIL threshold: 1250ms (5 × 250ms)

As @auricom correctly noted, this provides only ~1 second of tolerance before a node is marked for termination in an HA cluster. For chains with slow block times (e.g., 15s), the thresholds become 45s/75s, which may be too lenient.

Recommendation: Consider making these configurable or using absolute thresholds with sensible defaults:

// Suggested approach: use absolute minimum thresholds
// (relies on the built-in max introduced in Go 1.21; time.Duration values compare directly)
warnThreshold := max(blockTime*3, 3*time.Second)
failThreshold := max(blockTime*5, 5*time.Second)

This ensures reasonable behavior across different block time configurations while maintaining the relative scaling for longer block times.


2. Error Handling & Resilience

✅ Strengths

Proper error handling (pkg/rpc/server/http.go:37-42): Errors from GetState are properly logged and return FAIL with 503 Service Unavailable.

Graceful degradation: Non-aggregator nodes and nodes with no blocks yet return OK (lines 80-84), which is appropriate for startup scenarios.

Appropriate log levels: Uses Error() for FAIL states and Warn() for WARN states.

⚠️ Minor Issue

Log level for FAIL state (pkg/rpc/server/http.go:60-66): The gemini-code-assist review suggested using Error instead of Warn for FAIL states, but this is already implemented correctly: the code uses logger.Error() for FAIL states. Good!


3. Security

✅ No Security Concerns

  • No sensitive data exposed in logs or responses
  • Proper validation of state data before use
  • No authentication required for health endpoints (appropriate for health checks)
  • Read-only operations with no state modification

4. Performance & Resource Efficiency

✅ Strengths

Efficient implementation:

  • Single store lookup per health check
  • No blocking operations
  • Minimal CPU overhead with simple time comparisons
  • No goroutine leaks or resource accumulation

💡 Observation

Lazy mode handling (pkg/rpc/server/http.go:51-54): The code correctly uses LazyBlockInterval instead of BlockTime for lazy mode aggregators. This is the right approach.


5. Testing & Reliability

✅ Excellent Test Coverage

Unit tests (pkg/rpc/server/server_test.go:363-567):

  • Non-aggregator behavior ✓
  • Aggregator with no blocks ✓
  • Aggregator with recent blocks ✓
  • Slow block production (WARN) ✓
  • Stopped block production (FAIL) ✓
  • Lazy mode ✓
  • State error handling ✓

Client tests (pkg/rpc/client/client_test.go:249-338):

  • All health status scenarios covered ✓

Integration test (node/single_sequencer_integration_test.go:451-491):

  • Real-world scenario with actual block production ✓
  • Uses require.Eventually for polling instead of fixed sleeps ✓

⚠️ Test Robustness Issue

Integration test timing (node/single_sequencer_integration_test.go:471-487): The test uses require.Eventually, which is excellent; @gemini-code-assist's suggestion to poll instead of sleeping has already been implemented correctly. Good work!


6. Documentation & Developer Experience

⚠️ Areas for Improvement

Missing configuration documentation: The health check multipliers (3x and 5x) are not documented in the configuration flags or comments. Developers won't know these values without reading the source code.

Endpoint path discrepancy: The endpoint is at /health/live but the issue and PR discussion refer to "health endpoint" generically. The PR description should clarify:

  • /health/live - Liveness check (monitors block production for aggregators)
  • /health/ready - Readiness check (monitors peer connectivity and sync status)
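
For operators wiring these endpoints into probes, a minimal client-side sketch (the listen address is a placeholder, not a documented default):

resp, err := http.Get("http://localhost:7331/health/live") // placeholder address
if err != nil {
    log.Fatalf("liveness probe failed: %v", err)
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
fmt.Printf("status=%d body=%s\n", resp.StatusCode, body)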

LazyBlockInterval usage not documented: The code correctly uses LazyBlockInterval for lazy mode, but this behavior isn't documented in comments.


7. Critical Concerns from Other Reviewers

🚨 High Priority Issues

1. Threshold Configuration (@auricom, @tac0turtle) - CRITICAL

The current relative threshold approach (3x/5x block time) can be problematic:

  • Fast chains (250ms blocks): Only 750ms/1250ms tolerance - too aggressive
  • Slow chains (15s blocks): 45s/75s tolerance - potentially too lenient

Recommendation:

  • Add configuration flags for these thresholds
  • Consider absolute minimum thresholds (e.g., 3-5 seconds) as suggested by @auricom
  • Document the trade-offs clearly

2. Incomplete Health Checks (@auricom) - HIGH PRIORITY

@auricom correctly identified that the PR only covers point 2 of 4 required health checks:

  1. RPC server operational - Not validated by this PR
  2. Aggregator block production - Covered by this PR
  3. Full node synchronization - Not validated by this PR
  4. P2P network ready - Partially covered by readiness endpoint, but not in liveness

Current Implementation:

  • Liveness (/health/live): Only checks aggregator block production
  • Readiness (/health/ready): Checks peer connectivity and sync status

Issue: The liveness endpoint should also validate that the RPC server can respond to queries. Currently, if the endpoint returns a response at all, that implicitly shows the RPC server is working, but there's no validation of the underlying query path (e.g., can GetState actually work?).

Recommendation for follow-up:

// Pseudo-code for more comprehensive liveness check
func healthLive() {
    // 1. Validate store is accessible (already done via GetState)
    
    // 2. For aggregators: check block production (current implementation)
    
    // 3. For full nodes: check sync status
    if !cfg.Node.Aggregator {
        // Verify node is syncing (check if blocks are advancing)
        // This would require tracking height over time
    }
    
    // 4. P2P health (if not aggregator)
    if pm != nil && !cfg.Node.Aggregator {
        // Verify P2P is functional (can discover/connect to peers)
    }
}

3. Non-Aggregator Health (@tac0turtle)

@tac0turtle noted: "we should have something with syncing for non aggregator nodes as well. maybe a follow up."

This is a valid concern. Currently, non-aggregator nodes always return OK from /health/live, even if they've stopped syncing. The /health/ready endpoint does check sync status, but liveness should also detect if a full node has stopped making progress.
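
One way such a follow-up could look, purely as a sketch (none of these names exist in the PR; the idea is to track whether height keeps advancing and flag a stall):

type syncProgress struct {
    mu         sync.Mutex
    lastHeight uint64
    lastChange time.Time
}

// Observe records a new height sample; the timestamp only moves when height advances.
func (p *syncProgress) Observe(height uint64, now time.Time) {
    p.mu.Lock()
    defer p.mu.Unlock()
    if height > p.lastHeight {
        p.lastHeight = height
        p.lastChange = now
    }
}

// Stalled reports whether no progress has been observed for longer than maxAge.
func (p *syncProgress) Stalled(now time.Time, maxAge time.Duration) bool {
    p.mu.Lock()
    defer p.mu.Unlock()
    return !p.lastChange.IsZero() && now.Sub(p.lastChange) > maxAge
}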


8. Code-Specific Issues

Minor Issues

1. Test code duplication (pkg/rpc/client/client_test.go):

The setupTestServer function works well, but there's repeated config setup across tests. Consider extracting common config builders:

func aggregatorConfig(blockTime time.Duration) config.Config {
    cfg := config.DefaultConfig()
    cfg.Node.Aggregator = true
    cfg.Node.BlockTime.Duration = blockTime
    return cfg
}

2. Magic numbers in tests:

Tests use -500 * time.Millisecond, -4 * time.Second, -10 * time.Second without explaining the relationship to thresholds. Adding comments would improve readability:

// Block is 500ms old, well within 3x threshold (3s) -> should be PASS
state := types.State{
    LastBlockHeight: 100,
    LastBlockTime:   time.Now().Add(-500 * time.Millisecond),
}

Summary & Recommendations

What This PR Does Well ✅

  1. Solves the core issue: detects stopped block production for aggregators
  2. Excellent test coverage with proper use of mocks and integration tests
  3. Clean, idiomatic Go code with proper error handling
  4. Appropriate log levels and structured logging
  5. Handles edge cases (no blocks yet, lazy mode, state errors)

Critical Actions Required 🚨

Before Merging:

  1. Decide on threshold strategy: Relative multipliers vs. absolute thresholds vs. configurable
  2. Document limitations: This PR only covers aggregator block production monitoring, not full health validation
  3. Consider @auricom's concerns: For HA setups with fast block times, current thresholds may be too aggressive

Follow-up Work (separate PRs):

  1. Add health monitoring for non-aggregator nodes (sync progress)
  2. Make thresholds configurable via config flags
  3. Consider more comprehensive liveness checks (RPC query validation, P2P health)
  4. Add metrics/telemetry for health check results

Verdict

This is good work that addresses the immediate issue (#2643), but the scope is narrower than a complete health check solution. The implementation is solid, tests are comprehensive, and the code quality is high. However, the threshold configuration needs attention before this goes into production environments, especially for HA clusters with fast block times.

Recommendation:

  • ✅ Approve the implementation approach
  • ⚠️ Request configuration flexibility for thresholds before merge
  • 📝 Document that this is phase 1 of health monitoring (aggregator-focused)
  • 🎯 Plan follow-up PRs for non-aggregator health and configurable thresholds


@gemini-code-assist
Contributor

Summary of Changes

Hello @randygrok, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves issue #2643 by significantly improving the accuracy of the health endpoint for aggregator nodes. Previously, the endpoint might incorrectly report a healthy status even when a node had ceased block production. The updated logic now checks the time since the last block was produced against dynamically calculated warning and failure thresholds, derived from the node's configured block time. This ensures that operational issues related to block generation are promptly detected and reflected in the health status, providing better visibility into node performance and stability.

Highlights

  • Enhanced Health Endpoint Logic: The health endpoint ('/livez') now actively monitors block production for aggregator nodes, reporting 'WARN' or 'FAIL' statuses if blocks are not produced within configurable time thresholds.
  • Configurable Thresholds: Introduced 'healthCheckWarnMultiplier' (3x block time) and 'healthCheckFailMultiplier' (5x block time) constants to define when a node's block production is considered slow or stopped.
  • Comprehensive Testing: Added new integration tests and expanded unit tests to cover various scenarios for the health endpoint, including non-aggregator nodes, aggregator nodes with recent blocks, slow block production, stopped block production, and lazy mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the issue of the health endpoint not detecting stopped block production for aggregator nodes. The core logic is sound, and the changes are accompanied by a comprehensive set of unit tests and a new integration test. My feedback includes suggestions to improve the robustness of the integration test by replacing fixed-duration sleeps with polling, refactoring duplicated test setup code for better maintainability, and adjusting a log level to better reflect the severity of a health check failure.

@randygrok randygrok force-pushed the health-endpoint-block-check branch from 39dc439 to 819b015 on November 2, 2025 at 10:04
@codecov

codecov bot commented Nov 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.80%. Comparing base (3d98502) to head (7769c77).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2800      +/-   ##
==========================================
+ Coverage   62.61%   62.80%   +0.18%     
==========================================
  Files          82       82              
  Lines        7334     7371      +37     
==========================================
+ Hits         4592     4629      +37     
  Misses       2197     2197              
  Partials      545      545              
Flag: combined · Coverage: 62.80% <100.00%> (+0.18%) ⬆️


randygrok and others added 3 commits November 2, 2025 11:18
Remove createCustomTestServer function which was redundant after making
setupTestServer accept optional config parameter. This eliminates 36 lines
of duplicated server setup code.

Changes:
- Make setupTestServer accept variadic config parameter
- Update all test cases to use setupTestServer directly
- Remove createCustomTestServer function entirely

Result: -21 net lines of code, improved maintainability, cleaner API.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…n test

Replace fixed duration time.Sleep calls with require.Eventually polling
to make the health endpoint integration test more robust and less flaky.

Changes:
- Use require.Eventually to poll for health state transitions
- Poll every 100ms instead of fixed sleeps
- Generous timeouts (5s and 10s) that terminate early on success
- Better error handling during polling

Benefits:
- More resilient to timing variations in CI/CD environments
- Faster test execution (completes as soon as conditions are met)
- Eliminates magic numbers (1700ms compensation)
- Expresses intent clearly (wait until condition is met)
- Non-flaky (tested 3x consecutively)
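
A condensed illustration of the polling pattern this commit describes; the client handle and status constant are assumptions about the test's shape, not the exact identifiers:

// Poll until the endpoint reports WARN instead of sleeping a fixed duration.
require.Eventually(t, func() bool {
    status, err := client.GetHealth(ctx) // hypothetical test client call
    return err == nil && status == healthStatusWarn
}, 5*time.Second, 100*time.Millisecond)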

// healthCheckFailMultiplier is the multiplier for block time to determine FAIL threshold
// If no block has been produced in (blockTime * healthCheckFailMultiplier), return FAIL
healthCheckFailMultiplier = 5
Contributor

we should double check with @auricom if this is good or too fast

Contributor

@auricom auricom Nov 4, 2025

This would mean that for chains with a block time of 250ms, a node could tolerate only about 1 second of random hang time before the sequencer is flagged as an eligible candidate for hard process termination.

We might improve safety by using an absolute timeout threshold (in seconds) rather than a configuration-relative value. This would provide more predictable failure detection across different chain configurations.

I'd say 3-5 seconds would be ok BUT I believe we should hear @alpe's perspective on what constitutes an appropriate timeout threshold before forcing a sequencer in an HA cluster to be considered stale, so leadership can be transferred to another.
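
A sketch of the absolute-threshold alternative being discussed; the config fields below are hypothetical, not existing options:

type HealthConfig struct {
    WarnTimeout time.Duration // e.g. 3 * time.Second
    FailTimeout time.Duration // e.g. 5 * time.Second
}

// thresholds falls back to block-time multiples when no absolute values are configured.
func thresholds(hc HealthConfig, blockTime time.Duration) (warn, fail time.Duration) {
    warn, fail = hc.WarnTimeout, hc.FailTimeout
    if warn == 0 {
        warn = 3 * blockTime
    }
    if fail == 0 {
        fail = 5 * blockTime
    }
    return warn, fail
}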

}
}

// For non-aggregator nodes or if checks pass, return healthy
Contributor

we should have something with syncing for non aggregator nodes as well. maybe a follow up.

tac0turtle previously approved these changes Nov 3, 2025
Contributor

@tac0turtle tac0turtle left a comment

utACK, we should make sure this covers @auricom's needs before merging

@randygrok randygrok added this pull request to the merge queue Nov 3, 2025
@tac0turtle tac0turtle removed this pull request from the merge queue due to a manual request Nov 3, 2025
@auricom
Contributor

auricom commented Nov 4, 2025

A healthcheck endpoint is used by infrastructure automation to determine if a process can respond to client requests.

AFAIK based on this PR, the existing endpoint did not perform substantive validation.

For a node to be considered healthy, the following conditions should be met:

1- The RPC server is operational and responds correctly to basic client queries per specifications
2- The aggregator is producing blocks at the expected rate
3- The full node is synchronizing with the network
4- The P2P network is ready to accept incoming connections (if enabled)

I cannot verify whether this change effectively validates items 1 and 4. And I may be overly cautious, so I would appreciate your expertise on these points 🧑‍💻

@tac0turtle tac0turtle dismissed their stale review November 4, 2025 15:54

dismissing so we can cover all paths that claude mentioned

@randygrok
Contributor Author

thanks @auricom, let me check those points.

By your definition, only point 2 is covered.

randygrok and others added 2 commits November 4, 2025 20:14
… HTTP endpoint

- Removed the HealthService and its related proto definitions.
- Implemented a new HTTP health check endpoint at `/health/live`.
- Updated the Client to use the new HTTP health check instead of the gRPC call.
- Enhanced health check logic to return PASS, WARN, or FAIL based on block production status.
- Modified tests to validate the new health check endpoint and its responses.
- Updated server to register custom HTTP endpoints including the new health check.

Successfully merging this pull request may close these issues.

bug: health endpoint still reports OK when node has stopped producing blocks.
