Conversation

@randygrok
Contributor

Closes #2643

Overview

This commit fixes issue #2643 where the health endpoint still reports
OK when a node has stopped producing blocks.

Changes:
- Updated HealthServer to accept store, config, and logger dependencies
- Implemented block production monitoring in the Livez endpoint:
  * For aggregator nodes, checks if LastBlockTime is recent
  * Returns WARN if block production is slow (> 3x block time)
  * Returns FAIL if block production has stopped (> 5x block time)
  * Uses LazyBlockInterval for lazy mode aggregators
  * Non-aggregator nodes continue to return PASS
- Added constants for health check thresholds:
  * healthCheckWarnMultiplier = 3
  * healthCheckFailMultiplier = 5
- Added comprehensive unit tests covering all scenarios:
  Server tests (pkg/rpc/server/server_test.go):
  * Non-aggregator nodes
  * Aggregator with no blocks
  * Aggregator with recent blocks (PASS)
  * Aggregator with slow production (WARN)
  * Aggregator with stopped production (FAIL)
  * Lazy aggregator with correct thresholds
  * Error handling
  Client tests (pkg/rpc/client/client_test.go):
  * Non-aggregator returns PASS
  * Aggregator with recent blocks returns PASS
  * Aggregator with slow block production returns WARN
  * Aggregator with stopped block production returns FAIL
- Updated setupTestServer to pass new dependencies
- Added createCustomTestServer helper for testing with custom configs

The thresholds scale with the node's BlockTime or LazyBlockInterval
settings, making the health check adaptive to different node
configurations.
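
A minimal sketch of the liveness decision described above, with illustrative parameter names and a lazy-mode flag as assumptions (the real implementation reads these values from the node's config and store in pkg/rpc/server/http.go):

const (
    healthCheckWarnMultiplier = 3
    healthCheckFailMultiplier = 5
)

func livenessStatus(isAggregator, lazyMode bool, blockTime, lazyInterval time.Duration, lastBlockTime time.Time, hasBlocks bool) string {
    if !isAggregator || !hasBlocks {
        return "PASS" // non-aggregators and freshly started nodes stay healthy
    }
    // Lazy aggregators are judged against LazyBlockInterval instead of BlockTime.
    interval := blockTime
    if lazyMode {
        interval = lazyInterval
    }
    elapsed := time.Since(lastBlockTime)
    switch {
    case elapsed > healthCheckFailMultiplier*interval:
        return "FAIL" // block production appears stopped
    case elapsed > healthCheckWarnMultiplier*interval:
        return "WARN" // block production is slow
    default:
        return "PASS"
    }
}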

Fixes #2643
@github-actions
Contributor

github-actions bot commented Nov 2, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ❌ failed (1) · Updated (UTC): Nov 5, 2025, 9:34 AM

@claude
Contributor

claude bot commented Nov 2, 2025

Claude finished @randygrok's task.


Comprehensive Code Review

I've completed a thorough review of the health endpoint implementation. Here's my analysis:


Executive Summary

This PR successfully implements health monitoring for aggregator nodes by detecting stopped block production. The implementation is sound, but there are important considerations around threshold configuration and scope that should be addressed.


1. Code Quality & Architecture

✅ Strengths

Clean separation of concerns: The health check logic is well-organized with HTTP endpoint handlers in http.go (lines 32-85) and proper service wiring in server.go.

Appropriate use of Go idioms:

  • Proper use of http.Handler patterns
  • Clean error handling with structured logging
  • Good use of constants for multipliers (pkg/rpc/server/http.go:14-21)

Interface design: The BestKnownHeightProvider function type (pkg/rpc/server/http.go:26) provides good abstraction for the readiness endpoint.

⚠️ Areas for Improvement

Hardcoded multipliers lack flexibility (pkg/rpc/server/http.go:17-21):

const (
    healthCheckWarnMultiplier = 3
    healthCheckFailMultiplier = 5
)

Issue: For chains with very fast block times (e.g., 250ms), this means:

  • WARN threshold: 750ms (3 × 250ms)
  • FAIL threshold: 1250ms (5 × 250ms)

As @auricom correctly noted, this provides only ~1 second of tolerance before a node is marked for termination in an HA cluster. For chains with slow block times (e.g., 15s), the thresholds become 45s/75s, which may be too lenient.

Recommendation: Consider making these configurable or using absolute thresholds with sensible defaults:

// Suggested approach: use absolute minimum thresholds
// (relies on the built-in max introduced in Go 1.21; time.Duration values compare directly)
warnThreshold := max(blockTime*3, 3*time.Second)
failThreshold := max(blockTime*5, 5*time.Second)

This ensures reasonable behavior across different block time configurations while maintaining the relative scaling for longer block times.


2. Error Handling & Resilience

✅ Strengths

Proper error handling (pkg/rpc/server/http.go:37-42): Errors from GetState are properly logged and return FAIL with 503 Service Unavailable.

Graceful degradation: Non-aggregator nodes and nodes with no blocks yet return OK (lines 80-84), which is appropriate for startup scenarios.

Appropriate log levels: Uses Error() for FAIL states and Warn() for WARN states.

⚠️ Minor Issue

Log level for FAIL state (pkg/rpc/server/http.go:60-66): The gemini-code-assist review suggested using Error instead of Warn for FAIL states, but this is already implemented correctly: the code uses logger.Error() for FAIL states. Good!


3. Security

✅ No Security Concerns

  • No sensitive data exposed in logs or responses
  • Proper validation of state data before use
  • No authentication required for health endpoints (appropriate for health checks)
  • Read-only operations with no state modification

4. Performance & Resource Efficiency

✅ Strengths

Efficient implementation:

  • Single store lookup per health check
  • No blocking operations
  • Minimal CPU overhead with simple time comparisons
  • No goroutine leaks or resource accumulation

💡 Observation

Lazy mode handling (pkg/rpc/server/http.go:51-54): The code correctly uses LazyBlockInterval instead of BlockTime for lazy mode aggregators. This is the right approach.


5. Testing & Reliability

✅ Excellent Test Coverage

Unit tests (pkg/rpc/server/server_test.go:363-567):

  • Non-aggregator behavior ✓
  • Aggregator with no blocks ✓
  • Aggregator with recent blocks ✓
  • Slow block production (WARN) ✓
  • Stopped block production (FAIL) ✓
  • Lazy mode ✓
  • State error handling ✓

Client tests (pkg/rpc/client/client_test.go:249-338):

  • All health status scenarios covered ✓

Integration test (node/single_sequencer_integration_test.go:451-491):

  • Real-world scenario with actual block production ✓
  • Uses require.Eventually for polling instead of fixed sleeps ✓

⚠️ Test Robustness Issue

Integration test timing (node/single_sequencer_integration_test.go:471-487): The test uses require.Eventually, which is excellent; @gemini-code-assist's suggestion to poll instead of sleeping has already been implemented correctly. Good work!


6. Documentation & Developer Experience

⚠️ Areas for Improvement

Missing configuration documentation: The health check multipliers (3x and 5x) are not documented in the configuration flags or comments. Developers won't know these values without reading the source code.

Endpoint path discrepancy: The endpoint is at /health/live but the issue and PR discussion refer to "health endpoint" generically. The PR description should clarify:

  • /health/live - Liveness check (monitors block production for aggregators)
  • /health/ready - Readiness check (monitors peer connectivity and sync status)
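
For operators wiring these endpoints into probes, a minimal client-side sketch (the listen address is a placeholder, not a documented default):

resp, err := http.Get("http://localhost:7331/health/live") // placeholder address
if err != nil {
    log.Fatalf("liveness probe failed: %v", err)
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
fmt.Printf("status=%d body=%s\n", resp.StatusCode, body)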

LazyBlockInterval usage not documented: The code correctly uses LazyBlockInterval for lazy mode, but this behavior isn't documented in comments.


7. Critical Concerns from Other Reviewers

🚨 High Priority Issues

1. Threshold Configuration (@auricom, @tac0turtle) - CRITICAL

The current relative threshold approach (3x/5x block time) can be problematic:

  • Fast chains (250ms blocks): Only 750ms/1250ms tolerance - too aggressive
  • Slow chains (15s blocks): 45s/75s tolerance - potentially too lenient

Recommendation:

  • Add configuration flags for these thresholds
  • Consider absolute minimum thresholds (e.g., 3-5 seconds) as suggested by @auricom
  • Document the trade-offs clearly

2. Incomplete Health Checks (@auricom) - HIGH PRIORITY

@auricom correctly identified that the PR only covers point 2 of 4 required health checks:

  1. RPC server operational - Not validated by this PR
  2. Aggregator block production - Covered by this PR
  3. Full node synchronization - Not validated by this PR
  4. P2P network ready - Partially covered by readiness endpoint, but not in liveness

Current Implementation:

  • Liveness (/health/live): Only checks aggregator block production
  • Readiness (/health/ready): Checks peer connectivity and sync status

Issue: The liveness endpoint should also validate that the RPC server can respond to queries. Currently, if the endpoint returns a response at all, that implicitly shows the RPC server is working, but there's no validation of the underlying query path (e.g., can GetState actually work?).

Recommendation for follow-up:

// Pseudo-code for more comprehensive liveness check
func healthLive() {
    // 1. Validate store is accessible (already done via GetState)
    
    // 2. For aggregators: check block production (current implementation)
    
    // 3. For full nodes: check sync status
    if !cfg.Node.Aggregator {
        // Verify node is syncing (check if blocks are advancing)
        // This would require tracking height over time
    }
    
    // 4. P2P health (if not aggregator)
    if pm != nil && !cfg.Node.Aggregator {
        // Verify P2P is functional (can discover/connect to peers)
    }
}

3. Non-Aggregator Health (@tac0turtle)

@tac0turtle noted: "we should have something with syncing for non aggregator nodes as well. maybe a follow up."

This is a valid concern. Currently, non-aggregator nodes always return OK from /health/live, even if they've stopped syncing. The /health/ready endpoint does check sync status, but liveness should also detect if a full node has stopped making progress.
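
One way such a follow-up could look, purely as a sketch (none of these names exist in the PR; the idea is to track whether height keeps advancing and flag a stall):

type syncProgress struct {
    mu         sync.Mutex
    lastHeight uint64
    lastChange time.Time
}

// Observe records a new height sample; the timestamp only moves when height advances.
func (p *syncProgress) Observe(height uint64, now time.Time) {
    p.mu.Lock()
    defer p.mu.Unlock()
    if height > p.lastHeight {
        p.lastHeight = height
        p.lastChange = now
    }
}

// Stalled reports whether no progress has been observed for longer than maxAge.
func (p *syncProgress) Stalled(now time.Time, maxAge time.Duration) bool {
    p.mu.Lock()
    defer p.mu.Unlock()
    return !p.lastChange.IsZero() && now.Sub(p.lastChange) > maxAge
}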


8. Code-Specific Issues

Minor Issues

1. Test code duplication (pkg/rpc/client/client_test.go):

The setupTestServer function works well, but there's repeated config setup across tests. Consider extracting common config builders:

func aggregatorConfig(blockTime time.Duration) config.Config {
    cfg := config.DefaultConfig()
    cfg.Node.Aggregator = true
    cfg.Node.BlockTime.Duration = blockTime
    return cfg
}

2. Magic numbers in tests:

Tests use -500 * time.Millisecond, -4 * time.Second, -10 * time.Second without explaining the relationship to thresholds. Adding comments would improve readability:

// Block is 500ms old, well within 3x threshold (3s) -> should be PASS
state := types.State{
    LastBlockHeight: 100,
    LastBlockTime:   time.Now().Add(-500 * time.Millisecond),
}

Summary & Recommendations

What This PR Does Well ✅

  1. Solves the core issue: detects stopped block production for aggregators
  2. Excellent test coverage with proper use of mocks and integration tests
  3. Clean, idiomatic Go code with proper error handling
  4. Appropriate log levels and structured logging
  5. Handles edge cases (no blocks yet, lazy mode, state errors)

Critical Actions Required 🚨

Before Merging:

  1. Decide on threshold strategy: Relative multipliers vs. absolute thresholds vs. configurable
  2. Document limitations: This PR only covers aggregator block production monitoring, not full health validation
  3. Consider @auricom's concerns: For HA setups with fast block times, current thresholds may be too aggressive

Follow-up Work (separate PRs):

  1. Add health monitoring for non-aggregator nodes (sync progress)
  2. Make thresholds configurable via config flags
  3. Consider more comprehensive liveness checks (RPC query validation, P2P health)
  4. Add metrics/telemetry for health check results

Verdict

This is good work that addresses the immediate issue (#2643), but the scope is narrower than a complete health check solution. The implementation is solid, tests are comprehensive, and the code quality is high. However, the threshold configuration needs attention before this goes into production environments, especially for HA clusters with fast block times.

Recommendation:

  • ✅ Approve the implementation approach
  • ⚠️ Request configuration flexibility for thresholds before merge
  • 📝 Document that this is phase 1 of health monitoring (aggregator-focused)
  • 🎯 Plan follow-up PRs for non-aggregator health and configurable thresholds


@gemini-code-assist
Contributor

Summary of Changes

Hello @randygrok, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves issue #2643 by significantly improving the accuracy of the health endpoint for aggregator nodes. Previously, the endpoint might incorrectly report a healthy status even when a node had ceased block production. The updated logic now checks the time since the last block was produced against dynamically calculated warning and failure thresholds, derived from the node's configured block time. This ensures that operational issues related to block generation are promptly detected and reflected in the health status, providing better visibility into node performance and stability.

Highlights

  • Enhanced Health Endpoint Logic: The health endpoint ('/livez') now actively monitors block production for aggregator nodes, reporting 'WARN' or 'FAIL' statuses if blocks are not produced within configurable time thresholds.
  • Configurable Thresholds: Introduced 'healthCheckWarnMultiplier' (3x block time) and 'healthCheckFailMultiplier' (5x block time) constants to define when a node's block production is considered slow or stopped.
  • Comprehensive Testing: Added new integration tests and expanded unit tests to cover various scenarios for the health endpoint, including non-aggregator nodes, aggregator nodes with recent blocks, slow block production, stopped block production, and lazy mode.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the issue of the health endpoint not detecting stopped block production for aggregator nodes. The core logic is sound, and the changes are accompanied by a comprehensive set of unit tests and a new integration test. My feedback includes suggestions to improve the robustness of the integration test by replacing fixed-duration sleeps with polling, refactoring duplicated test setup code for better maintainability, and adjusting a log level to better reflect the severity of a health check failure.

@randygrok randygrok force-pushed the health-endpoint-block-check branch from 39dc439 to 819b015 on November 2, 2025 at 10:04
@codecov

codecov bot commented Nov 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.80%. Comparing base (3d98502) to head (7769c77).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2800      +/-   ##
==========================================
+ Coverage   62.61%   62.80%   +0.18%     
==========================================
  Files          82       82              
  Lines        7334     7371      +37     
==========================================
+ Hits         4592     4629      +37     
  Misses       2197     2197              
  Partials      545      545              
Flag: combined · Coverage: 62.80% <100.00%> (+0.18%) ⬆️


randygrok and others added 3 commits November 2, 2025 11:18
Remove createCustomTestServer function which was redundant after making
setupTestServer accept optional config parameter. This eliminates 36 lines
of duplicated server setup code.

Changes:
- Make setupTestServer accept variadic config parameter
- Update all test cases to use setupTestServer directly
- Remove createCustomTestServer function entirely

Result: -21 net lines of code, improved maintainability, cleaner API.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…n test

Replace fixed duration time.Sleep calls with require.Eventually polling
to make the health endpoint integration test more robust and less flaky.

Changes:
- Use require.Eventually to poll for health state transitions
- Poll every 100ms instead of fixed sleeps
- Generous timeouts (5s and 10s) that terminate early on success
- Better error handling during polling

Benefits:
- More resilient to timing variations in CI/CD environments
- Faster test execution (completes as soon as conditions are met)
- Eliminates magic numbers (1700ms compensation)
- Expresses intent clearly (wait until condition is met)
- Non-flaky (tested 3x consecutively)
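
A condensed illustration of the polling pattern this commit describes; the client handle and status constant are assumptions about the test's shape, not the exact identifiers:

// Poll until the endpoint reports WARN instead of sleeping a fixed duration.
require.Eventually(t, func() bool {
    status, err := client.GetHealth(ctx) // hypothetical test client call
    return err == nil && status == healthStatusWarn
}, 5*time.Second, 100*time.Millisecond)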

// healthCheckFailMultiplier is the multiplier for block time to determine FAIL threshold
// If no block has been produced in (blockTime * healthCheckFailMultiplier), return FAIL
healthCheckFailMultiplier = 5
Contributor

we should double check with @auricom if this is good or too fast

Contributor

@auricom auricom Nov 4, 2025

This would mean that for chains with a block time of 250ms, a node could tolerate only about 1 second of random hang time before the sequencer is flagged as an eligible candidate for hard process termination.

We might improve safety by using an absolute timeout threshold (in seconds) rather than a configuration-relative value. This would provide more predictable failure detection across different chain configurations.

I'd say 3-5 seconds would be ok BUT I believe we should hear @alpe's perspective on what constitutes an appropriate timeout threshold before forcing a sequencer in an HA cluster to be considered stale, so leadership can be transferred to another.
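
A sketch of the absolute-threshold alternative being discussed; the config fields below are hypothetical, not existing options:

type HealthConfig struct {
    WarnTimeout time.Duration // e.g. 3 * time.Second
    FailTimeout time.Duration // e.g. 5 * time.Second
}

// thresholds falls back to block-time multiples when no absolute values are configured.
func thresholds(hc HealthConfig, blockTime time.Duration) (warn, fail time.Duration) {
    warn, fail = hc.WarnTimeout, hc.FailTimeout
    if warn == 0 {
        warn = 3 * blockTime
    }
    if fail == 0 {
        fail = 5 * blockTime
    }
    return warn, fail
}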

}
}

// For non-aggregator nodes or if checks pass, return healthy
Contributor

we should have something with syncing for non aggregator nodes as well. maybe a follow up.

tac0turtle previously approved these changes Nov 3, 2025
Contributor

@tac0turtle tac0turtle left a comment

utACK, we should make sure this covers @auricom's needs before merging

@randygrok randygrok added this pull request to the merge queue Nov 3, 2025
@tac0turtle tac0turtle removed this pull request from the merge queue due to a manual request Nov 3, 2025
@auricom
Contributor

auricom commented Nov 4, 2025

A healthcheck endpoint is used by infrastructure automation to determine if a process can respond to client requests.

AFAIK based on this PR, the existing endpoint did not perform substantive validation.

For a node to be considered healthy, the following conditions should be met:

1- The RPC server is operational and responds correctly to basic client queries per specifications
2- The aggregator is producing blocks at the expected rate
3- The full node is synchronizing with the network
4- The P2P network is ready to accept incoming connections (if enabled)

I cannot verify whether this change effectively validates items 1 and 4. And I may be overly cautious, so I would appreciate your expertise on these points 🧑‍💻

@tac0turtle tac0turtle dismissed their stale review November 4, 2025 15:54

dismissing so we can cover all paths that claude mentioned

@randygrok
Contributor Author

thanks @auricom, let me check those points.

By your definition, only point 2 is covered.

randygrok and others added 2 commits November 4, 2025 20:14
… HTTP endpoint

- Removed the HealthService and its related proto definitions.
- Implemented a new HTTP health check endpoint at `/health/live`.
- Updated the Client to use the new HTTP health check instead of the gRPC call.
- Enhanced health check logic to return PASS, WARN, or FAIL based on block production status.
- Modified tests to validate the new health check endpoint and its responses.
- Updated server to register custom HTTP endpoints including the new health check.

Successfully merging this pull request may close these issues.

bug: health endpoint still reports OK when node has stopped producing blocks.
