feat: make validator startup more resilient#7342
Conversation
|
📝 WalkthroughWalkthroughThe change adds a retry wrapper to initialize the merkle tree hook sync: a new constant and helper attempt starting the task up to N times with sleep/backoff, updates logging and critical-error metrics on failures, and replaces the previous direct spawn flow in the main run loop. Changes
Sequence DiagramsequenceDiagram
participant Main as Main Run Loop
participant Counter as Count Merkle Tree Hook
participant Retry as try_n_times_to_run_merkle_tree_hook_sync
participant Spawner as spawn merkle_tree_hook_sync Task
Main->>Counter: Request hook count
alt Count succeeds
Counter-->>Main: Return count
Main->>Retry: Invoke retry wrapper
loop attempts (1..N)
Retry->>Spawner: Try spawn task
alt Spawn succeeds
Spawner-->>Retry: Task running
Retry-->>Main: Success (break)
else Spawn fails
Spawner-->>Retry: Error
Retry->>Retry: Log failure, mark critical-error metric
Retry->>Retry: Sleep (RPC_RETRY_SLEEP_DURATION)
end
end
alt All attempts exhausted
Retry-->>Main: Failure after N attempts
Main->>Main: Log critical error & abort run
end
else Count fails
Counter-->>Main: Error
Main->>Main: Log error & sleep, then retry loop
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
rust/main/agents/validator/src/validator.rs (1)
335-364: Retry logic works, but consider capturing the last error for better debugging.The retry mechanism does what it needs to—keeps trying until it succeeds or runs out of patience. The metrics handling is spot-on too. A few things that might make this even better:
- The final error message at lines 360-363 doesn't include the last error that was encountered. When debugging failures, you'll want to know what actually went wrong on the last attempt.
- The
attempt_countstarts at 0 (line 346), so logs show "attempt_count = 0" for the first try. Might be clearer to show "attempt 1 of 10" style.- All errors get the same treatment—network hiccups and config problems both trigger the same retry. That's probably fine for this use case since most failures will be transient RPC issues.
Here's a diff to capture the last error for better debugging:
async fn try_n_times_to_run_merkle_tree_hook_sync( &self, attempts: usize, ) -> eyre::Result<JoinHandle<()>> { + let mut last_err = None; for i in 0..attempts { let task = match self.run_merkle_tree_hook_sync().await { Ok(s) => s, Err(err) => { tracing::error!( ?err, domain = self.origin_chain.name(), attempt_count = i, "Failed to run merkle tree hook sync" ); + last_err = Some(err); sleep(RPC_RETRY_SLEEP_DURATION).await; continue; } }; self.chain_metrics .set_critical_error(self.origin_chain.name(), false); return Ok(task); } self.chain_metrics .set_critical_error(self.origin_chain.name(), true); - Err(eyre::eyre!( - "Failed to initialize merkle tree hook sync after {} attempts", - attempts - )) + let base_msg = format!( + "Failed to initialize merkle tree hook sync after {} attempts", + attempts + ); + if let Some(err) = last_err { + Err(eyre::eyre!("{}: {:?}", base_msg, err)) + } else { + Err(eyre::eyre!(base_msg)) + } }And if you want to make the attempt count more intuitive:
Err(err) => { tracing::error!( ?err, domain = self.origin_chain.name(), - attempt_count = i, + attempt_count = i + 1, + total_attempts = attempts, "Failed to run merkle tree hook sync" ); last_err = Some(err); sleep(RPC_RETRY_SLEEP_DURATION).await; continue; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rust/main/agents/validator/src/validator.rs(4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
rust/main/agents/validator/**/src/**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
Maintain validator agent Rust sources under rust/main/agents/validator
Files:
rust/main/agents/validator/src/validator.rs
rust/main/**/src/**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
Run cargo clippy for Rust code linting
Files:
rust/main/agents/validator/src/validator.rs
🧠 Learnings (2)
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/agents/validator/**/src/**/*.rs : Maintain validator agent Rust sources under rust/main/agents/validator
Applied to files:
rust/main/agents/validator/src/validator.rs
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}/**/src/**/*.rs : Keep chain support implementations within rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}
Applied to files:
rust/main/agents/validator/src/validator.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (57)
- GitHub Check: yarn-test
- GitHub Check: cli-evm-e2e-matrix (warp-check-3)
- GitHub Check: cli-evm-e2e-matrix (warp-rebalancer)
- GitHub Check: cli-evm-e2e-matrix (warp-extend-config)
- GitHub Check: cli-evm-e2e-matrix (warp-deploy-2)
- GitHub Check: cli-evm-e2e-matrix (warp-read)
- GitHub Check: cli-evm-e2e-matrix (warp-init)
- GitHub Check: cli-evm-e2e-matrix (warp-extend-recovery)
- GitHub Check: cli-evm-e2e-matrix (warp-deploy-1)
- GitHub Check: cli-evm-e2e-matrix (warp-check-5)
- GitHub Check: cli-evm-e2e-matrix (warp-send)
- GitHub Check: cli-evm-e2e-matrix (warp-bridge-1)
- GitHub Check: cli-evm-e2e-matrix (warp-check-4)
- GitHub Check: cli-evm-e2e-matrix (warp-check-2)
- GitHub Check: cli-evm-e2e-matrix (warp-extend-basic)
- GitHub Check: cli-evm-e2e-matrix (relay)
- GitHub Check: cli-evm-e2e-matrix (warp-bridge-2)
- GitHub Check: cli-evm-e2e-matrix (warp-apply-submitters)
- GitHub Check: cli-evm-e2e-matrix (core-init)
- GitHub Check: cli-evm-e2e-matrix (core-read)
- GitHub Check: cli-evm-e2e-matrix (warp-apply-1)
- GitHub Check: cli-evm-e2e-matrix (warp-check-1)
- GitHub Check: cli-evm-e2e-matrix (warp-apply-2)
- GitHub Check: cli-evm-e2e-matrix (warp-apply-ism-updates)
- GitHub Check: cli-evm-e2e-matrix (core-deploy)
- GitHub Check: cli-evm-e2e-matrix (core-apply)
- GitHub Check: cli-evm-e2e-matrix (core-check)
- GitHub Check: env-test-matrix (testnet4, sepolia, core)
- GitHub Check: env-test-matrix (mainnet3, optimism, core)
- GitHub Check: env-test-matrix (mainnet3, inevm, igp)
- GitHub Check: env-test-matrix (mainnet3, optimism, igp)
- GitHub Check: env-test-matrix (mainnet3, arbitrum, igp)
- GitHub Check: env-test-matrix (mainnet3, ethereum, igp)
- GitHub Check: env-test-matrix (mainnet3, inevm, core)
- GitHub Check: env-test-matrix (mainnet3, arbitrum, core)
- GitHub Check: env-test-matrix (mainnet3, ethereum, core)
- GitHub Check: cli-cross-chain-e2e-matrix (warp-deploy)
- GitHub Check: cli-cosmos-e2e-matrix (core-deploy)
- GitHub Check: cli-cross-chain-e2e-matrix (warp-apply)
- GitHub Check: cli-cosmos-e2e-matrix (warp-deploy)
- GitHub Check: cli-cosmos-e2e-matrix (warp-read)
- GitHub Check: cosmos-sdk-e2e-run
- GitHub Check: cli-cosmos-e2e-matrix (core-read)
- GitHub Check: cli-cosmos-e2e-matrix (core-apply)
- GitHub Check: cli-radix-e2e-matrix (core-deploy)
- GitHub Check: cli-cosmos-e2e-matrix (core-check)
- GitHub Check: cli-install-test-run
- GitHub Check: build-and-push-to-gcr
- GitHub Check: e2e-matrix (radix)
- GitHub Check: e2e-matrix (sealevel)
- GitHub Check: e2e-matrix (starknet)
- GitHub Check: e2e-matrix (evm)
- GitHub Check: e2e-matrix (cosmwasm)
- GitHub Check: e2e-matrix (cosmosnative)
- GitHub Check: lint-rs
- GitHub Check: lander-coverage
- GitHub Check: test-rs
🔇 Additional comments (4)
rust/main/agents/validator/src/validator.rs (4)
25-28: Imports look solid.The addition of
RPC_RETRY_SLEEP_DURATIONis what you need for the retry backoff. Everything's in order here.
296-308: Good improvement to error handling.Adding proper logging and retry behavior when fetching the merkle tree hook count makes the startup more resilient. The flow here makes sense—you're giving it time to recover before trying again.
310-323: Retry invocation looks good, fail-fast behavior is appropriate.The control flow here does what you want—try multiple times to get the merkle tree hook sync going, and if it doesn't work after all attempts, bail out rather than run in a broken state. That's the right call for startup.
40-40: Retry count is appropriate—no changes needed.The 10 attempts with 2-second intervals between retries gives roughly 18 seconds maximum wait time at startup, which provides solid resilience without being excessive. The implementation correctly uses the shared
RPC_RETRY_SLEEP_DURATIONconstant for consistency. This strikes a good balance and doesn't require adjustment.
…e-monorepo into jeff/build-cursor
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
rust/main/agents/validator/src/validator.rs (1)
335-364: Retry logic is solid, with room for minor polish.The implementation correctly handles retries, logs appropriately, and manages critical error state. A couple of optional refinements if you're feeling particular:
- Attempt numbering: Logging
attempt_count = ishows 0-9 instead of 1-10. Considerattempt_count = i + 1for clarity.- Unnecessary sleep: Line 350 sleeps even after the last failed attempt. Moving the sleep to the top of the loop or adding a check (
if i < attempts - 1) would tighten things up, though the current approach is harmless.These are nitpicks—the logic itself does what it needs to do.
Apply this diff if you want 1-indexed attempt counts:
Err(err) => { error!( ?err, domain = self.origin_chain.name(), - attempt_count = i, + attempt_count = i + 1, "Failed to run merkle tree hook sync" ); sleep(RPC_RETRY_SLEEP_DURATION).await; continue; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rust/main/agents/validator/src/validator.rs(4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
rust/main/agents/validator/**/src/**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
Maintain validator agent Rust sources under rust/main/agents/validator
Files:
rust/main/agents/validator/src/validator.rs
rust/main/**/src/**/*.rs
📄 CodeRabbit inference engine (CLAUDE.md)
Run cargo clippy for Rust code linting
Files:
rust/main/agents/validator/src/validator.rs
🧠 Learnings (2)
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/agents/validator/**/src/**/*.rs : Maintain validator agent Rust sources under rust/main/agents/validator
Applied to files:
rust/main/agents/validator/src/validator.rs
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}/**/src/**/*.rs : Keep chain support implementations within rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}
Applied to files:
rust/main/agents/validator/src/validator.rs
🔇 Additional comments (4)
rust/main/agents/validator/src/validator.rs (4)
25-27: Import looks good.The
RPC_RETRY_SLEEP_DURATIONconstant is properly imported and used for pacing retry attempts. Clean and straightforward.
40-40: Reasonable retry limit for startup.Ten attempts is a solid middle ground—enough to handle transient failures without hanging startup indefinitely. If you find yourself needing more flexibility down the road, this could move into settings, but for now it does the job.
296-299: Verify indefinite retry behavior is intended.Unlike the new retry logic below (which has a max attempt limit), this error path will loop forever if
count()keeps failing. That's probably fine if you're expecting eventual consistency, but if the RPC is permanently misconfigured or the contract is broken, the validator just sits here spinning its wheels.Worth confirming this matches your operational intent—do you want the validator to stay stuck here, or should there be some eventual bailout with critical error state?
310-323: Clean integration of retry logic.The error path here properly aborts startup when all retry attempts are exhausted, which is the right call—no point running a validator that can't sync its merkle tree. Checkpoint submitters only start after successful sync initialization, maintaining proper sequencing.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #7342 +/- ##
============================
============================
🚀 New features to boost your workflow:
|
Description
Related issues
Summary by CodeRabbit