Skip to content

feat: make validator startup more resilient#7342

Merged
kamiyaa merged 5 commits intomainfrom
jeff/build-cursor
Nov 7, 2025
Merged

feat: make validator startup more resilient#7342
kamiyaa merged 5 commits intomainfrom
jeff/build-cursor

Conversation

@kamiyaa
Copy link
Copy Markdown
Contributor

@kamiyaa kamiyaa commented Nov 6, 2025

Description

  • retry multiple times before giving up for validator startup

Related issues

Summary by CodeRabbit

  • Bug Fixes
    • Added a retry-and-backoff mechanism when initializing the merkle tree hook to reduce startup failures.
    • Improved logging for initialization attempts and failures, including attempt counts and domain context.
    • Enhanced error handling to abort startup on repeated failures and to update critical-error metrics for faster detection.

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Nov 6, 2025

⚠️ No Changeset found

Latest commit: b7e47c0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 6, 2025

📝 Walkthrough

Walkthrough

The change adds a retry wrapper to initialize the merkle tree hook sync: a new constant and helper attempt starting the task up to N times with sleep/backoff, updates logging and critical-error metrics on failures, and replaces the previous direct spawn flow in the main run loop.

Changes

Cohort / File(s) Summary
Merkle Tree Hook Sync Retry Logic
rust/main/agents/validator/src/validator.rs
Adds CURSOR_INSTANTIATION_ATTEMPTS and try_n_times_to_run_merkle_tree_hook_sync to retry spawning merkle_tree_hook_sync up to N times using RPC_RETRY_SLEEP_DURATION; replaces direct task spawn in the main run loop, adds domain-aware logging and critical-error metric updates, and adjusts error/sleep control flow when counting the merkle tree hook.

Sequence Diagram

sequenceDiagram
    participant Main as Main Run Loop
    participant Counter as Count Merkle Tree Hook
    participant Retry as try_n_times_to_run_merkle_tree_hook_sync
    participant Spawner as spawn merkle_tree_hook_sync Task

    Main->>Counter: Request hook count
    alt Count succeeds
        Counter-->>Main: Return count
        Main->>Retry: Invoke retry wrapper
        loop attempts (1..N)
            Retry->>Spawner: Try spawn task
            alt Spawn succeeds
                Spawner-->>Retry: Task running
                Retry-->>Main: Success (break)
            else Spawn fails
                Spawner-->>Retry: Error
                Retry->>Retry: Log failure, mark critical-error metric
                Retry->>Retry: Sleep (RPC_RETRY_SLEEP_DURATION)
            end
        end
        alt All attempts exhausted
            Retry-->>Main: Failure after N attempts
            Main->>Main: Log critical error & abort run
        end
    else Count fails
        Counter-->>Main: Error
        Main->>Main: Log error & sleep, then retry loop
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Check correctness of attempt counting and sleep/backoff usage.
  • Verify spawn error handling doesn't leak tasks or state and updates critical-error metrics in all failure branches.
  • Confirm logging provides domain and attempt context for debugging.

Possibly related PRs

Suggested reviewers

  • ameten

Poem

A stubborn hook in murky bog,
We prod, we nap, we try like slog.
With backoff breaths and measured tries,
The sync will rise before it dies. 🌲⚙️

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description check ❓ Inconclusive The description covers the main purpose but lacks several template sections: drive-by changes, backward compatibility, and testing details are missing or incomplete. Add missing sections for drive-by changes, backward compatibility assessment, and testing methodology to complete the description template.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: adding retry logic to make validator startup more resilient, which aligns with the code changes introducing retry mechanisms.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch jeff/build-cursor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
rust/main/agents/validator/src/validator.rs (1)

335-364: Retry logic works, but consider capturing the last error for better debugging.

The retry mechanism does what it needs to—keeps trying until it succeeds or runs out of patience. The metrics handling is spot-on too. A few things that might make this even better:

  1. The final error message at lines 360-363 doesn't include the last error that was encountered. When debugging failures, you'll want to know what actually went wrong on the last attempt.
  2. The attempt_count starts at 0 (line 346), so logs show "attempt_count = 0" for the first try. Might be clearer to show "attempt 1 of 10" style.
  3. All errors get the same treatment—network hiccups and config problems both trigger the same retry. That's probably fine for this use case since most failures will be transient RPC issues.

Here's a diff to capture the last error for better debugging:

 async fn try_n_times_to_run_merkle_tree_hook_sync(
     &self,
     attempts: usize,
 ) -> eyre::Result<JoinHandle<()>> {
+    let mut last_err = None;
     for i in 0..attempts {
         let task = match self.run_merkle_tree_hook_sync().await {
             Ok(s) => s,
             Err(err) => {
                 tracing::error!(
                     ?err,
                     domain = self.origin_chain.name(),
                     attempt_count = i,
                     "Failed to run merkle tree hook sync"
                 );
+                last_err = Some(err);
                 sleep(RPC_RETRY_SLEEP_DURATION).await;
                 continue;
             }
         };
         self.chain_metrics
             .set_critical_error(self.origin_chain.name(), false);
         return Ok(task);
     }
     self.chain_metrics
         .set_critical_error(self.origin_chain.name(), true);
-    Err(eyre::eyre!(
-        "Failed to initialize merkle tree hook sync after {} attempts",
-        attempts
-    ))
+    let base_msg = format!(
+        "Failed to initialize merkle tree hook sync after {} attempts",
+        attempts
+    );
+    if let Some(err) = last_err {
+        Err(eyre::eyre!("{}: {:?}", base_msg, err))
+    } else {
+        Err(eyre::eyre!(base_msg))
+    }
 }

And if you want to make the attempt count more intuitive:

             Err(err) => {
                 tracing::error!(
                     ?err,
                     domain = self.origin_chain.name(),
-                    attempt_count = i,
+                    attempt_count = i + 1,
+                    total_attempts = attempts,
                     "Failed to run merkle tree hook sync"
                 );
                 last_err = Some(err);
                 sleep(RPC_RETRY_SLEEP_DURATION).await;
                 continue;
             }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b872a03 and e7b194d.

📒 Files selected for processing (1)
  • rust/main/agents/validator/src/validator.rs (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
rust/main/agents/validator/**/src/**/*.rs

📄 CodeRabbit inference engine (CLAUDE.md)

Maintain validator agent Rust sources under rust/main/agents/validator

Files:

  • rust/main/agents/validator/src/validator.rs
rust/main/**/src/**/*.rs

📄 CodeRabbit inference engine (CLAUDE.md)

Run cargo clippy for Rust code linting

Files:

  • rust/main/agents/validator/src/validator.rs
🧠 Learnings (2)
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/agents/validator/**/src/**/*.rs : Maintain validator agent Rust sources under rust/main/agents/validator

Applied to files:

  • rust/main/agents/validator/src/validator.rs
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}/**/src/**/*.rs : Keep chain support implementations within rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}

Applied to files:

  • rust/main/agents/validator/src/validator.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (57)
  • GitHub Check: yarn-test
  • GitHub Check: cli-evm-e2e-matrix (warp-check-3)
  • GitHub Check: cli-evm-e2e-matrix (warp-rebalancer)
  • GitHub Check: cli-evm-e2e-matrix (warp-extend-config)
  • GitHub Check: cli-evm-e2e-matrix (warp-deploy-2)
  • GitHub Check: cli-evm-e2e-matrix (warp-read)
  • GitHub Check: cli-evm-e2e-matrix (warp-init)
  • GitHub Check: cli-evm-e2e-matrix (warp-extend-recovery)
  • GitHub Check: cli-evm-e2e-matrix (warp-deploy-1)
  • GitHub Check: cli-evm-e2e-matrix (warp-check-5)
  • GitHub Check: cli-evm-e2e-matrix (warp-send)
  • GitHub Check: cli-evm-e2e-matrix (warp-bridge-1)
  • GitHub Check: cli-evm-e2e-matrix (warp-check-4)
  • GitHub Check: cli-evm-e2e-matrix (warp-check-2)
  • GitHub Check: cli-evm-e2e-matrix (warp-extend-basic)
  • GitHub Check: cli-evm-e2e-matrix (relay)
  • GitHub Check: cli-evm-e2e-matrix (warp-bridge-2)
  • GitHub Check: cli-evm-e2e-matrix (warp-apply-submitters)
  • GitHub Check: cli-evm-e2e-matrix (core-init)
  • GitHub Check: cli-evm-e2e-matrix (core-read)
  • GitHub Check: cli-evm-e2e-matrix (warp-apply-1)
  • GitHub Check: cli-evm-e2e-matrix (warp-check-1)
  • GitHub Check: cli-evm-e2e-matrix (warp-apply-2)
  • GitHub Check: cli-evm-e2e-matrix (warp-apply-ism-updates)
  • GitHub Check: cli-evm-e2e-matrix (core-deploy)
  • GitHub Check: cli-evm-e2e-matrix (core-apply)
  • GitHub Check: cli-evm-e2e-matrix (core-check)
  • GitHub Check: env-test-matrix (testnet4, sepolia, core)
  • GitHub Check: env-test-matrix (mainnet3, optimism, core)
  • GitHub Check: env-test-matrix (mainnet3, inevm, igp)
  • GitHub Check: env-test-matrix (mainnet3, optimism, igp)
  • GitHub Check: env-test-matrix (mainnet3, arbitrum, igp)
  • GitHub Check: env-test-matrix (mainnet3, ethereum, igp)
  • GitHub Check: env-test-matrix (mainnet3, inevm, core)
  • GitHub Check: env-test-matrix (mainnet3, arbitrum, core)
  • GitHub Check: env-test-matrix (mainnet3, ethereum, core)
  • GitHub Check: cli-cross-chain-e2e-matrix (warp-deploy)
  • GitHub Check: cli-cosmos-e2e-matrix (core-deploy)
  • GitHub Check: cli-cross-chain-e2e-matrix (warp-apply)
  • GitHub Check: cli-cosmos-e2e-matrix (warp-deploy)
  • GitHub Check: cli-cosmos-e2e-matrix (warp-read)
  • GitHub Check: cosmos-sdk-e2e-run
  • GitHub Check: cli-cosmos-e2e-matrix (core-read)
  • GitHub Check: cli-cosmos-e2e-matrix (core-apply)
  • GitHub Check: cli-radix-e2e-matrix (core-deploy)
  • GitHub Check: cli-cosmos-e2e-matrix (core-check)
  • GitHub Check: cli-install-test-run
  • GitHub Check: build-and-push-to-gcr
  • GitHub Check: e2e-matrix (radix)
  • GitHub Check: e2e-matrix (sealevel)
  • GitHub Check: e2e-matrix (starknet)
  • GitHub Check: e2e-matrix (evm)
  • GitHub Check: e2e-matrix (cosmwasm)
  • GitHub Check: e2e-matrix (cosmosnative)
  • GitHub Check: lint-rs
  • GitHub Check: lander-coverage
  • GitHub Check: test-rs
🔇 Additional comments (4)
rust/main/agents/validator/src/validator.rs (4)

25-28: Imports look solid.

The addition of RPC_RETRY_SLEEP_DURATION is what you need for the retry backoff. Everything's in order here.


296-308: Good improvement to error handling.

Adding proper logging and retry behavior when fetching the merkle tree hook count makes the startup more resilient. The flow here makes sense—you're giving it time to recover before trying again.


310-323: Retry invocation looks good, fail-fast behavior is appropriate.

The control flow here does what you want—try multiple times to get the merkle tree hook sync going, and if it doesn't work after all attempts, bail out rather than run in a broken state. That's the right call for startup.


40-40: Retry count is appropriate—no changes needed.

The 10 attempts with 2-second intervals between retries gives roughly 18 seconds maximum wait time at startup, which provides solid resilience without being excessive. The implementation correctly uses the shared RPC_RETRY_SLEEP_DURATION constant for consistency. This strikes a good balance and doesn't require adjustment.

Comment thread rust/main/agents/validator/src/validator.rs Outdated
Comment thread rust/main/agents/validator/src/validator.rs Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
rust/main/agents/validator/src/validator.rs (1)

335-364: Retry logic is solid, with room for minor polish.

The implementation correctly handles retries, logs appropriately, and manages critical error state. A couple of optional refinements if you're feeling particular:

  1. Attempt numbering: Logging attempt_count = i shows 0-9 instead of 1-10. Consider attempt_count = i + 1 for clarity.
  2. Unnecessary sleep: Line 350 sleeps even after the last failed attempt. Moving the sleep to the top of the loop or adding a check (if i < attempts - 1) would tighten things up, though the current approach is harmless.

These are nitpicks—the logic itself does what it needs to do.

Apply this diff if you want 1-indexed attempt counts:

                 Err(err) => {
                     error!(
                         ?err,
                         domain = self.origin_chain.name(),
-                        attempt_count = i,
+                        attempt_count = i + 1,
                         "Failed to run merkle tree hook sync"
                     );
                     sleep(RPC_RETRY_SLEEP_DURATION).await;
                     continue;
                 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7b194d and b7e47c0.

📒 Files selected for processing (1)
  • rust/main/agents/validator/src/validator.rs (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
rust/main/agents/validator/**/src/**/*.rs

📄 CodeRabbit inference engine (CLAUDE.md)

Maintain validator agent Rust sources under rust/main/agents/validator

Files:

  • rust/main/agents/validator/src/validator.rs
rust/main/**/src/**/*.rs

📄 CodeRabbit inference engine (CLAUDE.md)

Run cargo clippy for Rust code linting

Files:

  • rust/main/agents/validator/src/validator.rs
🧠 Learnings (2)
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/agents/validator/**/src/**/*.rs : Maintain validator agent Rust sources under rust/main/agents/validator

Applied to files:

  • rust/main/agents/validator/src/validator.rs
📚 Learning: 2025-09-02T18:44:06.598Z
Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}/**/src/**/*.rs : Keep chain support implementations within rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}

Applied to files:

  • rust/main/agents/validator/src/validator.rs
🔇 Additional comments (4)
rust/main/agents/validator/src/validator.rs (4)

25-27: Import looks good.

The RPC_RETRY_SLEEP_DURATION constant is properly imported and used for pacing retry attempts. Clean and straightforward.


40-40: Reasonable retry limit for startup.

Ten attempts is a solid middle ground—enough to handle transient failures without hanging startup indefinitely. If you find yourself needing more flexibility down the road, this could move into settings, but for now it does the job.


296-299: Verify indefinite retry behavior is intended.

Unlike the new retry logic below (which has a max attempt limit), this error path will loop forever if count() keeps failing. That's probably fine if you're expecting eventual consistency, but if the RPC is permanently misconfigured or the contract is broken, the validator just sits here spinning its wheels.

Worth confirming this matches your operational intent—do you want the validator to stay stuck here, or should there be some eventual bailout with critical error state?


310-323: Clean integration of retry logic.

The error path here properly aborts startup when all retry attempts are exhausted, which is the right call—no point running a validator that can't sync its merkle tree. Checkpoint submitters only start after successful sync initialization, maintaining proper sequencing.

@kamiyaa kamiyaa added this pull request to the merge queue Nov 7, 2025
Merged via the queue into main with commit a5c38ec Nov 7, 2025
97 of 102 checks passed
@kamiyaa kamiyaa deleted the jeff/build-cursor branch November 7, 2025 14:40
@github-project-automation github-project-automation Bot moved this from In Review to Done in Hyperlane Tasks Nov 7, 2025
@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (69a0106) to head (b7e47c0).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@     Coverage Diff      @@
##   main   #7342   +/-   ##
============================
============================
Components Coverage Δ
core ∅ <ø> (∅)
hooks ∅ <ø> (∅)
isms ∅ <ø> (∅)
token ∅ <ø> (∅)
middlewares ∅ <ø> (∅)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants