feat: make validator startup more resilient by kamiyaa · Pull Request #7342 · hyperlane-xyz/hyperlane-monorepo

kamiyaa · 2025-11-06T20:27:23Z

Description

Retry multiple times before giving up during validator startup.

Related issues

fixes https://linear.app/hyperlane-xyz/issue/ENG-2369/make-cursor-building-more-resillient

Summary by CodeRabbit

Bug Fixes
- Added a retry-and-backoff mechanism when initializing the merkle tree hook to reduce startup failures.
- Improved logging for initialization attempts and failures, including attempt counts and domain context.
- Enhanced error handling to abort startup on repeated failures and to update critical-error metrics for faster detection.

changeset-bot · 2025-11-06T20:27:27Z

⚠️ No Changeset found

Latest commit: b7e47c0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai · 2025-11-06T20:31:24Z

Walkthrough

The change adds a retry wrapper to initialize the merkle tree hook sync: a new constant and helper attempt starting the task up to N times with sleep/backoff, updates logging and critical-error metrics on failures, and replaces the previous direct spawn flow in the main run loop.
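The retry wrapper described above can be sketched roughly as follows. This is a simplified, synchronous stand-in rather than the code from the PR: `ChainMetrics`, `try_n_times`, and the `start_task` closure are hypothetical names, and `std::thread::sleep` takes the place of the async sleep on `RPC_RETRY_SLEEP_DURATION`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the validator's critical-error metric.
#[derive(Default)]
pub struct ChainMetrics {
    pub critical_error: bool,
}

/// Runs `start_task` up to `attempts` times, sleeping `delay` after each failure.
/// Clears the critical-error flag on success and sets it once attempts run out.
pub fn try_n_times<T, E: std::fmt::Debug>(
    metrics: &mut ChainMetrics,
    attempts: usize,
    delay: Duration,
    mut start_task: impl FnMut() -> Result<T, E>,
) -> Result<T, String> {
    for attempt in 1..=attempts {
        match start_task() {
            Ok(task) => {
                metrics.critical_error = false;
                return Ok(task);
            }
            Err(err) => {
                // Log the attempt number so repeated failures are easy to trace.
                eprintln!("attempt {attempt} of {attempts} failed: {err:?}");
                sleep(delay);
            }
        }
    }
    metrics.critical_error = true;
    Err(format!("failed to start task after {attempts} attempts"))
}
```

A caller would pass the real task-starting operation as the closure and abort startup when the result is `Err`.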

Changes

Cohort / File(s)	Summary
Merkle Tree Hook Sync Retry Logic `rust/main/agents/validator/src/validator.rs`	Adds `CURSOR_INSTANTIATION_ATTEMPTS` and `try_n_times_to_run_merkle_tree_hook_sync` to retry spawning `merkle_tree_hook_sync` up to N times using `RPC_RETRY_SLEEP_DURATION`; replaces direct task spawn in the main run loop, adds domain-aware logging and critical-error metric updates, and adjusts error/sleep control flow when counting the merkle tree hook.

Sequence Diagram

sequenceDiagram
    participant Main as Main Run Loop
    participant Counter as Count Merkle Tree Hook
    participant Retry as try_n_times_to_run_merkle_tree_hook_sync
    participant Spawner as spawn merkle_tree_hook_sync Task

    Main->>Counter: Request hook count
    alt Count succeeds
        Counter-->>Main: Return count
        Main->>Retry: Invoke retry wrapper
        loop attempts (1..N)
            Retry->>Spawner: Try spawn task
            alt Spawn succeeds
                Spawner-->>Retry: Task running
                Retry-->>Main: Success (break)
            else Spawn fails
                Spawner-->>Retry: Error
                Retry->>Retry: Log failure, mark critical-error metric
                Retry->>Retry: Sleep (RPC_RETRY_SLEEP_DURATION)
            end
        end
        alt All attempts exhausted
            Retry-->>Main: Failure after N attempts
            Main->>Main: Log critical error & abort run
        end
    else Count fails
        Counter-->>Main: Error
        Main->>Main: Log error & sleep, then retry loop
    end
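Read as code, the diagram has two phases: an unbounded wait for the hook count, then a bounded number of spawn attempts. The sketch below follows the diagram only and is not taken from `validator.rs`; `start_validator`, `StartupOutcome`, and both closures are hypothetical, and the blocking sleep stands in for the async one.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
pub enum StartupOutcome {
    Running,
    Aborted,
}

pub fn start_validator(
    delay: Duration,
    max_spawn_attempts: usize,
    mut count_hook: impl FnMut() -> Result<u32, String>,
    mut spawn_sync: impl FnMut() -> Result<(), String>,
) -> StartupOutcome {
    // Phase 1: keep asking for the hook count until the call succeeds.
    // This loop has no attempt limit, matching the "Count fails" branch.
    loop {
        match count_hook() {
            Ok(_count) => break,
            Err(err) => {
                eprintln!("count failed: {err}");
                sleep(delay);
            }
        }
    }
    // Phase 2: bounded attempts to spawn the sync task.
    for attempt in 1..=max_spawn_attempts {
        match spawn_sync() {
            Ok(()) => return StartupOutcome::Running,
            Err(err) => {
                eprintln!("spawn attempt {attempt} failed: {err}");
                sleep(delay);
            }
        }
    }
    // All attempts exhausted: the caller logs a critical error and stops.
    StartupOutcome::Aborted
}
```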

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

- Check correctness of attempt counting and sleep/backoff usage.
- Verify spawn error handling doesn't leak tasks or state and updates critical-error metrics in all failure branches.
- Confirm logging provides domain and attempt context for debugging.

Possibly related PRs

feat: try to build cursor multiple times before giving up #7225 — Appears to add the same retry pattern for cursor/merkle-tree-hook instantiation, including a similar helper and constants.

Suggested reviewers

ameten

Poem

A stubborn hook in murky bog,
We prod, we nap, we try like slog.
With backoff breaths and measured tries,
The sync will rise before it dies. 🌲⚙️

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	❓ Inconclusive	The description covers the main purpose but lacks several template sections: drive-by changes, backward compatibility, and testing details are missing or incomplete.	Add missing sections for drive-by changes, backward compatibility assessment, and testing methodology to complete the description template.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the main change: adding retry logic to make validator startup more resilient, which aligns with the code changes introducing retry mechanisms.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

rust/main/agents/validator/src/validator.rs (1)

335-364: Retry logic works, but consider capturing the last error for better debugging.

The retry mechanism does what it needs to—keeps trying until it succeeds or runs out of patience. The metrics handling is spot-on too. A few things that might make this even better:

- The final error message at lines 360-363 doesn't include the last error that was encountered. When debugging failures, you'll want to know what actually went wrong on the last attempt.
- The attempt_count starts at 0 (line 346), so logs show "attempt_count = 0" for the first try. Might be clearer to show "attempt 1 of 10" style.
- All errors get the same treatment: network hiccups and config problems both trigger the same retry. That's probably fine for this use case since most failures will be transient RPC issues.

Here's a diff to capture the last error for better debugging:

 async fn try_n_times_to_run_merkle_tree_hook_sync(
     &self,
     attempts: usize,
 ) -> eyre::Result<JoinHandle<()>> {
+    let mut last_err = None;
     for i in 0..attempts {
         let task = match self.run_merkle_tree_hook_sync().await {
             Ok(s) => s,
             Err(err) => {
                 tracing::error!(
                     ?err,
                     domain = self.origin_chain.name(),
                     attempt_count = i,
                     "Failed to run merkle tree hook sync"
                 );
+                last_err = Some(err);
                 sleep(RPC_RETRY_SLEEP_DURATION).await;
                 continue;
             }
         };
         self.chain_metrics
             .set_critical_error(self.origin_chain.name(), false);
         return Ok(task);
     }
     self.chain_metrics
         .set_critical_error(self.origin_chain.name(), true);
-    Err(eyre::eyre!(
-        "Failed to initialize merkle tree hook sync after {} attempts",
-        attempts
-    ))
+    let base_msg = format!(
+        "Failed to initialize merkle tree hook sync after {} attempts",
+        attempts
+    );
+    if let Some(err) = last_err {
+        Err(eyre::eyre!("{}: {:?}", base_msg, err))
+    } else {
+        Err(eyre::eyre!(base_msg))
+    }
 }

And if you want to make the attempt count more intuitive:

             Err(err) => {
                 tracing::error!(
                     ?err,
                     domain = self.origin_chain.name(),
-                    attempt_count = i,
+                    attempt_count = i + 1,
+                    total_attempts = attempts,
                     "Failed to run merkle tree hook sync"
                 );
                 last_err = Some(err);
                 sleep(RPC_RETRY_SLEEP_DURATION).await;
                 continue;
             }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b872a03 and e7b194d.

📒 Files selected for processing (1)

rust/main/agents/validator/src/validator.rs (4 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

rust/main/agents/validator/**/src/**/*.rs

📄 CodeRabbit inference engine (CLAUDE.md)

Maintain validator agent Rust sources under rust/main/agents/validator

Files:

rust/main/agents/validator/src/validator.rs

rust/main/**/src/**/*.rs

📄 CodeRabbit inference engine (CLAUDE.md)

Run cargo clippy for Rust code linting

Files:

rust/main/agents/validator/src/validator.rs

🧠 Learnings (2)

📚 Learning: 2025-09-02T18:44:06.598Z

Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/agents/validator/**/src/**/*.rs : Maintain validator agent Rust sources under rust/main/agents/validator

Applied to files:

rust/main/agents/validator/src/validator.rs

📚 Learning: 2025-09-02T18:44:06.598Z

Learnt from: CR
Repo: hyperlane-xyz/hyperlane-monorepo PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-09-02T18:44:06.598Z
Learning: Applies to rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}/**/src/**/*.rs : Keep chain support implementations within rust/main/chains/{hyperlane-ethereum,hyperlane-cosmos,hyperlane-sealevel,hyperlane-fuel}

Applied to files:

rust/main/agents/validator/src/validator.rs

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (57)

GitHub Check: yarn-test
GitHub Check: cli-evm-e2e-matrix (warp-check-3)
GitHub Check: cli-evm-e2e-matrix (warp-rebalancer)
GitHub Check: cli-evm-e2e-matrix (warp-extend-config)
GitHub Check: cli-evm-e2e-matrix (warp-deploy-2)
GitHub Check: cli-evm-e2e-matrix (warp-read)
GitHub Check: cli-evm-e2e-matrix (warp-init)
GitHub Check: cli-evm-e2e-matrix (warp-extend-recovery)
GitHub Check: cli-evm-e2e-matrix (warp-deploy-1)
GitHub Check: cli-evm-e2e-matrix (warp-check-5)
GitHub Check: cli-evm-e2e-matrix (warp-send)
GitHub Check: cli-evm-e2e-matrix (warp-bridge-1)
GitHub Check: cli-evm-e2e-matrix (warp-check-4)
GitHub Check: cli-evm-e2e-matrix (warp-check-2)
GitHub Check: cli-evm-e2e-matrix (warp-extend-basic)
GitHub Check: cli-evm-e2e-matrix (relay)
GitHub Check: cli-evm-e2e-matrix (warp-bridge-2)
GitHub Check: cli-evm-e2e-matrix (warp-apply-submitters)
GitHub Check: cli-evm-e2e-matrix (core-init)
GitHub Check: cli-evm-e2e-matrix (core-read)
GitHub Check: cli-evm-e2e-matrix (warp-apply-1)
GitHub Check: cli-evm-e2e-matrix (warp-check-1)
GitHub Check: cli-evm-e2e-matrix (warp-apply-2)
GitHub Check: cli-evm-e2e-matrix (warp-apply-ism-updates)
GitHub Check: cli-evm-e2e-matrix (core-deploy)
GitHub Check: cli-evm-e2e-matrix (core-apply)
GitHub Check: cli-evm-e2e-matrix (core-check)
GitHub Check: env-test-matrix (testnet4, sepolia, core)
GitHub Check: env-test-matrix (mainnet3, optimism, core)
GitHub Check: env-test-matrix (mainnet3, inevm, igp)
GitHub Check: env-test-matrix (mainnet3, optimism, igp)
GitHub Check: env-test-matrix (mainnet3, arbitrum, igp)
GitHub Check: env-test-matrix (mainnet3, ethereum, igp)
GitHub Check: env-test-matrix (mainnet3, inevm, core)
GitHub Check: env-test-matrix (mainnet3, arbitrum, core)
GitHub Check: env-test-matrix (mainnet3, ethereum, core)
GitHub Check: cli-cross-chain-e2e-matrix (warp-deploy)
GitHub Check: cli-cosmos-e2e-matrix (core-deploy)
GitHub Check: cli-cross-chain-e2e-matrix (warp-apply)
GitHub Check: cli-cosmos-e2e-matrix (warp-deploy)
GitHub Check: cli-cosmos-e2e-matrix (warp-read)
GitHub Check: cosmos-sdk-e2e-run
GitHub Check: cli-cosmos-e2e-matrix (core-read)
GitHub Check: cli-cosmos-e2e-matrix (core-apply)
GitHub Check: cli-radix-e2e-matrix (core-deploy)
GitHub Check: cli-cosmos-e2e-matrix (core-check)
GitHub Check: cli-install-test-run
GitHub Check: build-and-push-to-gcr
GitHub Check: e2e-matrix (radix)
GitHub Check: e2e-matrix (sealevel)
GitHub Check: e2e-matrix (starknet)
GitHub Check: e2e-matrix (evm)
GitHub Check: e2e-matrix (cosmwasm)
GitHub Check: e2e-matrix (cosmosnative)
GitHub Check: lint-rs
GitHub Check: lander-coverage
GitHub Check: test-rs

🔇 Additional comments (4)

rust/main/agents/validator/src/validator.rs (4)

25-28: Imports look solid.

The addition of RPC_RETRY_SLEEP_DURATION is what you need for the retry backoff. Everything's in order here.

296-308: Good improvement to error handling.

Adding proper logging and retry behavior when fetching the merkle tree hook count makes the startup more resilient. The flow here makes sense—you're giving it time to recover before trying again.

310-323: Retry invocation looks good, fail-fast behavior is appropriate.

The control flow here does what you want—try multiple times to get the merkle tree hook sync going, and if it doesn't work after all attempts, bail out rather than run in a broken state. That's the right call for startup.

40-40: Retry count is appropriate—no changes needed.

Ten attempts with 2-second intervals between retries give roughly 18 seconds of maximum wait time at startup (closer to 20 seconds if the loop also sleeps after the final failed attempt), which provides solid resilience without being excessive. The implementation correctly uses the shared RPC_RETRY_SLEEP_DURATION constant for consistency. This strikes a good balance and doesn't require adjustment.
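The wait-time estimate can be checked with a few lines of arithmetic. This is only a sketch: `worst_case_sleep` is a hypothetical helper, the 2-second delay is the figure quoted in the comment above rather than a value read from the code, and the count covers time spent sleeping, not time spent inside each failed call.

```rust
use std::time::Duration;

/// Worst-case time spent sleeping when every attempt fails.
/// `sleep_after_last` models whether the loop also sleeps after the final failure.
pub fn worst_case_sleep(attempts: u32, delay: Duration, sleep_after_last: bool) -> Duration {
    let sleeps = if sleep_after_last {
        attempts
    } else {
        // One sleep between each pair of consecutive attempts.
        attempts.saturating_sub(1)
    };
    delay * sleeps
}
```

With 10 attempts and a 2-second delay this gives 18 seconds when the loop sleeps only between attempts, and 20 seconds when it also sleeps after the last failure.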

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

rust/main/agents/validator/src/validator.rs (1)

335-364: Retry logic is solid, with room for minor polish.

The implementation correctly handles retries, logs appropriately, and manages critical error state. A couple of optional refinements if you're feeling particular:

Attempt numbering: Logging attempt_count = i shows 0-9 instead of 1-10. Consider attempt_count = i + 1 for clarity.

Unnecessary sleep: Line 350 sleeps even after the last failed attempt. Moving the sleep to the top of the loop or adding a check (if i < attempts - 1) would tighten things up, though the current approach is harmless.

These are nitpicks—the logic itself does what it needs to do.

Apply this diff if you want 1-indexed attempt counts:
                 Err(err) => {
                     error!(
                         ?err,
                         domain = self.origin_chain.name(),
-                        attempt_count = i,
+                        attempt_count = i + 1,
                         "Failed to run merkle tree hook sync"
                     );
                     sleep(RPC_RETRY_SLEEP_DURATION).await;
                     continue;
                 }
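
The "unnecessary sleep" point above could be handled by sleeping before every attempt except the first, so nothing is waited out after the final failure. The sketch below is one way to do that and is not code from the PR; `retry_sleeping_between` is a hypothetical name, and the `pause` parameter is an assumption added so the number of sleeps can be observed in a test.

```rust
use std::time::Duration;

/// Retries `op` up to `attempts` times, pausing only between attempts.
/// Returns the first success, or the last error (None if `attempts` is 0).
pub fn retry_sleeping_between<T, E>(
    attempts: usize,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut pause: impl FnMut(Duration),
) -> Result<T, Option<E>> {
    let mut last_err = None;
    for i in 0..attempts {
        // Skip the pause before the first attempt; there is none after the last.
        if i > 0 {
            pause(delay);
        }
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err)
}
```

Passing `std::thread::sleep` as `pause` gives a blocking version; an async variant would await the runtime's sleep in the same position.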

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e7b194d and b7e47c0.

📒 Files selected for processing (1)

rust/main/agents/validator/src/validator.rs (4 hunks)

🔇 Additional comments (4)

rust/main/agents/validator/src/validator.rs (4)

25-27: Import looks good.

The RPC_RETRY_SLEEP_DURATION constant is properly imported and used for pacing retry attempts. Clean and straightforward.

40-40: Reasonable retry limit for startup.

Ten attempts is a solid middle ground—enough to handle transient failures without hanging startup indefinitely. If you find yourself needing more flexibility down the road, this could move into settings, but for now it does the job.

296-299: Verify indefinite retry behavior is intended.

Unlike the new retry logic below (which has a max attempt limit), this error path will loop forever if count() keeps failing. That's probably fine if you're expecting eventual consistency, but if the RPC is permanently misconfigured or the contract is broken, the validator just sits here spinning its wheels.

Worth confirming this matches your operational intent—do you want the validator to stay stuck here, or should there be some eventual bailout with critical error state?
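
If an eventual bailout were wanted for the count path, one option is a time budget instead of an attempt limit. The sketch below illustrates that idea only; nothing in this thread indicates the PR changed the unbounded loop. `retry_until_deadline` and its parameters are hypothetical names.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Retries `op` until it succeeds or `max_wait` has elapsed.
/// Returns the last error once the deadline has passed.
pub fn retry_until_deadline<T, E>(
    max_wait: Duration,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let deadline = Instant::now() + max_wait;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Give up only after a failure that lands past the deadline.
            Err(err) if Instant::now() >= deadline => return Err(err),
            Err(_) => sleep(delay),
        }
    }
}
```

On `Err`, the caller could set the critical-error state and stop, as the bounded retry for the sync task does.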

310-323: Clean integration of retry logic.

The error path here properly aborts startup when all retry attempts are exhausted, which is the right call—no point running a validator that can't sync its merkle tree. Checkpoint submitters only start after successful sync initialization, maintaining proper sequencing.

codecov · 2025-11-07T14:41:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (69a0106) to head (b7e47c0).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@     Coverage Diff      @@
##   main   #7342   +/-   ##
============================
============================

Components	Coverage Δ
core	`∅ <ø> (∅)`
hooks	`∅ <ø> (∅)`
isms	`∅ <ø> (∅)`
token	`∅ <ø> (∅)`
middlewares	`∅ <ø> (∅)`


feat: make validator startup more resilient

e7b194d

kamiyaa requested review from ameten and yjamin as code owners November 6, 2025 20:27

github-project-automation Bot added this to Hyperlane Tasks Nov 6, 2025

github-project-automation Bot moved this to In Review in Hyperlane Tasks Nov 6, 2025

coderabbitai Bot reviewed Nov 6, 2025

Merge branch 'main' into jeff/build-cursor

4e48c70

ameten reviewed Nov 7, 2025

Comment thread rust/main/agents/validator/src/validator.rs Outdated

ameten reviewed Nov 7, 2025

Comment thread rust/main/agents/validator/src/validator.rs Outdated

kamiyaa added 3 commits November 7, 2025 08:29

feat: remove tracing

38c3bfb

Merge branch 'jeff/build-cursor' of github.com:hyperlane-xyz/hyperlan…

35387c4

…e-monorepo into jeff/build-cursor

Merge branch 'main' into jeff/build-cursor

b7e47c0

coderabbitai Bot reviewed Nov 7, 2025

ameten approved these changes Nov 7, 2025

kamiyaa added this pull request to the merge queue Nov 7, 2025

Merged via the queue into main with commit a5c38ec Nov 7, 2025
97 of 102 checks passed

kamiyaa deleted the jeff/build-cursor branch November 7, 2025 14:40

github-project-automation Bot moved this from In Review to Done in Hyperlane Tasks Nov 7, 2025

github-actions Bot mentioned this pull request Nov 7, 2025

chore: release agents v1.7.0 #7345

Merged

yorhodes pushed a commit that referenced this pull request Nov 14, 2025

feat: make validator startup more resilient (#7342)

9f64209

kamiyaa merged 5 commits into main from jeff/build-cursor