Skip to content

feat(kms-connector): improved nonce manager#1112

Closed
eudelins-zama wants to merge 2 commits intomainfrom
simon/feat/improved-nonce-manager
Closed

feat(kms-connector): improved nonce manager#1112
eudelins-zama wants to merge 2 commits intomainfrom
simon/feat/improved-nonce-manager

Conversation

@eudelins-zama
Copy link
Copy Markdown
Contributor

@eudelins-zama eudelins-zama commented Oct 21, 2025

Huge kudos to @nboisde for this one 🙏

Closes https://github.com/zama-ai/fhevm-internal/issues/672
Closes https://github.com/zama-ai/fhevm-internal/issues/671

Code changes

No longer rely on a CachedNonceManager from alloy, but from a custom one (also used in the relayer) that keeps track of which nonce were successfully used, which are free to be reused in case of failure etc.

This should reduce the number of nonce too high/low error caught in our logs.
With the current setup, if the tx-sender get a nonce too high/low error, it will reinit the CachedNonceManager via RPC call. The problem with that is that in general, there are multiple of these errors, so each task handling a tx will do this RPC call, and the CachedNonceManager willl provide the same nonce to all of these tasks (because of the RPC call), instead of an incremented one.
With this new setup, no RPC call is made on error, the failed nonce is just stored in a map to be reused by another tx later on.

Risk

The main risk I see is that there is that the nonce manager never re-sync with the RPC node on error. So it could result in a dead lock if for some reason, the nonce manager is stuck with too high nonces...

@eudelins-zama eudelins-zama requested a review from Copilot October 21, 2025 13:56
@eudelins-zama eudelins-zama self-assigned this Oct 21, 2025
@cla-bot cla-bot bot added the cla-signed label Oct 21, 2025
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a custom nonce management system (ZamaNonceManager) to replace the existing CachedNonceManager. The new implementation provides thread-safe, gap-filling nonce allocation with support for releasing and confirming nonces, enabling better handling of stuck transactions and concurrent nonce requests.

Key changes:

  • Implements ZamaNonceManager with support for nonce gaps, locking, and release/confirm operations
  • Replaces CachedNonceManager with ZamaNonceManager in NonceManagedProvider
  • Adds granular error handling to distinguish nonce errors from other transaction failures

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
kms-connector/crates/utils/src/provider/nonce_manager.rs New module implementing the custom nonce manager with gap-filling logic and comprehensive test coverage
kms-connector/crates/utils/src/provider/mod.rs Integration of ZamaNonceManager, removal of CachedNonceManager, and addition of nonce error detection
kms-connector/crates/utils/Cargo.toml Added dashmap dependency and anvil testing features for alloy

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@eudelins-zama eudelins-zama force-pushed the simon/feat/improved-nonce-manager branch 6 times, most recently from bb37691 to 345884c Compare October 24, 2025 12:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

kms-connector/crates/utils/src/provider/mod.rs:39

  • The documentation comment is outdated. The provider no longer "recovers its nonce manager on error" (which was the old behavior of resetting the nonce manager). Instead, it now uses a sophisticated nonce management system that releases or confirms nonces based on transaction outcomes.

Suggested update:

/// A wrapper around an `alloy` provider that manages transaction nonces intelligently.
///
/// This provider tracks nonces across transaction lifecycles, releasing failed nonces for
/// reuse and confirming successful ones. It fills gaps in the nonce sequence when possible.
/// 
/// Note that the provider given by the user must not have nonce management enabled, as this
/// is done by the `NonceManagedProvider` itself.
/// Users can use the default `FillersWithoutNonceManagement` to create a provider.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mergify
Copy link
Copy Markdown

mergify bot commented Dec 15, 2025

🧪 CI Insights

Here's what we observed from your CI run for c3d7b59.

🟢 All jobs passed!

But CI Insights is watching 👀

@eudelins-zama eudelins-zama force-pushed the simon/feat/improved-nonce-manager branch from 1cc941c to c3d7b59 Compare December 17, 2025 13:14
@eudelins-zama eudelins-zama marked this pull request as ready for review December 17, 2025 13:30
@eudelins-zama eudelins-zama requested review from a team and dartdart26 as code owners December 17, 2025 13:30
Copy link
Copy Markdown
Collaborator

@dartdart26 dartdart26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I've put some comments that you might address now or not, not critical.

Also, maybe you can elaborate on your concern about the manager not syncing often with the network - is it a concern?

Was wondering - does getting occasional nonce errors create any serious issue? Do we observe issues on prod?


/// A robust, in-memory nonce manager for a scalable transaction engine.
#[derive(Clone, Debug, Default)]
pub struct ZamaNonceManager {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of the name - maybe it should say what it does and not name it Zama.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will try to come up with a better one

const ETH_INVALID_INPUT_RPC_ERROR_CODE: i64 = -32000;

/// Returns `true` if the RPC error is "nonce too low" or "already known", `false` otherwise.
fn is_nonce_too_low(error: &RpcError<TransportErrorKind>) -> bool {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit uneasy about this function - do we know for a fact that it handles all cases of nonce too low, will all RPC nodes be consistent?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No indeed. But I don't think we have another way for now...

There's only this spec, but it too generic for what we want to achieve

It is mentioned in quicknode docs but it's not a guarantee that all RPC nodes will do it this way.

Wdyt?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put it differently - what would happen if this function doesn't work as intended and misses nonce errors?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, the nonce would be stored in the available_nonces for later reuse instead of being confirmed. As the nonce is too low, this will cause many useless errors

@eudelins-zama
Copy link
Copy Markdown
Contributor Author

Was wondering - does getting occasional nonce errors create any serious issue? Do we observe issues on prod?

I would not say serious issue. But it will create (lots of) useless retries if the Gateway cache is full of txs, because of this (if not clear let me know):

if the tx-sender get a nonce too high/low error, it will reinit the CachedNonceManager via RPC call. The problem with that is that in general, there are multiple of these errors, so each task handling a tx will do this RPC call, and the CachedNonceManager willl provide the same nonce to all of these tasks (because of the RPC call), instead of an incremented one.

@eudelins-zama
Copy link
Copy Markdown
Contributor Author

Also, maybe you can elaborate on your concern about the manager not syncing often with the network - is it a concern?

Let's say we send tx1, tx2 and tx3 with nonce 1, 2 and 3. RPC node returns OK for tx1 and tx2 and an error for tx3.

Now if for some reason the OK returned by the RPC node for nonce 2 was a "false positive" and that in reality, this nonce is not consumed, then when the tx-sender will retry tx3, it will try to send it with nonce 3 instead of 2, and this forever as the NonceManager never "re-sync" with the RPC node

I tried to stress tests the NonceManager on devnet to see if I could produce this scenario, but no, I did not manage to.

In theory, this should not be possible, but well, we never know...

@Eikix
Copy link
Copy Markdown
Contributor

Eikix commented Feb 12, 2026

Is this PR stale or still ongoing?

@eudelins-zama
Copy link
Copy Markdown
Contributor Author

Closing for now as we don't currently have issue management on mainnet, so the risk is not worth it.

I'll reopen it if that changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants