feat(kms-connector): improved nonce manager #1112
Conversation
Pull Request Overview
This PR introduces a custom nonce management system (ZamaNonceManager) to replace the existing CachedNonceManager. The new implementation provides thread-safe, gap-filling nonce allocation with support for releasing and confirming nonces, enabling better handling of stuck transactions and concurrent nonce requests.
Key changes:
- Implements `ZamaNonceManager` with support for nonce gaps, locking, and release/confirm operations
- Replaces `CachedNonceManager` with `ZamaNonceManager` in `NonceManagedProvider`
- Adds granular error handling to distinguish nonce errors from other transaction failures
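The release/confirm flow can be sketched with a toy, std-only gap-filling manager. This is a minimal illustration under stated assumptions, not the actual `ZamaNonceManager` (which uses `dashmap` and alloy types); `GapFillingNonceManager`, `acquire`, `release`, and `confirm` are hypothetical names:

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

/// Hypothetical sketch of a gap-filling nonce manager (illustrative only).
struct GapFillingNonceManager {
    inner: Mutex<Inner>,
}

struct Inner {
    next: u64,               // next fresh nonce to hand out
    released: BTreeSet<u64>, // nonces given back by failed txs, free to reuse
}

impl GapFillingNonceManager {
    fn new(start: u64) -> Self {
        Self {
            inner: Mutex::new(Inner { next: start, released: BTreeSet::new() }),
        }
    }

    /// Hand out the lowest released nonce first (filling gaps), else a fresh one.
    fn acquire(&self) -> u64 {
        let mut g = self.inner.lock().unwrap();
        if let Some(n) = g.released.pop_first() {
            n
        } else {
            let n = g.next;
            g.next += 1;
            n
        }
    }

    /// A failed tx gives its nonce back for reuse.
    fn release(&self, nonce: u64) {
        self.inner.lock().unwrap().released.insert(nonce);
    }

    /// A mined tx confirms its nonce; in this sketch, confirmation is a no-op
    /// because a confirmed nonce is simply never put back into `released`.
    fn confirm(&self, _nonce: u64) {}
}

fn main() {
    let m = GapFillingNonceManager::new(10);
    let a = m.acquire(); // fresh nonce 10
    let b = m.acquire(); // fresh nonce 11
    m.release(a);        // tx with nonce 10 failed
    let c = m.acquire(); // gap-filled: 10 is handed out again
    m.confirm(b);
    println!("{} {} {}", a, b, c); // 10 11 10
}
```

Handing out the lowest released nonce first is what fills gaps: a failed tx's nonce is reused before any fresh one, so the on-chain nonce sequence stays contiguous.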
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| kms-connector/crates/utils/src/provider/nonce_manager.rs | New module implementing the custom nonce manager with gap-filling logic and comprehensive test coverage |
| kms-connector/crates/utils/src/provider/mod.rs | Integration of ZamaNonceManager, removal of CachedNonceManager, and addition of nonce error detection |
| kms-connector/crates/utils/Cargo.toml | Added dashmap dependency and anvil testing features for alloy |
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
kms-connector/crates/utils/src/provider/mod.rs:39
- The documentation comment is outdated. The provider no longer "recovers its nonce manager on error" (which was the old behavior of resetting the nonce manager). Instead, it now uses a sophisticated nonce management system that releases or confirms nonces based on transaction outcomes.
Suggested update:
/// A wrapper around an `alloy` provider that manages transaction nonces intelligently.
///
/// This provider tracks nonces across transaction lifecycles, releasing failed nonces for
/// reuse and confirming successful ones. It fills gaps in the nonce sequence when possible.
///
/// Note that the provider given by the user must not have nonce management enabled, as this
/// is done by the `NonceManagedProvider` itself.
/// Users can use the default `FillersWithoutNonceManagement` to create a provider.
🧪 CI Insights: Here's what we observed from your CI run for c3d7b59. 🟢 All jobs passed!
dartdart26 left a comment:
Looks good!
I've left some comments that you may or may not address now; they're not critical.
Also, maybe you can elaborate on your concern about the manager not syncing often with the network: is it an actual concern?
I was also wondering: does getting occasional nonce errors create any serious issue? Do we observe issues in prod?
/// A robust, in-memory nonce manager for a scalable transaction engine.
#[derive(Clone, Debug, Default)]
pub struct ZamaNonceManager {
Not a big fan of the name: maybe it should say what it does rather than being named after Zama.
Will try to come up with a better one
const ETH_INVALID_INPUT_RPC_ERROR_CODE: i64 = -32000;

/// Returns `true` if the RPC error is "nonce too low" or "already known", `false` otherwise.
fn is_nonce_too_low(error: &RpcError<TransportErrorKind>) -> bool {
I am a bit uneasy about this function: do we know for a fact that it handles all cases of "nonce too low"? Will all RPC nodes be consistent?
No, indeed. But I don't think we have another way for now...
There's only this spec, but it is too generic for what we want to achieve.
It is mentioned in the QuickNode docs, but that's not a guarantee that all RPC nodes will do it this way.
Wdyt?
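For illustration, detection along these lines could look like the sketch below, which matches the -32000 error code together with message substrings. `looks_like_nonce_too_low` is a hypothetical standalone helper, not the PR's `is_nonce_too_low` (which takes an alloy `RpcError`), and as the discussion notes, the message strings are node-dependent, so this is an assumption rather than a guarantee:

```rust
const ETH_INVALID_INPUT_RPC_ERROR_CODE: i64 = -32000;

/// Hypothetical sketch: treat an RPC error as "nonce too low" when the code is
/// -32000 AND the message contains one of the substrings some nodes emit.
/// Other clients may word these errors differently, so this can miss cases.
fn looks_like_nonce_too_low(code: i64, message: &str) -> bool {
    let msg = message.to_lowercase();
    code == ETH_INVALID_INPUT_RPC_ERROR_CODE
        && (msg.contains("nonce too low") || msg.contains("already known"))
}

fn main() {
    println!("{}", looks_like_nonce_too_low(-32000, "nonce too low: next nonce 7"));
    println!("{}", looks_like_nonce_too_low(-32000, "insufficient funds"));
}
```

Requiring both the code and a message substring keeps false positives low, at the cost of the false negatives discussed above: an unrecognized wording falls through to the generic error path.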
To put it differently - what would happen if this function doesn't work as intended and misses nonce errors?
Then the nonce would be stored in `available_nonces` for later reuse instead of being confirmed. As the nonce is too low, this would cause many useless errors.
I would not call it a serious issue, but it would create (lots of) useless retries if the Gateway cache is full of txs. Here is the scenario (if it's not clear, let me know):
Let's say we send tx1, tx2, and tx3 with nonces 1, 2, and 3, and the RPC node returns OK for tx1 and tx2 but an error for tx3. Now, if for some reason the OK returned by the RPC node for nonce 2 was a "false positive" and, in reality, this nonce was not consumed, then when the tx-sender retries tx3, it will try to send it with nonce 3 instead of 2, and will do so forever, as the NonceManager never re-syncs with the RPC node.
I tried to stress-test the NonceManager on devnet to see if I could produce this scenario, but I did not manage to. In theory, this should not be possible, but we never know...
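The false-positive scenario can be made concrete with a toy simulation (`ToyChain` and its fields are purely illustrative; a real node's txpool is more involved):

```rust
/// Toy chain that tracks the next expected nonce and accepts a tx only if
/// its nonce matches exactly. Used to demonstrate the "false positive" hazard.
struct ToyChain {
    expected: u64,
}

impl ToyChain {
    fn submit(&mut self, nonce: u64) -> Result<(), &'static str> {
        if nonce == self.expected {
            self.expected += 1;
            Ok(())
        } else if nonce < self.expected {
            Err("nonce too low")
        } else {
            Err("nonce too high")
        }
    }
}

fn main() {
    let mut chain = ToyChain { expected: 1 };
    // tx1 lands; the chain now expects nonce 2.
    assert!(chain.submit(1).is_ok());
    // Suppose the node reported OK for tx2 (nonce 2) but never actually
    // consumed it (the hypothetical false positive): `expected` stays at 2.
    // The local manager, believing nonce 2 is spent, retries tx3 with nonce 3:
    for _ in 0..3 {
        // Each retry fails, and since the manager never re-syncs with the
        // node, this repeats indefinitely.
        assert_eq!(chain.submit(3), Err("nonce too high"));
    }
    println!("stuck: chain still expects nonce {}", chain.expected); // 2
}
```

The loop never terminates in the real system because nothing ever resets the manager's view of nonce 2; only a re-sync (or a tx resubmitted with nonce 2) would unblock it.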
Is this PR stale or still ongoing?
Closing for now, as we don't currently observe nonce-management issues on mainnet, so the risk is not worth it. I'll reopen it if that changes.
Huge kudos to @nboisde for this one 🙏
Closes https://github.com/zama-ai/fhevm-internal/issues/672
Closes https://github.com/zama-ai/fhevm-internal/issues/671
Code changes
No longer rely on a `CachedNonceManager` from `alloy`, but on a custom one (also used in the `relayer`) that keeps track of which nonces were successfully used, which are free to be reused in case of failure, etc. This should reduce the number of `nonce too high/low` errors caught in our logs.
With the current setup, if the tx-sender gets a `nonce too high/low` error, it reinitializes the `CachedNonceManager` via an RPC call. The problem is that, in general, there are multiple of these errors, so each task handling a tx makes this RPC call, and the `CachedNonceManager` then provides the same nonce to all of these tasks (because of the RPC call), instead of an incremented one. With the new setup, no RPC call is made on error; the failed nonce is simply stored in a map to be reused by another tx later on.
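A toy illustration of the old failure mode (names are hypothetical; the real code goes through alloy's `CachedNonceManager`): every erroring task resets the cached nonce from the same RPC answer, so concurrent tasks end up holding the same nonce instead of incremented ones.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for `eth_getTransactionCount`: every caller gets the same answer
// until one of the pending txs is actually mined.
fn rpc_get_transaction_count() -> u64 {
    7
}

fn main() {
    let cached = AtomicU64::new(0);
    let mut handed_out = Vec::new();
    // Three tasks each hit a nonce error and recover, one after another:
    for _ in 0..3 {
        // Each task "re-inits" the cache from the RPC node...
        cached.store(rpc_get_transaction_count(), Ordering::SeqCst);
        // ...then takes "the next" nonce, which is now the same for everyone.
        handed_out.push(cached.fetch_add(1, Ordering::SeqCst));
    }
    println!("{:?}", handed_out); // all three tasks got nonce 7
}
```

The store from RPC clobbers the increment made by the previous task, which is exactly why multiple recovering tasks collide on one nonce; the new manager avoids this by never resetting from RPC on error.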
Risk
The main risk I see is that the nonce manager never re-syncs with the RPC node on error. So it could result in a deadlock if, for some reason, the nonce manager gets stuck with too-high nonces...