Make IP resolution lazy and self-contained; remove ResolveIps step#12050
Make IP resolution lazy and self-contained; remove ResolveIps step#12050asdacap wants to merge 4 commits into
Conversation
Collapse IIPResolver to a single cached Resolve() returning a NethermindIp(LocalIp, ExternalIp) record. Resolution runs once (lock-guarded for concurrent callers) and no longer mutates INetworkConfig; explicit config overrides are still honored. Migrate all consumers (SetupKeyStore, RlpxHost, DiscoveryApp, DiscoveryConnectionsPool, DiscoveryV5App, NodeRecordProvider, HealthChecksWebhookInfo, Optimism/Shutter plugins) to Resolve(), and delete the now-redundant ResolveIps init step and its step dependencies. ExternalIp/LocalIp config items are documented as override-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Nest NethermindIp as an inner record of IIPResolver (was a top-level type). - Make OptimismPlugin.InitRpcModules async and await Resolve() instead of blocking. - Resolve the IP inside ShutterP2P.Start (async) by passing IIPResolver through ShutterApi/ShutterP2P, removing the blocking resolve in the plugin factory. - RlpxHost: store the resolved local IPAddress directly and drop the private LocalIp string round-trip; bind to it in Init(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace NodeRecord Current { get; } with ValueTask<NodeRecord> GetCurrentAsync(CancellationToken). Its only consumer, KademliaDiscv4Adapter (Ping and HandleEnrRequest), is already async, so the node record is now built by awaiting IIPResolver.Resolve() instead of blocking with GetAwaiter().GetResult().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Claude finished @asdacap's task in 6m 25s —— View job PR Review
SummaryThe refactor cleanly removes the brittle FindingsHigh1. 2. Medium3. Silent exception swallowing in 4. Blocking Low5. Dead The refactor goal is good and the non-blocking consumers ( |
|
|
||
| NodeRecord? _nodeRecord = null; | ||
| public NodeRecord Current => _nodeRecord ??= PrepareNodeRecord(); | ||
| public async ValueTask<NodeRecord> GetCurrentAsync(CancellationToken cancellationToken = default) |
There was a problem hiding this comment.
High — Race condition: ??= is not atomic for async ValueTask methods.
KademliaDiscv4Adapter.Ping() and HandleEnrRequest() are both called concurrently from incoming discovery message handlers. When multiple callers see _nodeRecord is null simultaneously, each will independently run PrepareNodeRecord — signing the ENR multiple times and ending up with different NodeRecord instances in flight. Only the last assignment wins, but callers that received an earlier result have a dangling NodeRecord that is no longer the canonical instance.
Compare with IPResolver, which uses Lock + _resolveTask to guard exactly this pattern. Apply the same approach:
private readonly Lock _lock = new();
private Task<NodeRecord>? _nodeRecordTask;
public ValueTask<NodeRecord> GetCurrentAsync(CancellationToken cancellationToken = default)
{
Task<NodeRecord>? task = Volatile.Read(ref _nodeRecordTask);
if (task is not null) return new ValueTask<NodeRecord>(task);
lock (_lock)
{
task = _nodeRecordTask ??= PrepareNodeRecord(cancellationToken);
}
return new ValueTask<NodeRecord>(task);
}|
|
||
| lock (_lock) | ||
| { | ||
| task = _resolveTask ??= ResolveCore(cancellationToken); |
There was a problem hiding this comment.
High — First caller's CancellationToken is permanently baked into _resolveTask.
_resolveTask ??= ResolveCore(cancellationToken) means whichever caller wins the lock owns the token used for all retries. If that first caller passes a short-lived or already-cancelled token, ResolveExternalIp will throw OperationCanceledException during a Task.Delay between retries, permanently faulting _resolveTask. Every subsequent Resolve() call then immediately gets a faulted ValueTask — the resolver is stuck with no ability to retry.
SetupKeyStore currently holds this distinction (it's the first caller in startup). If the node's init CancellationToken fires before all 5 retries complete, IP resolution is permanently broken for the lifetime of the process.
Fix: start resolution unconditionally with CancellationToken.None so the cached result isn't bound to a single caller's lifetime:
lock (_lock)
{
task = _resolveTask ??= ResolveCore(CancellationToken.None);
}Per-call cancellation for the await is unaffected — callers can still wrap their own await with a timeout. The CT passed to Resolve() becomes meaningful only for cache-miss callers who want to cancel waiting (not the underlying resolution), and that could be documented in the summary comment on IIPResolver.Resolve.
| catch (Exception) | ||
| { | ||
| LocalIp = IPAddress.Loopback; | ||
| localIp = IPAddress.Loopback; |
There was a problem hiding this comment.
Medium — Silent exception swallowing violates robustness rules.
The empty catch (Exception) falls back to IPAddress.Loopback without logging anything. Per the robustness rules: "Never swallow exceptions in an empty catch block — at minimum log the exception. Silent failures are the hardest to diagnose on a running node."
catch (Exception e)
{
if (_logger.IsWarn) _logger.Warn($"Could not resolve local IP, falling back to loopback: {e.Message}");
localIp = IPAddress.Loopback;
}| _legacyDiscoveryDb = legacyDiscoveryDb; | ||
| _logManager = logManager; | ||
| _allowNonRoutableEnrs = ShouldAcceptNonRoutableEnrs(ipResolver.ExternalIp); | ||
| IPAddress externalIp = ipResolver.Resolve().GetAwaiter().GetResult().ExternalIp; |
There was a problem hiding this comment.
Medium — Blocking .GetAwaiter().GetResult() in a constructor.
This blocks the DI resolution thread on external HTTP calls if the IP cache hasn't been warmed yet. It's safe in the current startup ordering because InitializeNetwork (which resolves DiscoveryV5App) declares SetupKeyStore as a RunnerStepDependencies, so the cache should already be populated. But this implicit dependency is invisible to the compiler and fragile: a future refactor that resolves IDiscoveryApp earlier (e.g., in a plugin Init() method) would block without any warning.
The externalIp is only used to compute _allowNonRoutableEnrs and populate the ENR builder — both one-time setup operations. The same pattern applied to RlpxHost (which uses Lazy<IPeerPool> to defer resolution) could be applied here: move the IP resolution into StartAsync() and accept the field as IPAddress? with a NotNullAfterStart invariant, or accept IIPResolver and resolve lazily.
If this is intentional and the startup ordering is considered a firm invariant, add an explicit // Called after SetupKeyStore warms the IP cache; GetAwaiter().GetResult() is safe here. comment so future maintainers understand the constraint.
| LocalNodeId = nodeKey.PublicKey; | ||
| LocalPort = networkConfig.P2PPort; | ||
| LocalIp = networkConfig.LocalIp; | ||
| IIPResolver.NethermindIp ips = ipResolver.Resolve().GetAwaiter().GetResult(); |
There was a problem hiding this comment.
Medium — Same blocking .GetAwaiter().GetResult() concern as DiscoveryV5App.
RlpxHost is slightly safer than DiscoveryV5App here because it's injected as Lazy<IPeerPool> into InitializeNetwork (the comment on line 71 of InitializeNetwork.cs explains this). So its constructor runs lazily, after the step's async Initialize() method has started — by which point SetupKeyStore has already awaited Resolve() and warmed the cache.
Still, the same implicit ordering caveat applies. Consider at minimum adding a comment here:
// SetupKeyStore.Execute() is a declared dependency of InitializeNetwork and awaits
// Resolve() before this constructor runs, so the cache is guaranteed warm.
IIPResolver.NethermindIp ips = ipResolver.Resolve().GetAwaiter().GetResult();| @@ -19,17 +19,39 @@ public class IPResolver(INetworkConfig networkConfig, ILogManager logManager) : | |||
| private readonly INetworkConfig _networkConfig = networkConfig ?? throw new ArgumentNullException(nameof(networkConfig)); | |||
| private readonly ILogManager _logManager = logManager; | |||
There was a problem hiding this comment.
Low — _logManager field is stored but never used at class level.
_logger (line 18) covers all direct logging. _logManager is only forwarded to IP source constructors within InitializeExternalIp() / InitializeLocalIp(). Since those are private methods with access to the constructor parameter, you can either pass the parameter directly without storing it as a field, or remove the field and use the local logManager parameter in the constructor-captured closure. As-is it's a dead field that adds noise.
- NodeRecordProvider/IPResolver: guard the cached resolution with Lock + cached Task; resolve with CancellationToken.None so a single caller can't fault the shared result, and honor per-call cancellation via WaitAsync. - IPResolver: log the previously-swallowed local-IP resolution exception; drop the unused _logManager field. - DiscoveryV5App/RlpxHost: document why the constructor Resolve().GetAwaiter().GetResult() is safe (cache warmed by SetupKeyStore, a declared InitializeNetwork dependency). - Remove redundant System.Threading/System.Threading.Tasks usings in the Discovery files (IDE0005; the project has ImplicitUsings enabled). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Changes
Collapse
IIPResolverto a single cached, thread-safe call and make IP resolution self-contained, removing the fragileResolveIpsinit step.IIPResolvernow exposes a singleValueTask<IIPResolver.NethermindIp> Resolve(CancellationToken)(withNethermindIp(LocalIp, ExternalIp)as a nested record). Resolution runs once, is cached, and locks concurrent callers. It still honors an explicitNetwork.ExternalIp/LocalIpoverride but no longer writes back intoINetworkConfig.ResolveIpsstep (and itsRunnerStepDependencies). It previously mutatedNetworkConfig.ExternalIp/LocalIpafter running, and consumers relied on that — a brittle ordering dependency (e.g.DiscoveryConnectionsPoolparsedNetworkConfig.LocalIp!in a field initializer and would throw if the step hadn't run).Resolve():SetupKeyStore,RlpxHost,DiscoveryApp,DiscoveryConnectionsPool,DiscoveryV5App,NodeRecordProvider,HealthChecksWebhookInfo, and the Optimism/Shutter plugins. Async call sitesawait; the few genuinely synchronous sites use the warmed cache.INodeRecordProvider.Current→ValueTask<NodeRecord> GetCurrentAsync(CancellationToken)so its (async) consumer awaits resolution instead of blocking.ShutterP2P.Start(async) instead of blocking during construction.Network.ExternalIp/LocalIpas override-only; the only remaining direct readers are the twoNetworkConfig*IPSourceoverride sources.Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
Built
Nethermind.slnxin release (0 warnings, 0 errors) and ran the affected test projects:Nethermind.Network.Discovery.Test(incl.IPResolverTests, Discv5, E2E discovery),Nethermind.Network.Test,Nethermind.HealthChecks.Test,Nethermind.Shutter.Test, and theNethermind.Runner.TestEthereumRunnerTestssmoke test — all green.dotnet format whitespaceverifies clean.Documentation
Requires documentation update
Requires explanation in Release Notes
Remarks
Note for plugin authors: the
IIPResolverandINodeRecordProviderinterface shapes changed (the former dropsLocalIp/ExternalIp/Initializein favor ofResolve(); the latter replaces theCurrentproperty withGetCurrentAsync()). Runtime behavior is preserved — config overrides are still honored and the enode/ENR are still built during init.🤖 Generated with Claude Code