Make IP resolution lazy and self-contained; remove ResolveIps step #12050

Open
asdacap wants to merge 4 commits into master from find-non-di-nethermindapi-property

@asdacap asdacap commented Jun 18, 2026

Contributor

Changes

Collapse IIPResolver to a single cached, thread-safe call and make IP resolution self-contained, removing the fragile ResolveIps init step.

  • IIPResolver now exposes a single ValueTask<IIPResolver.NethermindIp> Resolve(CancellationToken) (with NethermindIp(LocalIp, ExternalIp) as a nested record). Resolution runs once, is cached, and locks concurrent callers. It still honors an explicit Network.ExternalIp/LocalIp override but no longer writes back into INetworkConfig.
  • Removed the ResolveIps step (and its RunnerStepDependencies). It previously mutated NetworkConfig.ExternalIp/LocalIp after running, and consumers relied on that — a brittle ordering dependency (e.g. DiscoveryConnectionsPool parsed NetworkConfig.LocalIp! in a field initializer and would throw if the step hadn't run).
  • Migrated all consumers to Resolve(): SetupKeyStore, RlpxHost, DiscoveryApp, DiscoveryConnectionsPool, DiscoveryV5App, NodeRecordProvider, HealthChecksWebhookInfo, and the Optimism/Shutter plugins. Async call sites await; the few genuinely synchronous sites use the warmed cache.
  • INodeRecordProvider.Current → ValueTask<NodeRecord> GetCurrentAsync(CancellationToken) so its (async) consumer awaits resolution instead of blocking.
  • Shutter resolves its libp2p address inside ShutterP2P.Start (async) instead of blocking during construction.
  • Documented Network.ExternalIp/LocalIp as override-only; the only remaining direct readers are the two NetworkConfig*IPSource override sources.
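
To make the interface change above concrete, here is a minimal sketch of the described shape and of an async consumer awaiting it. It is not the Nethermind source: `IIpResolverSketch`, `FixedIpResolverSketch` and `EnodeSketch` are hypothetical names, and the fixed-address resolver merely stands in for the real cached resolution.

```csharp
#nullable enable
using System.Net;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical mirror of the shape described in the PR text:
// one Resolve() call returning both addresses as a nested record.
public interface IIpResolverSketch
{
    public sealed record NethermindIp(IPAddress LocalIp, IPAddress ExternalIp);

    ValueTask<NethermindIp> Resolve(CancellationToken cancellationToken = default);
}

// Stand-in resolver that returns fixed addresses (assumption for the example;
// the real resolver probes the network once and caches the result).
public sealed class FixedIpResolverSketch : IIpResolverSketch
{
    private readonly IIpResolverSketch.NethermindIp _ips;

    public FixedIpResolverSketch(IPAddress local, IPAddress external)
        => _ips = new IIpResolverSketch.NethermindIp(local, external);

    public ValueTask<IIpResolverSketch.NethermindIp> Resolve(CancellationToken cancellationToken = default)
        => new ValueTask<IIpResolverSketch.NethermindIp>(_ips);
}

// An async consumer awaits the resolver instead of reading a config value
// that some earlier init step was expected to have filled in.
public static class EnodeSketch
{
    public static async Task<string> DescribeAsync(
        IIpResolverSketch resolver, int port, CancellationToken cancellationToken = default)
    {
        IIpResolverSketch.NethermindIp ips = await resolver.Resolve(cancellationToken);
        return $"{ips.ExternalIp}:{port}";
    }
}
```

With this shape a consumer has no ordering dependency on a separate init step: whoever calls `Resolve()` first triggers the work, and everyone else awaits the same result.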

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

Built Nethermind.slnx in release (0 warnings, 0 errors) and ran the affected test projects: Nethermind.Network.Discovery.Test (incl. IPResolverTests, Discv5, E2E discovery), Nethermind.Network.Test, Nethermind.HealthChecks.Test, Nethermind.Shutter.Test, and the Nethermind.Runner.Test EthereumRunnerTests smoke test — all green. dotnet format whitespace verifies clean.

Documentation

Requires documentation update

  • Yes
  • No

Requires explanation in Release Notes

  • Yes
  • No

Remarks

Note for plugin authors: the IIPResolver and INodeRecordProvider interface shapes changed (the former drops LocalIp/ExternalIp/Initialize in favor of Resolve(); the latter replaces the Current property with GetCurrentAsync()). Runtime behavior is preserved — config overrides are still honored and the enode/ENR are still built during init.

🤖 Generated with Claude Code

asdacap and others added 3 commits June 18, 2026 10:07
Collapse IIPResolver to a single cached Resolve() returning a NethermindIp(LocalIp, ExternalIp) record. Resolution runs once (lock-guarded for concurrent callers) and no longer mutates INetworkConfig; explicit config overrides are still honored.

Migrate all consumers (SetupKeyStore, RlpxHost, DiscoveryApp, DiscoveryConnectionsPool, DiscoveryV5App, NodeRecordProvider, HealthChecksWebhookInfo, Optimism/Shutter plugins) to Resolve(), and delete the now-redundant ResolveIps init step and its step dependencies. ExternalIp/LocalIp config items are documented as override-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Nest NethermindIp as an inner record of IIPResolver (was a top-level type).

- Make OptimismPlugin.InitRpcModules async and await Resolve() instead of blocking.

- Resolve the IP inside ShutterP2P.Start (async) by passing IIPResolver through ShutterApi/ShutterP2P, removing the blocking resolve in the plugin factory.

- RlpxHost: store the resolved local IPAddress directly and drop the private LocalIp string round-trip; bind to it in Init().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace NodeRecord Current { get; } with ValueTask<NodeRecord> GetCurrentAsync(CancellationToken). Its only consumer, KademliaDiscv4Adapter (Ping and HandleEnrRequest), is already async, so the node record is now built by awaiting IIPResolver.Resolve() instead of blocking with GetAwaiter().GetResult().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
claude Bot commented Jun 18, 2026

Contributor

Claude finished @asdacap's task in 6m 25s


PR Review

  • Gather context and understand the request
  • Read relevant rule files
  • Review changed files
  • Check for correctness, edge cases, security, performance
  • Post inline comments for specific issues
  • Write summary

Summary

The refactor cleanly removes the brittle ResolveIps init step and gives IIPResolver a proper async, cached interface. The approach is sound and the migration to async consumers is mostly well-executed. Two correctness issues need addressing before merge.

Findings

High

1. NodeRecordProvider.GetCurrentAsync — race condition (comment)
_nodeRecord ??= await PrepareNodeRecord(cancellationToken) is not atomic. KademliaDiscv4Adapter.Ping() and HandleEnrRequest() are driven by concurrent discovery message handlers, so multiple callers can each see _nodeRecord is null and each invoke PrepareNodeRecord independently — signing the ENR multiple times, producing different NodeRecord instances. By contrast, IPResolver correctly guards its cache with Lock + Volatile.Read. NodeRecordProvider needs the same pattern.

2. IPResolver._resolveTask bakes in the first caller's CancellationToken (comment)
_resolveTask ??= ResolveCore(cancellationToken) stores the first caller's CT inside the long-lived cached task. If that CT is cancelled before the five external-IP retries complete (e.g., node shutdown during init), _resolveTask is permanently faulted and every future Resolve() call returns an exception with no retry path. The fix is to call ResolveCore(CancellationToken.None) so the cached resolution is never tied to one caller's lifetime.

Medium

3. Silent exception swallowing in IPResolver.ResolveCore (comment)
The catch (Exception) block for local IP resolution falls back to IPAddress.Loopback without logging anything. The robustness rules require "at minimum log the exception".

4. Blocking .GetAwaiter().GetResult() in constructors (DiscoveryV5App comment, RlpxHost comment)
These constructors rely on the IP cache already being warm when the DI container constructs them, which holds only because of startup ordering (InitializeNetwork depends on SetupKeyStore). That implicit invariant is invisible to future maintainers. DiscoveryV5App is the riskier case since it's not wrapped in a Lazy<>: if any earlier code path ever resolves IDiscoveryApp before SetupKeyStore completes, the constructor will block on external HTTP calls. At minimum, add comments documenting why .GetAwaiter().GetResult() is safe here.

Low

5. Dead _logManager field in IPResolver (comment)
_logManager is stored as a field but _logger covers all direct logging; _logManager is only forwarded to IP source constructors within private methods. The field can be removed.


The refactor goal is good and the non-blocking consumers (SetupKeyStore, DiscoveryApp, NodeRecordProvider) are properly migrated. Issues 1 and 2 are the blockers.


 NodeRecord? _nodeRecord = null;
-public NodeRecord Current => _nodeRecord ??= PrepareNodeRecord();
+public async ValueTask<NodeRecord> GetCurrentAsync(CancellationToken cancellationToken = default)

High — Race condition: ??= is not atomic for async ValueTask methods.

KademliaDiscv4Adapter.Ping() and HandleEnrRequest() are both called concurrently from incoming discovery message handlers. When multiple callers see _nodeRecord is null simultaneously, each will independently run PrepareNodeRecord — signing the ENR multiple times and ending up with different NodeRecord instances in flight. Only the last assignment wins, but callers that received an earlier result have a dangling NodeRecord that is no longer the canonical instance.

Compare with IPResolver, which uses Lock + _resolveTask to guard exactly this pattern. Apply the same approach:

private readonly Lock _lock = new();
private Task<NodeRecord>? _nodeRecordTask;

public ValueTask<NodeRecord> GetCurrentAsync(CancellationToken cancellationToken = default)
{
    Task<NodeRecord>? task = Volatile.Read(ref _nodeRecordTask);
    if (task is not null) return new ValueTask<NodeRecord>(task);
    lock (_lock)
    {
        task = _nodeRecordTask ??= PrepareNodeRecord(cancellationToken);
    }
    return new ValueTask<NodeRecord>(task);
}
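
The pattern in the snippet above can be tried in isolation. The following is a self-contained sketch under stated assumptions: `SingleFlight<T>` is a hypothetical helper, not a Nethermind type, and it uses a plain `object` lock so it compiles on runtimes without `System.Threading.Lock`. It caches the `Task` itself, so concurrent callers share one in-flight computation and all receive the same instance.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs the factory at most once and hands the same
// Task to every caller, including callers that arrive while it is running.
public sealed class SingleFlight<T>
{
    private readonly object _gate = new();
    private readonly Func<Task<T>> _factory;
    private Task<T>? _task;

    public SingleFlight(Func<Task<T>> factory) => _factory = factory;

    public ValueTask<T> GetAsync()
    {
        // Fast path: once the task has been published, no lock is taken.
        Task<T>? task = Volatile.Read(ref _task);
        if (task is not null) return new ValueTask<T>(task);

        lock (_gate)
        {
            // Only the first caller to get here invokes the factory;
            // later callers find _task already set.
            task = _task ??= _factory();
        }
        return new ValueTask<T>(task);
    }
}
```

The difference from `_value ??= await Compute()` is what gets cached: caching the awaited value leaves a window between the null check and the assignment, whereas caching the task closes that window inside the lock.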


lock (_lock)
{
task = _resolveTask ??= ResolveCore(cancellationToken);

High — First caller's CancellationToken is permanently baked into _resolveTask.

_resolveTask ??= ResolveCore(cancellationToken) means whichever caller wins the lock owns the token used for all retries. If that first caller passes a short-lived or already-cancelled token, ResolveExternalIp will throw OperationCanceledException during a Task.Delay between retries, permanently faulting _resolveTask. Every subsequent Resolve() call then immediately gets a faulted ValueTask — the resolver is stuck with no ability to retry.

SetupKeyStore is currently that first caller, because it is the first step to call Resolve() during startup. If the node's init CancellationToken fires before all 5 retries complete, IP resolution is permanently broken for the lifetime of the process.

Fix: start resolution unconditionally with CancellationToken.None so the cached result isn't bound to a single caller's lifetime:

lock (_lock)
{
    task = _resolveTask ??= ResolveCore(CancellationToken.None);
}

Per-call cancellation for the await is unaffected — callers can still wrap their own await with a timeout. The CT passed to Resolve() becomes meaningful only for cache-miss callers who want to cancel waiting (not the underlying resolution), and that could be documented in the summary comment on IIPResolver.Resolve.
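
One way to combine a shared, uncancellable resolution with per-caller cancellation is `Task.WaitAsync(CancellationToken)` (available since .NET 6), which the follow-up commit also mentions. The sketch below is illustrative only: `SharedResolution<T>` is a hypothetical type, and the real `IPResolver` may be structured differently.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical type: the shared work never observes a caller's token,
// while each caller can still abandon its own wait.
public sealed class SharedResolution<T>
{
    private readonly object _gate = new();
    private readonly Func<CancellationToken, Task<T>> _resolveCore;
    private Task<T>? _task;

    public SharedResolution(Func<CancellationToken, Task<T>> resolveCore)
        => _resolveCore = resolveCore;

    public ValueTask<T> Resolve(CancellationToken cancellationToken = default)
    {
        Task<T>? task = Volatile.Read(ref _task);
        if (task is null)
        {
            lock (_gate)
            {
                // Started with CancellationToken.None so that no single
                // caller's token can cancel or fault the cached task.
                task = _task ??= _resolveCore(CancellationToken.None);
            }
        }

        // WaitAsync cancels only this caller's wait; the cached task keeps
        // running and remains usable by later callers.
        return new ValueTask<T>(task.WaitAsync(cancellationToken));
    }
}
```

A caller whose token fires gets an `OperationCanceledException` from its own await, and the next caller still receives the shared result once it completes.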

 catch (Exception)
 {
-LocalIp = IPAddress.Loopback;
+localIp = IPAddress.Loopback;

Medium — Silent exception swallowing violates robustness rules.

The empty catch (Exception) falls back to IPAddress.Loopback without logging anything. Per the robustness rules: "Never swallow exceptions in an empty catch block — at minimum log the exception. Silent failures are the hardest to diagnose on a running node."

catch (Exception e)
{
    if (_logger.IsWarn) _logger.Warn($"Could not resolve local IP, falling back to loopback: {e.Message}");
    localIp = IPAddress.Loopback;
}

 _legacyDiscoveryDb = legacyDiscoveryDb;
 _logManager = logManager;
-_allowNonRoutableEnrs = ShouldAcceptNonRoutableEnrs(ipResolver.ExternalIp);
+IPAddress externalIp = ipResolver.Resolve().GetAwaiter().GetResult().ExternalIp;

Medium — Blocking .GetAwaiter().GetResult() in a constructor.

This blocks the DI resolution thread on external HTTP calls if the IP cache hasn't been warmed yet. It's safe in the current startup ordering because InitializeNetwork (which resolves DiscoveryV5App) lists SetupKeyStore in its RunnerStepDependencies, so the cache should already be populated. But this implicit dependency is invisible to the compiler and fragile: a future refactor that resolves IDiscoveryApp earlier (e.g., in a plugin Init() method) would block without any warning.

The externalIp is only used to compute _allowNonRoutableEnrs and populate the ENR builder — both one-time setup operations. The same pattern applied to RlpxHost (which uses Lazy<IPeerPool> to defer resolution) could be applied here: move the IP resolution into StartAsync() and accept the field as IPAddress? with a NotNullAfterStart invariant, or accept IIPResolver and resolve lazily.

If this is intentional and the startup ordering is considered a firm invariant, add an explicit // Called after SetupKeyStore warms the IP cache; GetAwaiter().GetResult() is safe here. comment so future maintainers understand the constraint.
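
As a rough illustration of the deferral suggested above, the sketch below keeps the constructor free of I/O and performs the lookup in `StartAsync`. Everything in it is hypothetical: `DiscoveryAppSketch` is not the real `DiscoveryV5App`, the resolver is passed as a delegate for brevity, and the loopback check is only a placeholder for the real `ShouldAcceptNonRoutableEnrs` logic, which this page does not show.

```csharp
#nullable enable
using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical component: construction is cheap and never blocks;
// the (possibly slow) IP lookup is awaited when the component starts.
public sealed class DiscoveryAppSketch
{
    private readonly Func<CancellationToken, ValueTask<IPAddress>> _resolveExternalIp;
    private IPAddress? _externalIp; // stays null until StartAsync completes

    public bool AllowNonRoutableEnrs { get; private set; }

    public DiscoveryAppSketch(Func<CancellationToken, ValueTask<IPAddress>> resolveExternalIp)
        => _resolveExternalIp = resolveExternalIp; // stores the delegate only

    public async Task StartAsync(CancellationToken cancellationToken)
    {
        _externalIp = await _resolveExternalIp(cancellationToken);
        // Placeholder rule for illustration only (assumption).
        AllowNonRoutableEnrs = IPAddress.IsLoopback(_externalIp);
    }

    // Fails loudly if used before StartAsync, making the ordering explicit.
    public IPAddress ExternalIp
        => _externalIp ?? throw new InvalidOperationException("StartAsync has not completed.");
}
```

The trade-off is a nullable field and a guard on access, in exchange for a constructor that stays safe regardless of when the DI container first builds the object.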

 LocalNodeId = nodeKey.PublicKey;
 LocalPort = networkConfig.P2PPort;
-LocalIp = networkConfig.LocalIp;
+IIPResolver.NethermindIp ips = ipResolver.Resolve().GetAwaiter().GetResult();

Medium — Same blocking .GetAwaiter().GetResult() concern as DiscoveryV5App.

RlpxHost is slightly safer than DiscoveryV5App here because it's injected as Lazy<IPeerPool> into InitializeNetwork (the comment on line 71 of InitializeNetwork.cs explains this). So its constructor runs lazily, after the step's async Initialize() method has started — by which point SetupKeyStore has already awaited Resolve() and warmed the cache.

Still, the same implicit ordering caveat applies. Consider at minimum adding a comment here:

// SetupKeyStore.Execute() is a declared dependency of InitializeNetwork and awaits
// Resolve() before this constructor runs, so the cache is guaranteed warm.
IIPResolver.NethermindIp ips = ipResolver.Resolve().GetAwaiter().GetResult();

@@ -19,17 +19,39 @@ public class IPResolver(INetworkConfig networkConfig, ILogManager logManager) :
private readonly INetworkConfig _networkConfig = networkConfig ?? throw new ArgumentNullException(nameof(networkConfig));
private readonly ILogManager _logManager = logManager;

Low — _logManager field is stored but never used at class level.

_logger (line 18) covers all direct logging. _logManager is only forwarded to IP source constructors within InitializeExternalIp() / InitializeLocalIp(). Since those are private methods with access to the constructor parameter, you can either pass the parameter directly without storing it as a field, or remove the field and use the local logManager parameter in the constructor-captured closure. As-is it's a dead field that adds noise.

- NodeRecordProvider/IPResolver: guard the cached resolution with Lock + cached Task; resolve with CancellationToken.None so a single caller can't fault the shared result, and honor per-call cancellation via WaitAsync.

- IPResolver: log the previously-swallowed local-IP resolution exception; drop the unused _logManager field.

- DiscoveryV5App/RlpxHost: document why the constructor Resolve().GetAwaiter().GetResult() is safe (cache warmed by SetupKeyStore, a declared InitializeNetwork dependency).

- Remove redundant System.Threading/System.Threading.Tasks usings in the Discovery files (IDE0005; the project has ImplicitUsings enabled).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>