Skip to content

feat: Network Relay Registry (Phase 1) + RFC 07 In-Process PeerResolver (PRs #471, #494, #496, #497, #499, #501 bundled)#506

Merged
branarakic merged 39 commits into
mainfrom
feat/rfc07-pr5-gossipsub-msgid
May 13, 2026
Merged

feat: Network Relay Registry (Phase 1) + RFC 07 In-Process PeerResolver (PRs #471, #494, #496, #497, #499, #501 bundled)#506
branarakic merged 39 commits into
mainfrom
feat/rfc07-pr5-gossipsub-msgid

Conversation

@branarakic
Copy link
Copy Markdown
Contributor

What this PR is

A consolidation of two stacked feature trains that landed correctly on their feature branches but never reached main:

  1. PR feat: Network Relay Registry — Phase 1 (relayCapable + multiaddrs on chain) #471 — Network Relay Registry Phase 1 (relayCapable + multiaddrs on chain). This was already approved/MERGEABLE on feat/461-relay-registry.
  2. The RFC 07 stack — In-Process PeerResolver (PRs core(network): add Network interface + LibP2PNetwork wrapper (RFC 07 PR-1) #494, core(network): add PeerResolver + migrate Messenger (RFC 07 PR-2) #496, core(network): migrate ProtocolRouter to consult PeerResolver (RFC 07 PR-3) #497, core(network): migrate /api/connect + add CI grep gate (RFC 07 PR-4) #499, core(network): lock in dkgGossipMsgId for cross-backend dedup (RFC 07 PR-5) #501). All 5 were merged via the GitHub UI bottom-up earlier this evening, but each merged into its intermediate stacked feature branch (its actual base) rather than cascading down to feat/461-relay-registry. Net result: only PR core(network): add Network interface + LibP2PNetwork wrapper (RFC 07 PR-1) #494 made it onto feat/461-relay-registry; the other four (core(network): add PeerResolver + migrate Messenger (RFC 07 PR-2) #496-501) ended up stranded on their dead-end intermediates.

This PR points the tip of that work — feat/rfc07-pr5-gossipsub-msgid — directly at main, which is the only branch that contains every commit from the entire intended train.

Supersedes / closes

What's bundled

Source PR Subject Status before this PR
#471 Relay Registry Phase 1 (Profile.relayCapable + multiaddrs, EVM/mock adapters, daemon publish on startup) open, approved, never merged
#494 RFC 07 PR-1: Network interface + LibP2PNetwork wrapper (incl. Codex round-1 fix: runOnLimitedConnection + handler async/abort) merged to relay-registry, but the round-1 fix never made it forward
#496 RFC 07 PR-2: PeerResolver + Messenger migration (incl. round-6 prime-then-append fix + abort-listener race fix) merged to dead-end branch only
#497 RFC 07 PR-3: ProtocolRouter consults PeerResolver merged to dead-end branch only
#499 RFC 07 PR-4: /api/connect migration + CI grep gate (incl. round-5 fixes: dial-phase abort detection, audit aliasing/destructuring, devnet §28c/§30 strict mode) merged to dead-end branch only
#501 RFC 07 PR-5: dkgGossipMsgId + dkgGossipMsgIdRaw (incl. round-6 empty-publisher rejection) merged to dead-end branch only

Stats

  • 38 commits ahead of main
  • 37 files changed, ~5K insertions, ~250 deletions
  • All changes already reviewed across the original 6 PRs

Test state on tip

  • 54/54 tests pass: libp2p-network + protocol-router + protocol-router-resolver + peer-resolver + gossip-msg-id
  • 33/33 audit-dial-protocol.test.mjs pass
  • audit-dial-protocol.mjs runs clean against the codebase
  • Devnet test (scripts/devnet-test.sh) §28-30 cover RFC 07 connectivity end-to-end (most recent run: 7 RFC 07 tests PASS, 30/30 cross-node matrix, post-restart resilience strict-mode PASS)

Why a new PR rather than fix-forward on #471

The cleanest alternative was to merge feat/rfc07-pr5-gossipsub-msgid into feat/461-relay-registry and let #471 carry it. That involves a merge commit and likely conflict resolution against the pre-Codex-fix #494 head that's already on relay-registry. Opening this PR directly is one fewer step and produces a clean linear-ish history into main.

Post-merge cleanup

Once this is merged, the following branches can be deleted:

  • feat/461-relay-registry
  • feat/rfc07-pr1-network-iface
  • feat/rfc07-pr2-peer-resolver
  • feat/rfc07-pr3-protocol-router
  • feat/rfc07-pr4-connect-and-gate
  • feat/rfc07-pr5-gossipsub-msgid (the head of this PR)

Made with Cursor

Branimir Rakic and others added 30 commits May 13, 2026 01:45
…461)

Phase 1A of the Network Relay Registry rollout. Extends `ProfileInfo`
with two on-chain fields the rest of Phase 1 relies on:

- `bool relayCapable` — operator-controlled opt-in that this node will
  serve circuit-relay traffic for the network. Phase 3 NetworkStateRegistry
  prefers chain-attested relays during peer discovery.
- `string[] multiaddrs` — externally-reachable libp2p multiaddrs the
  node currently advertises. Phase 3 edges resolve circuit-relay paths
  from this set instead of the hardcoded network/*.json relay list.

`Profile.sol` exposes two `onlyIdentityOwner`-gated entry points
(`updateRelayCapable`, `updateMultiaddrs`) so a running node with an
operational key can refresh its multiaddrs each restart without admin
signing. `updateMultiaddrs` enforces modest bounds (≤8 entries,
≤256 bytes each) and emits `MultiaddrsUpdated` for off-chain consumers.

Version bumps signal the storage-layout extension to operators:
- ProfileStorage 1.0.0 → 1.1.0 (struct extended at the tail; existing
  keys read new fields as zero/empty).
- Profile 1.1.0 → 1.2.0 (new entry points).

13 new Solidity unit tests covering happy-path, owner gating, bound
violations, wholesale-replacement semantics, and the extended
`getProfile` tuple.

Co-authored-by: Cursor <cursoragent@cursor.com>
…#461)

Phase 1B of the Network Relay Registry rollout. Surfaces the on-chain
relay-registry fields added in Phase 1A through `ChainAdapter` so the
agent / publisher / future NetworkStateRegistry don't have to reach
into the Profile ABI directly.

Four new optional methods on `ChainAdapter`:
- `getRelayCapable(identityId)` / `getMultiaddrs(identityId)` — read
  any node's chain-attested relay state.
- `setRelayCapable(bool)` / `setMultiaddrs(string[])` — publish this
  node's own state; resolves identityId from the bound signer.

`EVMChainAdapter` resolves `ProfileStorage` (tolerated as optional so
older Hub deployments still init), implements all four methods, and
adds `MultiaddrsUpdated` + `RelayCapabilityUpdated` to `listenForEvents`
so Phase 3 NetworkStateRegistry can refresh its cache without a follow-up
storage read per peer.

`MockChainAdapter` keeps an in-memory mirror keyed by identityId, with
the same client-side bound checks (≤8 entries, ≤256 bytes each) so
adapter-parity tests catch drift on either side. Events flow through
the generic `pushEvent` / `listenForEvents` pipe.

Tests:
- 9 mock-adapter tests covering reads, writes, idempotence, bound
  violations, and the event surface.
- 8 EVM-adapter tests against a real Hardhat node (round-trip,
  wholesale-replacement, fast-fail, event roundtrip).

Existing chain test suite (231 tests, including ABI pinning) stays green.

Co-authored-by: Cursor <cursoragent@cursor.com>
…4 / #461)

Phase 1C of the Network Relay Registry rollout. Wires the chain-adapter
surface from Phase 1B into the daemon startup so operators get on-chain
relay attestation for free — every restart pushes the node's current
externally-reachable multiaddrs (and optionally the relayCapable flag)
to its on-chain Profile.

`DKGAgent.publishRelayRegistry({ relayCapable })` is the new method:
- Filters libp2p `multiaddrs` down to the externally-dialable subset
  (public IPv4/IPv6/DNS plus /p2p-circuit/ for NAT'd edges); local /
  RFC1918 / link-local entries are dropped.
- Idempotent: compares the candidate set against what's already on
  chain and skips the tx when they match. Safe to call every restart.
- Bound-clamped to MAX_MULTIADDRS=8 (mirrors ProfileLib).
- Best-effort and no-throw: missing chain config, no profile, adapters
  without the relay-registry surface, and chain RPC errors are all
  logged-and-skipped rather than crashing the daemon.

`isPublicLikeAddress` / `isLocalOrInternalHostname` are re-exported from
`@origintrail-official/dkg-core` so the agent can reuse the same
classifier the existing daemon dialability log uses.

Daemon wiring fires the publish on a non-blocking setTimeout(0) right
after the relay-reservation log, so chain RPC slowness can't gate
daemon ready. New `relayCapable?: boolean` flag on `DkgConfig` lets
operators opt into advertising relay service (off by default — relay
provision has bandwidth/availability implications).

8 unit tests pin the filter behaviour (loopback / RFC1918 / link-local
rejection, /p2p-circuit/ acceptance, dedup, defensive against runtime
garbage). Full agent test suite (439 tests) stays green.

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end soak harness for the 6-node devnet (4 core + 2 edge):

- Bootstraps 6 random delegator wallets to reach 10 total staking wallets
  (4 op-wallets from devnet.sh staking 50k each + 6 fresh delegators
  staking 25k each on cores 1-4 in a 2/2/1/1 split). Asymmetric stake
  on purpose so RS score-per-stake variance is observable.
- Drives a positive workload: round-robin `dkg publish` across both
  registered CGs and across all 4 core nodes via the existing CLI surface
  (no chain-direct shortcuts beyond the bootstrap setup that devnet.sh
  already does for staking).
- Runs three background loops for the duration:
    * publisher loop  — N-Quad publishes every PUBLISH_INTERVAL_SEC
    * observer loop   — snapshots every node /api/random-sampling/status,
                        /api/status, /api/connections + on-chain
                        getNodeEpochScore / getNetNodeEpochRewards every
                        SNAPSHOT_INTERVAL_SEC
    * event listener  — polls RandomSamplingStorage for
                        EpochNodeValidProofsCountIncremented +
                        NodeEpochScoreAdded + ActiveProofPeriodStartBlockSet
                        (the actual on-chain proof signal — RandomSampling
                        itself does not emit a ProofSubmitted event)
- Negative scenarios woven in:
    * publish to unregistered CG -> expect HTTP 4xx
    * edge nodes 5+6 must report `enabled=false` and submittedCount=0
      across every snapshot
    * mid-soak: gracefully bounce one core for ~120s, assert API recovers
      and prover resumes (skip with SOAK_SKIP_RESTART=1)
- Final report at OUT_DIR/findings.md with pass/warn/fail breakdown,
  per-core stake/score/reward table, and per-delegator rolling rewards.
  Exits non-zero on any FAIL.

Also shortens the development Chronos.epochLength from 30 days to 30
minutes. The default makes a 2h soak observe 0.28% of an epoch — neither
setNetNodeEpochRewards nor delegator rolling rewards ever fire, so the
test cannot exercise reward distribution. 30 minutes lets a 2h run cross
~3 epoch boundaries and watch the full accrue -> finalize -> claim cycle.

3-min smoke against a fresh devnet clears 20/20 assertions:
- 15/15 publishes confirmed on chain
- All 4 cores submit proofs (1->10, 2->4, 3->4, 4->2 over 4 minutes)
- 10 EpochNodeValidProofsCountIncremented events from 4 distinct ids
- Edge nodes 5+6 stay disabled across all snapshots
- Score asymmetry visible: high-stake core 1 score 1.87e18 vs low-stake
  core 4 score 1.02e17 (both proving)

Companion to PR #471 (Network Relay Registry Phase 1).

Co-authored-by: Cursor <cursoragent@cursor.com>
Reverts the development.Chronos.epochLength change from 30d -> 30min
that landed alongside the soak runner. The shorter epoch was intended
to let a 2h soak observe epoch transitions, but it also broke
unrelated Solidity tests (3 of 4 hardhat shards red on PR #471) that
have hardcoded expectations on the canonical 30-day dev epoch.

The soak script now reads the deployed Chronos.epochLength on startup
and surfaces a clear instruction when the epoch is too long for the
soak duration: edit parameters.json + clean+restart devnet manually.
We deliberately do not patch parameters.json from the script because
it lives on the Solidity test path; touching it implicitly would
break CI for everyone running the soak in a side branch.

Trade-off: with the default 30d epoch the 2h soak still validates RS
prover ticks, on-chain submitProof events, in-epoch score accrual, and
publish + edge-disable + restart-recovery. Reward DISTRIBUTION
(setNetNodeEpochRewards + delegator-roll updates) is the one thing
suppressed — operator runs the manual flow if they care about that.

Co-authored-by: Cursor <cursoragent@cursor.com>
Round 1 of the 2h soak surfaced three issues with the test harness
itself (not the node code):

1. The heartbeat one-liner
     pcount=$([ -f X ] && wc -l < X | tr -d ' ' || echo 0)
   blew up at the END_TS boundary with `=NNNN: command not found`
   because of bash operator precedence interactions between `&&`,
   `|`, and `||` inside `$(...)` under load. Replaced with a plain
   `safe_line_count` helper so the heartbeat survives weird filesystem
   timing.

2. The RS-prover assertion read `submittedCount` from the LAST
   snapshot, but the last 1-2 snapshots can be empty (`rs: {}`,
   `uptime_ms: null`) when the agent or the snapshot fetcher is
   mid-shutdown — a healthy core that submitted 31 proofs then
   misreports as "never submitted". Switched to PEAK across the full
   timeseries; non-empty `rs` snapshots are tracked separately and
   reported as a soft warning.

3. The `epochScore`-growth assertion compared start vs. end of soak,
   but `epochScore` resets to 0 at every epoch transition. With a
   30-min epoch and a 2h window, the final snapshot lands in a fresh
   epoch and reports all zeros. Switched to aggregating per-identity
   total score from `NodeEpochScoreAdded` chain events across ALL
   epochs in the soak — gives a clean stake-weighted asymmetry view
   regardless of when the run starts/ends in the epoch cycle.

Re-running the new logic against R1's intact data: 22 ok / 1 warn /
0 fail. Stake-weighted asymmetry now visible: identity 1 (100k stake)
earned ~2.6× the score of identity 4 (75k stake).

Also clarified the `netNodeEpochRewards=0` "warning" — that field is
populated lazily on first reward claim (StakingV10 path), not at
epoch transition. The soak observes accrual via `epochScore` events,
which is the actual signal of reward-pipeline liveness.

Co-authored-by: Cursor <cursoragent@cursor.com>
…PI gap

The "OBSERVER N snapshots had empty rs:{}" warning was conflating two
unrelated cases: edge nodes that legitimately have `loop: null` (no
prover, role-gated), and core nodes where the agent API was briefly
unavailable. R1 reported "34 empty rs" mostly because of edges 5+6
appearing in every snapshot.

Now we only count core-node snapshots with the entire `rs` object
missing/empty (a real API transient), and break the count down per
node so the operator can see where the gap was. R1 re-evaluated:
node2=2, node3=20, node4=12 — node3 at ~19% gap rate is worth
keeping an eye on across rounds.

Co-authored-by: Cursor <cursoragent@cursor.com>
R2 finished with `started epoch: 8 → ended epoch: 13` (5 transitions)
but the assertion misreported "no epoch transition observed". Cause:
`stakeAfter.epoch` and `stakeBefore.epoch` are JSON strings, and the
bare `>` operator does lexicographic comparison — "13" > "8" returns
false. Switched to BigInt() for a numeric compare.

R2 also surfaced an operational finding worth recording (no code fix
needed): the in-process oxigraph backend (nodes 3+4) caused 54-66%
API gap rates by R2, up from 11-19% in R1, as the KC corpus grew.
The peak-aggregation logic correctly extracted the right answer
across the gaps; the warning is informational about the backend
trade-off, not a defect in the node.

Co-authored-by: Cursor <cursoragent@cursor.com>
RFC 04 Network State Registry v0.3 narrows the on-chain Profile
surface: multiaddrs no longer live on Profile (they will live in
per-RS-round attestation KCs minted by submitProofV2 in Phase 2).
Reverts the multiaddrs storage path introduced earlier in this PR
while preserving the relayCapable flag and surrounding scaffolding,
which Phase 2 still needs.

Surgical revert only — net -733 lines:

Contracts (evm-module)
  ProfileLib.sol: drop multiaddrs[] field, MAX_MULTIADDRS /
    MAX_MULTIADDR_LENGTH constants, three multiaddr error types.
    Keep relayCapable.
  ProfileStorage.sol: drop MultiaddrsUpdated event,
    getMultiaddrs/setMultiaddrs methods. getProfile no longer
    returns string[]. Keep RelayCapabilityUpdated +
    getRelayCapable/setRelayCapable.
  Profile.sol: drop updateMultiaddrs entry point + its validation.
    Keep updateRelayCapable.
  Profile.test.ts: drop multiaddrs describe block; keep relayCapable
    suite (6 tests). All 27 unit tests pass.
  ABIs regenerated from compiled contracts.

Chain adapter
  chain-adapter.ts: drop optional getMultiaddrs/setMultiaddrs from
    interface.
  evm-adapter.ts: drop concrete impl + MAX_MULTIADDRS constants +
    MultiaddrsUpdated event handling.
  mock-adapter.ts: drop multiaddrsByIdentity map + impl + event.
  Adapter tests refactored to cover only the relayCapable round-trip
  and event surface. All 229 chain tests pass.

Daemon
  dkg-agent.ts: drop filterPublishableMultiaddrs helper. Slim
    publishRelayRegistry() to flag-only sync (no candidate filter,
    no clamp, no diff, no setMultiaddrs).
  relay-registry-publish.test.ts: deleted (its only target is gone).
  lifecycle.ts: comment updated to match flag-only semantics.
  cli/config.ts: relayCapable doc updated to point at RFC 04 v0.3 §5.2.
  core/index.ts: revert isPublicLikeAddress / isLocalOrInternalHostname
    re-exports (only consumer in dkg-agent.ts is gone).
  All 431 agent tests pass.

Doc cross-link
  messenger.ts: comment now points at 04_NETWORK_STATE_REGISTRY.md
    + 07_IN_PROCESS_PEER_RESOLVER.md.

Validation:
  npx hardhat compile: clean
  pnpm -F dkg-core dkg-chain dkg-agent dkg build: all green
  Lint: clean across all touched files
  Profile.test.ts: 27/27
  chain tests: 229/229
  agent tests: 431/431

This is Phase 0 of RFC 04 v0.3 §8 migration plan. Phase 1 (relayCapable
flag + adapter scaffolding + daemon plumbing) was already complete on
this branch. Phase 2 (submitProofV2, attestation KCs) lands separately.

Co-authored-by: Cursor <cursoragent@cursor.com>
…PR-1)

Introduces the transport-pluggable boundary RFC 07 §5 commits to:

- packages/core/src/network/network.ts — Network, NodeIdentity, Address,
  DialOpts, ProtocolHandler. The minimum surface every dial-path
  consumer needs (sync, chat, skill invoke, /api/connect, …) so the
  application stops importing libp2p directly outside one location.
- packages/core/src/network/libp2p-network.ts — LibP2PNetwork is a thin
  facade over an existing DKGNode; no state is duplicated, lifecycle
  is shared.
- packages/core/test/libp2p-network.test.ts — exercises the contract
  against two real libp2p instances (dial round-trip, getConnections,
  addKnownAddresses, unhandle, lifecycle idempotency).

This is purely additive. DKGNode.libp2p direct accessor stays during
the RFC 07 transition; no consumer is migrated in this PR. Subsequent
PRs introduce PeerResolver (PR-2) and migrate Messenger /
ProtocolRouter / /api/connect (PR-3 / PR-4) to consume Network, ending
in a CI grep gate that disallows raw dialProtocol(peerId) outside
peer-resolver.ts.

Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §5
Co-authored-by: Cursor <cursoragent@cursor.com>
…Network (Codex PR #494)

LibP2PNetwork's first cut silently dropped two behaviours that the
existing pre-RFC-07 dial path (ProtocolRouter) relied on. The first
consumer migrated to Network.dialProtocol() / Network.handle() would
have regressed on relay-only links and on async handler rejections.

Two fixes:

1. dialProtocol() and handle() now both pass `runOnLimitedConnection: true`
   like ProtocolRouter.send / ProtocolRouter.register already do. V10
   edge↔core traffic regularly traverses circuit-relay limited
   connections (RFC 04); without this flag, libp2p refuses to use a
   limited-only connection for protocol streams. Codex called this out
   on dialProtocol; same fix applies to inbound handle().

2. handle() now awaits the user-supplied handler in a try/catch and
   stream.abort()s on failure. The previous wrapper fire-and-forgot
   the promise, so an async handler rejection became an
   unhandledRejection AND left the inbound stream half-open until
   libp2p's own timeout fired. Mirrors the try/catch+abort pattern in
   ProtocolRouter.register exactly.

New test (handler-rejects/1.0.0) is a regression guard: a handler that
intentionally throws after a 50ms barrier; the dialer's drain must
terminate in well under 5s. Without the fix it hangs to the 10s vitest
test timeout.

8/8 LibP2PNetwork tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
Introduces the single in-process service every dial path will consult
before dialProtocol. PR-2 migrates Messenger as the first consumer;
ProtocolRouter and /api/connect follow in PR-3 and PR-4.

New (packages/core):
- src/network/network-state-registry.ts — NetworkStateRegistry interface
  + StubNetworkStateRegistry. Returns empty until RFC 04 Phase 2 lands;
  the interface is fixed so the future PR can swap the impl without
  touching consumers.
- src/network/peer-resolver.ts — PeerResolver implementing RFC 07 §3.1
  resolution order:
    1. Active-connection cache (network.getConnections)
    2. DHT lookup (network.findPeer with per-step timeout)
    3. NetworkStateRegistry (stubbed)
    4. AgentDirectoryLookup (agents-CG SPARQL fallback; minimal interface
       defined here so packages/core has no dependency on packages/agent)
    5. Bootstrap seeds (only when 1-4 produced nothing)
  Each step that finds addresses calls network.addKnownAddresses so the
  transport's address book is primed before the caller dials. Returns
  deduplicated, freshest-first; never throws.
- test/peer-resolver.test.ts — 13 cases covering the resolution order,
  failure tolerance, dedup, and the stubbed health-scoring surface.

Migrated (packages/agent):
- src/p2p/messenger.ts — drops Libp2pLike + DiscoveryClient deps;
  consumes PeerResolver instead. The chat-only inline patching path
  (ensureCircuitRelayAddress) is now subsumed by step 4 of the
  resolver — same SPARQL lookup, same /p2p-circuit/p2p/<peerId>
  shape, but available to every dial path that consults the resolver.
- test/p2p-messenger.test.ts — slimmed to verify the structural
  property (resolver fires before router.send) and timeout forwarding;
  resolution-order cases moved to peer-resolver.test.ts where they
  belong.
- src/dkg-agent.ts — wires LibP2PNetwork + PeerResolver +
  StubNetworkStateRegistry, exposes peerResolver as a public field
  so PR-3 / PR-4 can reuse the same instance.

Tests:
- 648/648 pass in core (+13 new)
- 429/429 pass in agent

Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §3
Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #496:

1. Bootstrap seeds are addresses of seed peers (curators/relays), NOT
   addresses for the requested target peerId. Returning them as
   candidates would either fail loudly (libp2p rejects /p2p/<seed>
   vs requested target peerId mismatch) or quietly mislead the
   caller. Bootstrap is a libp2p-startup concern (`bootstrap({ list })`
   peerDiscovery in node.ts) — not a per-peer resolution concern.
   Removed step 5 entirely, dropped bootstrapSeeds from PeerResolverDeps,
   updated dkg-agent.ts construction site, dropped two bootstrap-related
   test cases, kept the "returns empty array when nothing resolves" case
   (now without seeds in the picture).

2. Registry step (step 3) was the only lookup path not in try/catch,
   contradicting the documented "never throws" contract. Once the real
   NetworkStateRegistry lands (DB / mutex / SWM-backed), a transient
   hiccup would abort the whole resolve() and silently skip steps 4+.
   Wrapped registry.lookup() + addKnownAddresses() in the same
   best-effort handler the DHT and agents-CG steps use, added a
   regression test (registry throws → falls through to step 4 cleanly).

Tests:
- core 13/13 in peer-resolver.test.ts (was 13, refactored 2 cases,
  added 1 case)
- agent build clean

Co-authored-by: Cursor <cursoragent@cursor.com>
…nger

Codex review feedback on PR #496 (round 2):

1. PeerResolver step 1 (`network.getConnections(peerId)`) was outside
   any try/catch. The class doc-comment promises `resolve()` never
   throws, but `LibP2PNetwork.getConnections` calls
   `peerIdFromString(peerId)` under the hood and that throws on a
   malformed peerId. Callers downstream (ProtocolRouter,
   connectToPeerId) would see an unhandled exception instead of an
   empty result they could surface as PEER_NOT_FOUND / DIAL_FAILED.

   Fix: wrap the live-conn lookup in the same best-effort handler
   the later steps use. A bad peerId now falls through, every later
   step fails identically, the resolver returns []. Added a regression
   test that simulates `getConnections` throwing.

2. Messenger.sendToPeer was firing `resolver.resolve(peer)` with no
   options, so a caller passing `opts.timeoutMs: 1000` could block
   5s+ on the resolver's default per-step DHT timeout before the
   router.send dial loop even started.

   Fix: when timeoutMs is provided, pass an AbortSignal +
   perStepTimeoutMs through to resolve(). PR-3 of the RFC 07 stack
   moves this resolver call into ProtocolRouter and removes the
   inline call entirely; until then this passthrough keeps PR #496
   sound on its own merits.

Tests:
- core peer-resolver 14/14 (was 13, +1 for live-conn try/catch)
- core + agent build clean

Co-authored-by: Cursor <cursoragent@cursor.com>
…budget passthrough

Codex review feedback on PR #496 (round 3):

1. perStepTimeoutMs silently lost when opts.signal is set.
   `LibP2PNetwork.findPeer` honoured the signal and ignored the
   timeoutMs, so any caller passing both would silently lose the
   per-step cap. The exported `ResolveOpts` contract said both were
   real.

   Fix: compose a step-local AbortSignal that combines BOTH inputs
   (`AbortSignal.any([AbortSignal.timeout(perStepMs), opts.signal])`)
   and pass that to `network.findPeer`. The contract now matches
   runtime: per-step cap fires at min(perStepMs, time-until-outer-
   abort).

2. Outer signal not checked between resolver steps.
   PR #499 round 2 added a single AbortSignal to bound the whole
   `connectToPeerId` path, but only step 2 (DHT) propagated it to
   `findPeer`. After a DHT timeout the registry + agents-CG steps
   still ran unbounded, defeating the deadline guarantee.

   Fix: check `opts.signal?.aborted` before each step. An aborted
   signal returns whatever's been accumulated so far (or [] if
   nothing yet). Added two regression tests: pre-aborted signal
   skips all steps, and mid-resolve abort prevents later steps from
   running.

3. Messenger.timeoutMs double-spent (round-2 regression).
   Round 1 fix added `opts.timeoutMs` passthrough to
   `Messenger → resolver.resolve()` so the resolver wouldn't run
   unbounded. But `router.send()` then ALSO consumed `opts.timeoutMs`
   with a fresh budget, so cold-peer sends could block ~2x the
   advertised timeout.

   Fix: revert the round-1 passthrough. The proper fix would share
   a single AbortSignal across resolver + router.send, but
   `ProtocolRouter` doesn't accept an external signal today (it
   builds its own internally). Plumbing that through is its own
   refactor — and PR-3 of the RFC 07 stack already moves this code
   path INTO ProtocolRouter (where the budget can be shared without
   leaking through public surfaces) and reduces Messenger to a
   pass-through. This entire Messenger code path is transient;
   reverting the round-1 fix avoids the double-spend without needing
   a router refactor that PR-3 supersedes anyway.

   The pre-PR Messenger callers (chat / sync) already used generous
   timeouts (≥30s), so the resolver's 5s default doesn't push them
   over. Documented the trade-off in the call site.

Tests:
- core peer-resolver 17/17 (was 14, +3 round-3 tests)
- core + agent build clean

Co-authored-by: Cursor <cursoragent@cursor.com>
…Codex PR #496)

Codex review feedback on PR #496 (round 4):

1. NetworkStateRegistry.lookup was sync (returned Address[]).
   The v1 stub is in-memory and could be sync, but the doc explicitly
   said "the real RFC 04 implementation may need to hit a database /
   mutex / SWM channel". Freezing the contract as sync now would
   either force the real impl to block the event loop or require a
   public-API breaking change in @origintrail-official/dkg-core later.

   Fix: change `lookup` to return `Promise<Address[]>`. The v1 stub
   returns a resolved promise instantly (no observable cost). Keeps
   the upgrade path open from day one.

2. AgentDirectoryLookup.findRelayForPeer had no signal param, so step
   4 of PeerResolver couldn't cancel an in-flight SPARQL query when
   the outer signal aborted. The pre-step `signal.aborted` check
   added in round 3 prevents NEW step-4 work from starting once the
   deadline elapses, but a query that's already running would block
   until SPARQL returned naturally — defeating the
   `connectToPeerId({ timeoutMs })` end-to-end deadline guarantee.

   Fix: add an optional `opts.signal` param to
   `AgentDirectoryLookup.findRelayForPeer`. Resolver passes
   `opts?.signal` through. The legacy DiscoveryClient wrapper in
   dkg-agent.ts intentionally ignores it for now (DiscoveryClient
   doesn't expose cancellation today; wiring it through SPARQL is
   a follow-up that touches DiscoveryClient internals). The contract
   is best-effort — implementations that can cancel get the benefit;
   those that can't keep working as before.

   Pinned by a new test "step 4 (Codex PR #496 round 4): forwards
   opts.signal to agentDirectory.findRelayForPeer".

Tests:
- core peer-resolver 18/18 (was 17, +1 round-4 test, 1 existing test
  updated for the new findRelayForPeer signature)
- core + agent build clean
- no other consumers of the changed interfaces (only PeerResolver +
  tests + dkg-agent.ts construction site)

Co-authored-by: Cursor <cursoragent@cursor.com>
…odex PR #496 round 5)

Codex review feedback: the previous revision dropped opts.signal in
the only production AgentDirectoryLookup, leaving the resolver's
documented cancellation guarantee unhonored. A slow SPARQL
findAgentByPeerId could outlive the caller's deadline.

DiscoveryClient itself doesn't (yet) accept an AbortSignal, but
honoring the contract at the adapter boundary is enough: the
adapter races the lookup against the abort signal, so when the
signal fires the resolver gets an immediate `null` and can move on
or surface a timeout. The underlying SPARQL fetch leaks (completes
in the background and its result is discarded), bounded by the
discovery client's own internal timeout. Acceptable trade-off
because:
  - RFC 04 Phase 2 replaces this fallback path entirely
  - the alternative (full DiscoveryClient signal threading) is out
    of scope for this PR
  - the leak is bounded by an existing internal timeout, not unbounded

The resolver pre-step `signal.aborted` check (added in round 3) still
prevents NEW agent-directory work from starting once the deadline
elapses; this round-5 race covers the case where the SPARQL is
already in flight when the signal fires.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ds (Codex PR #496 round 6)

The resolver's contract is "non-empty Address[] return ⇒ peerStore is
primed for these addresses". Earlier rounds appended addrs to
`accumulated` BEFORE awaiting `addKnownAddresses`, so a peerStore
merge failure (malformed/stale multiaddr, transient peerStore error)
left the caller seeing successful resolution while the bare peer-id
dial silently missed every address.

Fix: extract a `primeAndAppend()` helper used by steps 2-4 (DHT,
registry, agents-CG) that awaits the merge first and only pushes the
addr into `accumulated` on success. Step 1 (live-conn) doesn't need
it — those addrs come straight from active connections, no priming
required.

Regression test: a network mock whose addKnownAddresses always throws
returns [] instead of advertising "primed" addrs the dial would miss.

19/19 PeerResolver tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
… (Codex PR #499 round 5)

The race window: between the entry-time `if (opts?.signal?.aborted)`
check and `signal.addEventListener('abort', ...)` inside the new
Promise constructor, the signal could fire. Since the `abort` event
is one-shot, our late listener would never see it and the Promise
would hang for the full lookup duration — defeating the entire
point of the race wrapper.

Fix: re-check `signal.aborted` INSIDE the Promise constructor (atomic
with the addEventListener call). If we lost the race, resolve null
immediately.

Same hoisting move pulls `opts.signal` into a stable `signal` const
so the constructor closure doesn't have to assert non-null on each
access.

Co-authored-by: Cursor <cursoragent@cursor.com>
… PR-3)

ProtocolRouter.send() now consults the PeerResolver before dialing,
so every protocol that goes through the router (sync, skill invoke,
OpenClaw channel, query-remote, storage-ack, verify-proposal,
join-request, swm-sender-key, plus all of Messenger's chat traffic
that already routes through here) hits the same resolution order
that Messenger.sendToPeer got in PR-2.

Changes:
- packages/core/src/protocol-router.ts: introduces ProtocolRouterOptions
  with an optional peerResolver. send() invokes resolver.resolve(peerIdStr)
  best-effort before the dial loop. Resolver step 1 (live-conn cache)
  is sub-millisecond, so the cost is invisible when a connection
  already exists; the win is on cold dials, where the resolver primes
  the libp2p peerStore with whatever the resolution order finds
  (live-conn → DHT → RFC 04 registry stub → agents-CG → bootstrap).
  peerResolver is optional during the migration; tests that don't
  exercise resolution can omit it. PR-4 adds the CI grep gate that
  enforces the property structurally.
- packages/core/src/index.ts: exports ProtocolRouterOptions.
- packages/agent/src/dkg-agent.ts: reorders construction to build
  network → resolver → router → messenger so the same resolver is
  shared across both consumers.
- packages/core/test/protocol-router-resolver.test.ts: 3 cases
  verifying the resolver is consulted exactly once per send with the
  correct peerId, that resolver failures don't block send, and that
  omitting the resolver preserves legacy behaviour.

Tests:
- 651/651 pass in core (+3 new)
- 429/429 pass in agent (no regressions from the constructor reorder)

Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §3.2
Co-authored-by: Cursor <cursoragent@cursor.com>
…er resolve

Codex review feedback on PR #497:

1. ProtocolRouter.send was calling peerResolver.resolve() with no
   options, so the resolver ran with its default 5_000ms per-step DHT
   timeout regardless of the caller's `timeoutMs`. A caller passing
   `timeoutMs: 1000` could block 5s+ on resolution before the dial
   even started, blowing the timeout budget by 5x.

   Fix: pass the same AbortSignal + remaining time budget to the
   resolver. The dial loop signal is already constructed at the top
   of send(); reusing it ensures a single end-to-end deadline covers
   resolution + dial + read + retries.

2. Messenger.sendToPeer was doing its own peerResolver.resolve() call
   before delegating to router.send(), which (since PR-3) does the
   same lookup. Cold sends paid the DHT walk twice. Removed the
   duplicate from Messenger; the resolver dependency is now owned by
   ProtocolRouter alone, and Messenger is a thin pass-through to the
   router. Updated MessengerDeps + dkg-agent.ts construction site.

   Test rewrite: p2p-messenger.test.ts no longer mocks a resolver
   (Messenger doesn't hold one); the structural property "resolver
   primes peerStore before dialProtocol" is verified at the router
   layer in protocol-router-resolver.test.ts.

Tests:
- core 16/16 in peer-resolver + protocol-router-resolver
- agent 3/3 in p2p-messenger
- core + agent build clean

Co-authored-by: Cursor <cursoragent@cursor.com>
…ex PR #497)

Codex review feedback on PR #497 (round 3):

> the regression test doesn't actually prove the new guarantee. It
> warms the connection first (a.libp2p.dial(...) on line 40), so
> router.send() can succeed even if the resolver is never consulted
> before the first dial, and there is no assertion on dialProtocol
> call order.

Correct — pre-warming meant peerStore was already populated via the
identify exchange, so even a no-op resolver wouldn't have caused
send() to fail. The test couldn't catch the PR-448-class bug it was
written to defend against.

Rewrite strategy:

1. Keep the peers cold (no pre-dial).
2. Assert peerStore for B is empty BEFORE routerA.send().
3. Wire the resolver to populate peerStore with B's real addr when
   called.
4. Send succeeds → only possible if the resolver ran before
   dialProtocol. (Otherwise libp2p has nothing to dial.)
5. Assert peerStore for B is populated AFTER send.
6. Assert resolver.resolve was called exactly once.

The "send-succeeded-against-cold-peerStore" property is the
structural proof of ordering — no dialProtocol instrumentation
needed (which had teardown-race issues in an earlier attempt).

Split out the existing AbortSignal + perStepTimeoutMs assertion into
its own test so each test asserts one property; that test still
pre-warms (it's testing arguments, not the priming side effect).

Tests:
- core protocol-router-resolver 4/4 (was 3, +1 for the cold-peer
  ordering rewrite; the "AbortSignal+perStepTimeoutMs" assertion
  split into its own test)

Co-authored-by: Cursor <cursoragent@cursor.com>
…esolver (Codex PR #497 round 5)

Codex review feedback: keeping peerResolver optional makes the cold-
peer priming guarantee implicit. Any future `new ProtocolRouter(node)`
on an outbound path would compile and silently skip priming,
reintroducing PR-448's class of relay-route failures.

Two mitigations are now in place:
  1. CI grep gate (`scripts/audit-dial-protocol.mjs` from PR #499)
     bans raw `dialProtocol(peerId)` calls outside an allowlist, so
     any new outbound path is forced through ProtocolRouter.
  2. (this commit) `send()` emits a one-time `console.warn` the
     first time it runs without a peerResolver, surfacing
     misconfiguration loudly at first cold dial rather than silently
     regressing.

The full enforcement (`peerResolver` becoming required) is intentionally
deferred — it would be a breaking change to test fixtures and edge
callers that legitimately route over warm connections only. The
warn+grep combo turns the silent-bypass risk into a build-time +
first-call-time surface.

Adds a test that asserts: (a) the warn fires, (b) only once across
multiple sends, and (c) references RFC 07 / PR #448 in the message.

Co-authored-by: Cursor <cursoragent@cursor.com>
This is the last structural piece of RFC 07 — after this lands, every
peer-dial path in the daemon goes through PeerResolver, and the CI
grep gate prevents anyone from adding a new dialProtocol( site that
silently bypasses the resolver.

Changes:

- packages/agent/src/dkg-agent.ts: connectToPeerId no longer hand-rolls
  a DHT walk + peerStore merge; it delegates to peerResolver.resolve(),
  which runs the full RFC 07 §3.1 order (live conn → DHT → RFC 04
  registry stub → agents-CG fallback → bootstrap seeds). The agents-CG
  fallback in particular is a new capability /api/connect didn't have:
  a peer whose DHT record is stale but whose relay is still advertised
  in the agents context graph is now reachable. Saves 80 LoC.

  Known regression vs PR #431: the legacy walk distinguished
  DHT_TIMEOUT (504) and DHT_UNAVAILABLE (503) from PEER_NOT_FOUND
  because the inline path saw the underlying error shape. The resolver
  is best-effort and swallows per-step errors (returns []), so both
  transient cases now collapse into 404. The 502 "addrs found but dial
  failed" distinction is preserved. If reviewers want the 503/504
  codes back, the follow-up is a resolveWithDiagnostics method on
  PeerResolver — out of scope here.

- scripts/audit-dial-protocol.mjs: walks every production .ts/.js
  source file under packages/, fails the build if dialProtocol(
  appears outside the explicit allowlist (LibP2PNetwork wrapper +
  ProtocolRouter). Reuses the bypass-resistant comment/string lexer
  from audit-create-random.mjs so dialProtocol( inside comments,
  strings, or regex literals doesn't false-positive.

- scripts/audit-dial-protocol.test.mjs: 7 node:test cases for the
  audit's regex / lexer integration (split invocations, comments,
  strings, multi-hit, no false-positives on dial / dialMultiselect).

- .github/workflows/ci.yml: wires the audit + its test into the
  build job's pre-install audits, alongside the existing
  audit-create-random gate.

Verification:
- `node scripts/audit-dial-protocol.mjs` — passes (2 allowlisted, no
  violations elsewhere)
- `node --test scripts/audit-dial-protocol.test.mjs` — 7/7 pass
- `pnpm --filter dkg-core test` — 651/651
- `pnpm --filter dkg-agent test` — 429/429
- `pnpm --filter dkg-agent exec vitest run test/e2e-dht-dial.test.ts`
  — 3/3 (validates connectToPeerId via PeerResolver against real
  libp2p nodes; new log line "Resolving X via PeerResolver..." →
  "Resolved X → 3 addr(s); dialling..." confirms the new path)
- `pnpm -r build` — clean across all 14 packages

Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §3.2
Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #499:

1. connectToPeerId was passing `timeoutMs` as the resolver's per-step
   budget AND reusing it for the final libp2p.dial(), so a slow DHT
   walk plus a slow dial could exceed the caller's deadline by ~2x.

   Fix: build one AbortSignal at the top, thread it through both the
   resolver call and the dial. The resolver already supports `signal`
   (added in PR-3 fixup). perStepTimeoutMs is now `max(0, remaining)`
   so a near-expired budget doesn't get re-extended for the next step.

2. The audit-dial-protocol regex only matched `.dialProtocol(`, leaving
   `?.dialProtocol(` and bracket-access forms (`["dialProtocol"](...)`,
   `?.["dialProtocol"](...)`, etc.) as easy bypasses. Since the script
   is supposed to enforce RFC 07 §3.2 structurally, those forms have
   to be caught too.

   Fix: split detection into two regexes:
   - MEMBER_ACCESS_RE catches `.dialProtocol(` and `?.dialProtocol(`
     after lexer-stripping comments+strings.
   - BRACKET_ACCESS_RE scans the ORIGINAL text for the bracket form
     (any quote style: ", ', or `), then filters to positions where
     the opening `[` survives in the stripped text (so commented-out
     and in-string occurrences are excluded). Codex's specific
     suggestion was to add regression tests alongside the existing
     newline/comment cases — done.

Tests:
- 15/15 in audit-dial-protocol.test.mjs (was 7, +8 new for the
  bracket/optional/comment-bracket/string-bracket/non-target-bracket
  cases)
- audit-dial-protocol still OK against the production tree
  (2 allowlisted, no violations)
- e2e-dht-dial 3/3 (validates connectToPeerId with single deadline)

Note: PR #499's third Codex finding (PEER_NOT_FOUND unreachable when
bootstrapPeers configured) is fixed by PR #496's fixup that drops
the bootstrap-seeds step from the resolver. This branch picks that up
via rebase.

Co-authored-by: Cursor <cursoragent@cursor.com>
…499)

Codex review feedback on PR #499 (round 2):

1. Bracket regex bypassable with comments inside the brackets.
   `libp2p[/*x*/"dialProtocol"](...)` and
   `libp2p["dialProtocol"/*x*/](...)` are valid JS and the round-1
   regex required only whitespace between `[`, the string, and `]`,
   so both forms walked past the audit. Round 2 replaces the inline
   `\s*` with a `(?:\s | block-comment | line-comment)*` repetition
   so any combination of whitespace and inline comments is accepted.

   Added 5 regression tests covering:
   - comment between [ and string
   - comment between string and ]
   - comments on both sides
   - line-comment + newline interleaved
   - bracket form fully inside an outer block comment (still ignored)

2. Per-line dedup was wrong. Two distinct dialProtocol( calls on the
   same line collapsed into one hit, so an allowlisted file with
   `expectedHits: 1` could silently grow a second raw dial on the
   same line without failing CI. Round 2 dedupes by source INDEX
   instead, and added two regression tests:
   - two member-access calls on the same line → 2 hits
   - one member + one bracket on the same line → 2 hits

3. Doc-comment quoted block-comment delimiters in code examples,
   which terminated the JSDoc block prematurely (parse error). Spelled
   out the delimiters in prose to keep the file self-parsing.

Tests:
- 22/22 in audit-dial-protocol.test.mjs (was 15, +7 round-2 cases)
- production tree audit still OK (2 allowlisted, no violations)

Co-authored-by: Cursor <cursoragent@cursor.com>
… doc (Codex PR #499)

Codex review feedback on PR #499 (round 3):

1. Audit still misses optional-CALL forms.
   `libp2p.dialProtocol?.(...)` and `libp2p["dialProtocol"]?.(...)`
   both execute the raw dial path while bypassing the round-2
   detector. The `?.` between the property access and the open paren
   is a different position from the property-access optional chain
   the round-2 fix covered.

   Fix: extend MEMBER_ACCESS_RE and BRACKET_ACCESS_RE to optionally
   accept `?.` after the property/bracket and before the `(`.
   Added 4 regression tests:
   - foo.dialProtocol?.(
   - foo?.dialProtocol?.(
   - foo["dialProtocol"]?.(
   - foo?.["dialProtocol"]?.(

   All pass; production tree audit still OK (2 allowlisted, no
   violations). Audit suite now: 26/26 (was 22).

2. dkg-agent.ts docstring still mentioned "bootstrap seeds" as a
   resolver step. PR #496's fixup dropped that step entirely, so
   the docstring was now misleading future readers about a
   nonexistent fallback.

   Fix: removed "bootstrap seeds" from the resolution-order list,
   removed it from the PEER_NOT_FOUND error description, and added
   a short note explaining WHY the step was removed (cite Codex
   feedback on PR #496) so a future contributor doesn't try to
   re-add it. Bootstrap remains a libp2p-startup concern.

Co-authored-by: Cursor <cursoragent@cursor.com>
…(Codex PR #499 round 5)

Codex review feedback: PR #499's resolver migration collapsed all
"resolver returned []" cases into PEER_NOT_FOUND. Since /api/connect
upstream maps 404 to a terminal "wrong peer id" outcome and 504 to a
retriable infrastructure error, this turned every DHT timeout / abort
/ network blip into a permanent input error from the UI's perspective
— a public-API regression vs PR #431.

Fix:
- `connectToPeerId` now distinguishes "resolver returned [] AND signal
  aborted" → CONNECT_TIMEOUT (504, retriable) from "resolver returned
  [] cleanly" → PEER_NOT_FOUND (404, terminal).
- `agent-chat.ts` route handler maps the new CONNECT_TIMEOUT code to
  HTTP 504 alongside the legacy DHT_TIMEOUT, so the UI's existing
  retry logic works without changes.

The deeper 503 (DHT_UNAVAILABLE / PEER_ROUTING_UNAVAILABLE) distinction
still requires a `resolveWithDiagnostics` API on PeerResolver and is
left as the documented follow-up — see RFC 07 §3.3 and the updated
`connectToPeerId` docstring.

Co-authored-by: Cursor <cursoragent@cursor.com>
…TP semantics

The existing devnet-test.sh covers SWM/VM, publishing, gossipsub, sub-graphs,
conversation turns and free-CG flows across 27 sections (~2100 lines), but
none of them exercise the RFC 07 /api/connect resolver path or its HTTP
mapping. SECTION 28 closes that gap with seven scenarios:

  28a INVALID_PEER_ID  → HTTP 400 + code=INVALID_PEER_ID
  28b SELF_DIAL        → HTTP 400 + code=SELF_DIAL
  28c PEER_NOT_FOUND
      OR CONNECT_TIMEOUT  → HTTP 404 OR HTTP 504
      A fresh Ed25519 peerId is generated via libp2p so the format always
      parses on the installed version. Both 404 (resolver completes
      cleanly) and 504 (signal aborts, RFC 07 PR #499 round 5) are
      acceptable; anything else is a regression in the route's switch.
  28d Cold-peer dial via PeerResolver — disconnect Node3 from Node1's
      perspective, then /api/connect by peerId only. Asserts the
      resolver primes peerStore so dialProtocol succeeds without the
      caller passing a multiaddr.
  28e Idempotent fast-path — re-call /api/connect on the just-connected
      peer, asserts <= 3s response (resolver step-1 live-conn check).
  28f Empty body  → HTTP 400 (route-level guard)
  28g Audit gate  — runs audit-dial-protocol.mjs locally so a regression
      that re-introduces a raw libp2p.dialProtocol(peerId, ...) outside
      the allowlist surfaces as a devnet-side smoke test, not just at
      PR-merge time in CI.

Verified live against a 6-node devnet (4 core + 2 edge): all 7 PASS.
The §28c run produced HTTP 504 + CONNECT_TIMEOUT in 15s with a freshly-
generated ghost peerId — concrete proof that the round-5 transient-vs-
terminal split works end-to-end through the HTTP boundary, not just in
the unit test.

Co-authored-by: Cursor <cursoragent@cursor.com>
…art resilience (PR #499 follow-up)

Builds on SECTION 28 with three additional RFC 07 coverage scenarios
plus ergonomic wins for the test harness:

  §28h — Legacy /api/connect {multiaddr} form. Pins the non-resolver
         branch in agent-chat.ts so a regression that breaks the
         legacy direct-dial path while editing the resolver branch
         can't slip past CI. Multiaddr extracted from daemon log
         (filtered by node's own peerId to skip the bootstrap-relay
         line).

  §29  — Cross-node /api/connect matrix. Every node /api/connect's
         to every other node by peerId alone (N=6 → 30 pairs).
         Verifies the resolver wiring is uniformly installed across
         the whole cluster — catches "one node missing the resolver
         wiring" init-order bugs that §28d (Node1 → Node3 only)
         can't see. Honest about the mDNS warm-cache caveat in the
         section preamble; emits [SLOW] markers for any pair that
         takes > 2s as a perf-regression canary.

  §30  — Restart resilience. Kills Node1, restarts it with the same
         DKG_HOME, waits for /api/status to come back, then dials
         Node3's peerId from the freshly-restarted Node1. Verifies
         the resolver+API survive a full process restart end-to-end.
         Honest scope note in the preamble: on loopback devnet, mDNS
         reconverges fast enough that this is closer to a
         "restart-and-still-works" smoke test than a true cold-path
         test of the deeper resolver steps. The unit tests
         (peer-resolver.test.ts) cover the cold-path step walk in
         isolation; this section's unique value is end-to-end process
         restart. Destructive — gated by SKIP_RESTART=1 and runs LAST.

Harness ergonomics:
  - SCRIPT_T0 + total-elapsed line in TEST SUMMARY (e.g. "Elapsed:
    215s") so a "this used to take 3 minutes" regression is visible.
  - section_start / section_done helpers wired into §28-30 (per-section
    timing). Existing 27 sections aren't retrofitted — would need
    wrapping each in a function.
  - SKIP_RESTART=1 / SKIP_MATRIX=1 env vars for fast iteration.
  - RESTART_BOOT_TIMEOUT_S=60 default for §30d.

Bash 3 compat: macOS ships bash 3.2, so §29 uses a sparse indexed
array (PEERID_BY_PORT[$port]=...) instead of `declare -A` — same
effect since port numbers are integers.

Tested on 6-node devnet (PR #499 branch + this commit on top):
  §28a-h: 8/8 PASS
  §29   : 30/30 pairs reachable, 0 slow, 0 failed
  §30a-f: 6/6 PASS, restart took 5s, mesh restored to 6 peers

Co-authored-by: Cursor <cursoragent@cursor.com>
Branimir Rakic and others added 8 commits May 13, 2026 22:28
… gate; pin §28c/§30 (Codex PR #499 round 5)

Three Codex review items addressed:

1. dkg-agent.ts dial-phase abort detection (Codex line 4096)
   The shared AbortSignal now covers BOTH resolution and dial. If
   most of the budget went into resolve() and dial() then aborts on
   the same signal, we previously misclassified as DIAL_FAILED (502,
   transport failure). Now we detect signal.aborted / AbortError /
   ABORT_ERR in the dial-phase catch and rethrow as CONNECT_TIMEOUT
   (504, retriable) — preserving the timeout-vs-transport-failure
   distinction across the entire connect path.

2. audit-dial-protocol.mjs aliasing/destructuring bypass (Codex line 170)
   Earlier rounds caught CALL forms (member, optional-chain, bracket,
   optional-call). But aliasing patterns leave no `dialProtocol(`
   token at the call site:
     const fn = libp2p.dialProtocol.bind(libp2p);
     const { dialProtocol } = libp2p;
   Round 4 widens the detector to flag any occurrence of the bare
   `dialProtocol` IDENTIFIER token in stripped (comments/strings
   blanked) text — single regex `\bdialProtocol\b` covers every
   call/alias/destructure form. Bracket-string access (`["dialProtocol"]`)
   still needs the separate BRACKET_ACCESS_RE because the stripper
   blanks string contents.

   Allowlist updated for the broader detection: definition sites
   count too, so:
     - network.ts: 1 (interface declaration)
     - libp2p-network.ts: 2 (impl method definition + libp2p call)
     - protocol-router.ts: 1 (the resolver-consulting call) — unchanged

   33/33 audit-dial-protocol.test.mjs pass (8 new tests for aliasing,
   destructuring, function-arg destructuring, bare reference,
   word-boundary protection of unrelated identifiers).

3. devnet-test.sh §28c (line 2202) + §30 (line 2502)
   §28c was accepting BOTH 404 and 504 for the ghost peer — a
   regression that turned every clean miss into 504 would have passed.
   On this 6-node devnet, the deterministic outcome is 504 (DHT walk
   keeps trying for the full timeoutMs budget for a non-existent peer
   ID). Pinned to 504 only; 404 case is covered by the unit test
   `peer-resolver.test.ts: returns empty array when nothing resolves`.

   §30e was downgrading CONNECT_TIMEOUT/PEER_NOT_FOUND to warn,
   meaning the test silently passed when post-restart connectivity
   was broken — defeating the section's purpose. Strict mode: retry
   /api/connect every 1s for up to 30s (RESTART_RETRY_BUDGET_S
   override), pass on first 200, FAIL hard if it never recovers.
   This absorbs the legitimate mDNS/DHT warmup window without going
   silent on real regressions.

Co-authored-by: Cursor <cursoragent@cursor.com>
… PR-5)

Last of the RFC 07 stack. Replaces the upstream gossipsub msgIdFn
(which uses key+seqno for signed messages, sha256(data) for unsigned)
with a single content-deterministic constant the network commits to:

  msgId(topic, payload, fromIdentityId)
    := sha256(topic ‖ payload ‖ fromIdentityId)

  fromIdentityId = signed → from.toMultihash().bytes
                   unsigned → ∅ (empty)

Why now, behaviour-invisibly: v1 ships only one gossip backend
(LibP2PGossipBackend), so msgIds change but no consumer compares them
across implementations today. The point of locking in the constant
*before* a second backend exists (iroh-gossip, anything else RFC 07
might license) is that the wire-format change has to land coordinated
across the entire network — adding a different-but-incompatible msgId
later would force a synchronised mid-flight upgrade. Doing it now is
free; doing it later is operationally painful at scale.

Changes:
- packages/core/src/network/gossip-msg-id.ts: dkgGossipMsgId function;
  total over the signed/unsigned Message union; returns 32-byte sha256.
- packages/core/src/network/index.ts + src/index.ts: re-exports.
- packages/core/src/node.ts: passes msgIdFn to gossipsub() construction.
- packages/core/test/gossip-msg-id.test.ts: 8 cases pinning the encoding,
  covering signed/unsigned/UTF-8/dedup-by-content/no-leak-of-seqno-or-sig.

Tests:
- core 659/659 (+8 new)
- agent 429/429 (msgId change is content-equivalent for V10 today;
  no regressions across the gossip-heavy e2e suite)
- audit-dial-protocol still OK (no new dialProtocol sites)
- pnpm -r build clean

Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §5.4
Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #501 caught two bugs in the v0 encoding
that would have shipped if merged as-is:

Bug 1 — length collision
------------------------
Raw concatenation `topic ‖ payload ‖ from` is not injective on the
input tuple. With the same `from`, ("ab","c") and ("a","bc") hash
the same bytes and dedup as one message even though they're distinct
publishes.

Fix: u32_be length prefix per field. Distinct (topic, payload, from)
tuples now hash to distinct inputs.

Bug 2 — retries broken
----------------------
Pure content-derived msgIds dedup the same peer publishing the same
payload twice. Gossipsub uses msgId for cache-based dedup; without
seqno in the hash, a network blip + retry produces the same msgId
and the retry is silently dropped. The upstream `msgIdFnStrictSign`
includes seqno precisely for this reason.

Fix: include the gossipsub sequenceNumber for signed messages (u64_be
suffix). Unsigned messages use 0n (V10 doesn't ship unsigned today;
the branch exists so the function is total over the Message union).

Final encoding pinned by tests:

    msgId(topic, payload, fromIdentityId, seqno) :=
        sha256(
          u32_be(len(topic))           ‖ topic
          ‖ u32_be(len(payload))        ‖ payload
          ‖ u32_be(len(fromIdentityId)) ‖ fromIdentityId
          ‖ u64_be(seqno)
        )

Tests:
- core 11/11 in gossip-msg-id (was 8, +3 new for the codex-bug
  regressions: collision, retry, seqno-determinism)
- core 662/662 full suite (no other test relied on the old encoding)

Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #501 (round 2) identified a real
rolling-upgrade hazard:

  Gossipsub puts msgIds into its IHAVE/IWANT control protocol, so
  during a rolling upgrade where some nodes still use the upstream
  default `msgIdFnStrictSign` and some use `dkgGossipMsgId`, the two
  halves of the mesh compute different IDs for the same payload and
  stop correlating cache entries. Push delivery still works (the
  payload bytes on the wire are unchanged); only the dedup cache
  fragments, producing extra IWANT/SERVE round-trips per message
  until the upgrade completes.

The original RFC 07 §5.4 justification was "behaviour-invisible
today, locks in the constant before a second backend exists" — the
"behaviour-invisible" half was wrong.

Decision: defer the actual wiring. v1 ships:

- The `dkgGossipMsgId` function (length-framed, includes seqno).
- 11/11 encoding tests pinning the constant.
- Public re-exports so external code can reference the constant.

…but does NOT flip `msgIdFn` in `node.ts` yet. The flip is staged
with a coordinated mesh-wide upgrade, most likely combined with a
gossipsub protocol-version bump so old/new nodes segregate into
separate meshes during cutover instead of degrading a shared one.
Probably bundled with the introduction of a second gossip backend
(RFC 07 Phase 6), where the protocol-version bump is needed anyway.

Code changes:
- Removed the `msgIdFn: dkgGossipMsgId` line from `node.ts` and
  dropped the import. Replaced with a long inline NOTE pointing
  to the RFC and explaining why this is intentionally not wired.

Spec changes (filed separately in dkgv10-spec repo):
- Rewrote RFC 07 §5.4 to explain why wiring is deferred and what
  the cutover plan is.
- Updated Phase 5: "post-v1, coordinated" rather than
  "optional, ~zero risk".

Tests:
- core gossip-msg-id 11/11 (unchanged — function still ships)
- core build clean

Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #501 (round 3):

> This comment says the deferred wiring plan is tracked in
> `dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md`,
> but that path does not exist in this repo. That leaves future
> follow-up with no discoverable anchor.

True — the RFC 07 spec lives in a sibling repo (`dkgv10-spec`), not
in this one. Pointing at a path that doesn't resolve from this repo's
root would have led the next contributor on a wild goose chase.

Fix: reorganise the comment so the in-repo cross-references come
first (the function lives at packages/core/src/network/gossip-msg-id.ts
with tests at packages/core/test/gossip-msg-id.test.ts — both
discoverable from this repo today), and explicitly call out that the
spec doc is in the sibling `dkgv10-spec` repo. The summary above the
cross-references already covers the "why" so the wiring decision is
reviewable purely in-place.

Co-authored-by: Cursor <cursoragent@cursor.com>
…501)

Codex review feedback on PR #501 (round 4):

> For type: 'unsigned' this falls back to fromBytes = [] and seqno =
> 0n, so two different unsigned publishers sending the same payload
> on the same topic will produce the same msgId and one publish will
> be falsely deduplicated. Because this helper is now exported
> publicly, consumers can hit that collision even before core wires
> it in.

Real bug at the API surface level. The earlier draft handled
unsigned by zero-filling the publisher and seqno fields, which
preserves the upstream-default false-dedup property — fine if
nothing ever calls it with unsigned messages, but a footgun for
external consumers of the exported function.

V10 configures gossipsub with the StrictSign default, so unsigned
messages don't appear in the wild today. Throwing here makes the
"unsigned not supported in this msgId scheme" stance explicit:

  - any code path that tries to publish an unsigned message fails
    loudly (easy to debug),
  - external consumers of the exported function can't accidentally
    hit the false-dedup case,
  - a future PR that wants to support unsigned has to deliberately
    extend the scheme with a per-message identity (nonce / hash
    prefix / etc.), pinned by tests.

Implementation:
- New `DkgGossipUnsignedMessageError` class (also exported) with a
  message that explains the constraint and points at the workarounds.
- `dkgGossipMsgId` throws the error for `msg.type !== 'signed'` before
  building the hash input.
- Updated tests: replaced the "unsigned: length-framed sha256 with
  seqno=0n and empty from" case with "unsigned: throws
  DkgGossipUnsignedMessageError"; dropped the now-impossible
  "signed and unsigned with same topic/payload differ" case.

Tests:
- core gossip-msg-id 10/10 (was 11; net -1 because the
  signed-vs-unsigned diff test became unreachable)

Co-authored-by: Cursor <cursoragent@cursor.com>
…dex PR #501 round 5)

Codex review feedback flagged that the round-4 signature accepted a
libp2p `Message` directly, which made the "cross-backend dedup" framing
aspirational rather than concrete. A future iroh-gossip backend would
need to either duplicate the framing logic or shoehorn its messages
through the libp2p shape.

Split:
- `dkgGossipMsgIdRaw({ topic, data, publisherIdBytes, sequenceNumber })`
  is the backend-agnostic primitive; framing + sha256 lives here once.
- `dkgGossipMsgId(msg: libp2p.Message)` is now a thin adapter that
  unwraps libp2p shape into raw inputs and enforces signed-only.

Tests:
- Adds a FIXED VECTOR test that pins the exact 32-byte sha256 output
  for a known input, addressing Codex feedback that the existing
  `expected()` helper mirrors production logic and would mask
  encoding drift if both were mutated together.
- Adds direct tests for `dkgGossipMsgIdRaw` (length-framing,
  seqno-in-hash, 32-byte digest).
- Adds an adapter↔raw equivalence test proving the libp2p adapter
  delegates to the primitive on every signed message.

Marks the public API `@experimental` in JSDoc — the function is
intentionally unwired in v1 (see RFC 07 §5.4 + Phase 5 deferred
cutover) and downstream consumers shouldn't rely on the in-process
mesh routing through it yet.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ex PR #501 round 6)

The libp2p adapter (`dkgGossipMsgId`) rejects unsigned messages
because they have no publisher identity, but the raw primitive
(`dkgGossipMsgIdRaw`) used to silently accept the equivalent
`publisherIdBytes.length === 0` case — reopening the exact false-dedup
hazard the adapter was built to close. Any future backend or consumer
that forgot to plumb publisher-identity bytes through the raw
primitive would have shipped a quiet correctness bug at the
cross-backend boundary the primitive was meant to lock down.

Add `DkgGossipMissingPublisherError` (exported from network/index +
core/index) and throw it from `dkgGossipMsgIdRaw` on empty input.
Updated the "returns 32-byte SHA256" test to use a non-empty
publisher (the empty-publisher case is now an error, not a vector).
Added a regression test pinning the throw.

17/17 gossip-msg-id tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 5983 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

Comment thread packages/agent/src/dkg-agent.ts Outdated
return;
}

if (opts?.relayCapable === true) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving false/absent as a no-op makes the on-chain flag sticky. The daemon calls this at startup with relayCapable: config.relayCapable === true, so a core that once ran with relayCapable: true and then disables relay service keeps ProfileStorage.relayCapable == true forever unless an operator manually flips it. That stale registry entry means future resolver/attestation logic can continue treating an opted-out node as a relay candidate. Since this method is documented as syncing the intended relayCapable flag, please compare current with the desired value and call setRelayCapable(false) when config is explicitly false (or preserve undefined vs false at the lifecycle call if manual flips must remain untouched).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in a4eb9df.

publishRelayRegistry now has explicit tri-state semantics:

  • true → ensure on-chain flag is true (idempotent)
  • false → ensure on-chain flag is false (actively clears stale prior opt-in)
  • undefined → no-op (preserves manual dkg admin set-relay-capable flips)

The lifecycle call site (packages/cli/src/daemon/lifecycle.ts:772) was the bigger problem — it was coercing config.relayCapable === true, destroying the undefined-vs-false distinction the resolver method was trying to support. It now passes config.relayCapable directly.

Updated:

  • packages/agent/src/dkg-agent.ts — branch on both true and false, anything else (including non-boolean misconfig) is treated as "no opinion"
  • packages/cli/src/daemon/lifecycle.ts — drop === true coercion, pass through
  • packages/cli/src/config.ts — docstring updated to document tri-state
  • packages/agent/test/publish-relay-registry.test.ts — 12 regression tests (fresh-flip, idempotent both directions, false-clears-stale, undefined-no-op, opts omitted, non-boolean ignored, no chain support, no profile, getIdentityId throws, no getRelayCapable, setRelayCapable rejects)

All 440 agent tests pass; both cli and agent packages typecheck clean.

…dex PR #506)

The previous version only acted on `opts.relayCapable === true`, treating
both `false` and `undefined` as a no-op. The lifecycle call site coerced
`config.relayCapable` to `=== true`, so a node that had ever started with
`relayCapable: true` would keep advertising relay capability on-chain
forever — even after the operator removed the flag from config. Stale
opt-ins meant future resolver / attestation logic could keep treating
opted-out nodes as relay candidates.

Tri-state semantics now (explicit per Codex review):
  - `true`      → ensure on-chain flag is true (idempotent, flips if false)
  - `false`     → ensure on-chain flag is false (idempotent, flips if true,
                  actively clearing stale prior opt-in)
  - `undefined` → leave on-chain alone, preserving any manual
                  `dkg admin set-relay-capable` flips

Changes: - `publishRelayRegistry` (dkg-agent.ts) checks both true and false
    explicitly; rejects non-booleans as "no opinion".
  - Lifecycle now passes `config.relayCapable` directly without `=== true`
    coercion, so the agent sees the operator's true/false/undefined intent.
  - Config docstring updated to document the tri-state.
  - 12 new regression tests in `publish-relay-registry.test.ts` pinning
    every branch (fresh true, idempotent true, false clears stale,
    idempotent false, undefined no-op, opts omitted, non-boolean ignored,
    no chain support, no profile, getIdentityId throws, no
    getRelayCapable, setRelayCapable rejects).
Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review skipped: filtered diff is 6173 lines (cap: 5,000). Please consider splitting this into smaller PRs for reviewability.

@branarakic branarakic merged commit 5038e92 into main May 13, 2026
41 checks passed
branarakic pushed a commit that referenced this pull request May 13, 2026
Lockstep bumps every workspace package from 10.0.0-rc.6 → 10.0.0-rc.7
so a `pnpm install dkg-v10@10.0.0-rc.7` resolves to a coherent set.

`network/testnet.json#chainResetMarker` is intentionally NOT bumped —
rc.7 ships node-only changes (no contract redeploy), so the existing
`v10-rc6-pca-author-attestation-2026-05-10` marker is correct and
on-chain state must NOT be wiped on first boot.

CHANGELOG entry summarises rc.7 contents:
  - RFC 07 in-process PeerResolver (consolidated #494/#496/#497/#499/
    #501/#506) — every libp2p dial flows through one resolution chain;
    `/api/connect` returns precise HTTP codes (504/502/400) with
    structured error bodies; CI grep gate prevents bypass.
  - Backend-agnostic dkgGossipMsgId primitive (#501) — unified gossip
    message-id derivation across real libp2p and in-memory test bus.
  - devnet-test.sh sections 28-31 (#501/#506/#507) — resolver path,
    cross-node matrix, cold-restart resilience, full invite/join
    end-to-end. Knobs: SKIP_RESTART, SKIP_MATRIX, SKIP_INVITE_FLOW.
  - PCA wiring at registration (#502, rebase of #423) — PCA agents
    can publish to curated CGs without 500 fallthroughs; non-existent
    PCA accounts → clean 404; publishPolicy exposed on
    ApiClient.createContextGraph for dkg-node-ui.
  - relayCapable tri-state on chain (#506) — true ensures, false
    actively clears, undefined leaves untouched (preserves manual
    `dkg admin set-relay-capable` flips). 12 regression tests added.
  - SECTION 31 invite/join e2e in devnet smoke suite (#507) —
    asserts DKG_CURATOR triple, accept+persist log path,
    denied-without-phantom semantics, and post-approval _meta
    mirroring.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant