feat: Network Relay Registry (Phase 1) + RFC 07 In-Process PeerResolver (PRs #471, #494, #496, #497, #499, #501 bundled)#506
Conversation
…461) Phase 1A of the Network Relay Registry rollout. Extends `ProfileInfo` with two on-chain fields the rest of Phase 1 relies on: - `bool relayCapable` — operator-controlled opt-in that this node will serve circuit-relay traffic for the network. Phase 3 NetworkStateRegistry prefers chain-attested relays during peer discovery. - `string[] multiaddrs` — externally-reachable libp2p multiaddrs the node currently advertises. Phase 3 edges resolve circuit-relay paths from this set instead of the hardcoded network/*.json relay list. `Profile.sol` exposes two `onlyIdentityOwner`-gated entry points (`updateRelayCapable`, `updateMultiaddrs`) so a running node with an operational key can refresh its multiaddrs each restart without admin signing. `updateMultiaddrs` enforces modest bounds (≤8 entries, ≤256 bytes each) and emits `MultiaddrsUpdated` for off-chain consumers. Version bumps signal the storage-layout extension to operators: - ProfileStorage 1.0.0 → 1.1.0 (struct extended at the tail; existing keys read new fields as zero/empty). - Profile 1.1.0 → 1.2.0 (new entry points). 13 new Solidity unit tests covering happy-path, owner gating, bound violations, wholesale-replacement semantics, and the extended `getProfile` tuple. Co-authored-by: Cursor <cursoragent@cursor.com>
…#461) Phase 1B of the Network Relay Registry rollout. Surfaces the on-chain relay-registry fields added in Phase 1A through `ChainAdapter` so the agent / publisher / future NetworkStateRegistry don't have to reach into the Profile ABI directly. Four new optional methods on `ChainAdapter`: - `getRelayCapable(identityId)` / `getMultiaddrs(identityId)` — read any node's chain-attested relay state. - `setRelayCapable(bool)` / `setMultiaddrs(string[])` — publish this node's own state; resolves identityId from the bound signer. `EVMChainAdapter` resolves `ProfileStorage` (tolerated as optional so older Hub deployments still init), implements all four methods, and adds `MultiaddrsUpdated` + `RelayCapabilityUpdated` to `listenForEvents` so Phase 3 NetworkStateRegistry can refresh its cache without a follow-up storage read per peer. `MockChainAdapter` keeps an in-memory mirror keyed by identityId, with the same client-side bound checks (≤8 entries, ≤256 bytes each) so adapter-parity tests catch drift on either side. Events flow through the generic `pushEvent` / `listenForEvents` pipe. Tests: - 9 mock-adapter tests covering reads, writes, idempotence, bound violations, and the event surface. - 8 EVM-adapter tests against a real Hardhat node (round-trip, wholesale-replacement, fast-fail, event roundtrip). Existing chain test suite (231 tests, including ABI pinning) stays green. Co-authored-by: Cursor <cursoragent@cursor.com>
…4 / #461) Phase 1C of the Network Relay Registry rollout. Wires the chain-adapter surface from Phase 1B into the daemon startup so operators get on-chain relay attestation for free — every restart pushes the node's current externally-reachable multiaddrs (and optionally the relayCapable flag) to its on-chain Profile. `DKGAgent.publishRelayRegistry({ relayCapable })` is the new method: - Filters libp2p `multiaddrs` down to the externally-dialable subset (public IPv4/IPv6/DNS plus /p2p-circuit/ for NAT'd edges); local / RFC1918 / link-local entries are dropped. - Idempotent: compares the candidate set against what's already on chain and skips the tx when they match. Safe to call every restart. - Bound-clamped to MAX_MULTIADDRS=8 (mirrors ProfileLib). - Best-effort and no-throw: missing chain config, no profile, adapters without the relay-registry surface, and chain RPC errors are all logged-and-skipped rather than crashing the daemon. `isPublicLikeAddress` / `isLocalOrInternalHostname` are re-exported from `@origintrail-official/dkg-core` so the agent can reuse the same classifier the existing daemon dialability log uses. Daemon wiring fires the publish on a non-blocking setTimeout(0) right after the relay-reservation log, so chain RPC slowness can't gate daemon ready. New `relayCapable?: boolean` flag on `DkgConfig` lets operators opt into advertising relay service (off by default — relay provision has bandwidth/availability implications). 8 unit tests pin the filter behaviour (loopback / RFC1918 / link-local rejection, /p2p-circuit/ acceptance, dedup, defensive against runtime garbage). Full agent test suite (439 tests) stays green. Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end soak harness for the 6-node devnet (4 core + 2 edge):
- Bootstraps 6 random delegator wallets to reach 10 total staking wallets
(4 op-wallets from devnet.sh staking 50k each + 6 fresh delegators
staking 25k each on cores 1-4 in a 2/2/1/1 split). Asymmetric stake
on purpose so RS score-per-stake variance is observable.
- Drives a positive workload: round-robin `dkg publish` across both
registered CGs and across all 4 core nodes via the existing CLI surface
(no chain-direct shortcuts beyond the bootstrap setup that devnet.sh
already does for staking).
- Runs three background loops for the duration:
* publisher loop — N-Quad publishes every PUBLISH_INTERVAL_SEC
* observer loop — snapshots every node /api/random-sampling/status,
/api/status, /api/connections + on-chain
getNodeEpochScore / getNetNodeEpochRewards every
SNAPSHOT_INTERVAL_SEC
* event listener — polls RandomSamplingStorage for
EpochNodeValidProofsCountIncremented +
NodeEpochScoreAdded + ActiveProofPeriodStartBlockSet
(the actual on-chain proof signal — RandomSampling
itself does not emit a ProofSubmitted event)
- Negative scenarios woven in:
* publish to unregistered CG -> expect HTTP 4xx
* edge nodes 5+6 must report `enabled=false` and submittedCount=0
across every snapshot
* mid-soak: gracefully bounce one core for ~120s, assert API recovers
and prover resumes (skip with SOAK_SKIP_RESTART=1)
- Final report at OUT_DIR/findings.md with pass/warn/fail breakdown,
per-core stake/score/reward table, and per-delegator rolling rewards.
Exits non-zero on any FAIL.
Also shortens the development Chronos.epochLength from 30 days to 30
minutes. The default makes a 2h soak observe 0.28% of an epoch — neither
setNetNodeEpochRewards nor delegator rolling rewards ever fire, so the
test cannot exercise reward distribution. 30 minutes lets a 2h run cross
~3 epoch boundaries and watch the full accrue -> finalize -> claim cycle.
3-min smoke against a fresh devnet clears 20/20 assertions:
- 15/15 publishes confirmed on chain
- All 4 cores submit proofs (1->10, 2->4, 3->4, 4->2 over 4 minutes)
- 10 EpochNodeValidProofsCountIncremented events from 4 distinct ids
- Edge nodes 5+6 stay disabled across all snapshots
- Score asymmetry visible: high-stake core 1 score 1.87e18 vs low-stake
core 4 score 1.02e17 (both proving)
Companion to PR #471 (Network Relay Registry Phase 1).
Co-authored-by: Cursor <cursoragent@cursor.com>
Reverts the development.Chronos.epochLength change from 30d -> 30min that landed alongside the soak runner. The shorter epoch was intended to let a 2h soak observe epoch transitions, but it also broke unrelated Solidity tests (3 of 4 hardhat shards red on PR #471) that have hardcoded expectations on the canonical 30-day dev epoch. The soak script now reads the deployed Chronos.epochLength on startup and surfaces a clear instruction when the epoch is too long for the soak duration: edit parameters.json + clean+restart devnet manually. We deliberately do not patch parameters.json from the script because it lives on the Solidity test path; touching it implicitly would break CI for everyone running the soak in a side branch. Trade-off: with the default 30d epoch the 2h soak still validates RS prover ticks, on-chain submitProof events, in-epoch score accrual, and publish + edge-disable + restart-recovery. Reward DISTRIBUTION (setNetNodeEpochRewards + delegator-roll updates) is the one thing suppressed — operator runs the manual flow if they care about that. Co-authored-by: Cursor <cursoragent@cursor.com>
Round 1 of the 2h soak surfaced three issues with the test harness
itself (not the node code):
1. The heartbeat one-liner
pcount=$([ -f X ] && wc -l < X | tr -d ' ' || echo 0)
blew up at the END_TS boundary with `=NNNN: command not found`
because of bash operator precedence interactions between `&&`,
`|`, and `||` inside `$(...)` under load. Replaced with a plain
`safe_line_count` helper so the heartbeat survives weird filesystem
timing.
2. The RS-prover assertion read `submittedCount` from the LAST
snapshot, but the last 1-2 snapshots can be empty (`rs: {}`,
`uptime_ms: null`) when the agent or the snapshot fetcher is
mid-shutdown — a healthy core that submitted 31 proofs then
misreports as "never submitted". Switched to PEAK across the full
timeseries; non-empty `rs` snapshots are tracked separately and
reported as a soft warning.
3. The `epochScore`-growth assertion compared start vs. end of soak,
but `epochScore` resets to 0 at every epoch transition. With a
30-min epoch and a 2h window, the final snapshot lands in a fresh
epoch and reports all zeros. Switched to aggregating per-identity
total score from `NodeEpochScoreAdded` chain events across ALL
epochs in the soak — gives a clean stake-weighted asymmetry view
regardless of when the run starts/ends in the epoch cycle.
Re-running the new logic against R1's intact data: 22 ok / 1 warn /
0 fail. Stake-weighted asymmetry now visible: identity 1 (100k stake)
earned ~2.6× the score of identity 4 (75k stake).
Also clarified the `netNodeEpochRewards=0` "warning" — that field is
populated lazily on first reward claim (StakingV10 path), not at
epoch transition. The soak observes accrual via `epochScore` events,
which is the actual signal of reward-pipeline liveness.
Co-authored-by: Cursor <cursoragent@cursor.com>
…PI gap
The "OBSERVER N snapshots had empty rs:{}" warning was conflating two
unrelated cases: edge nodes that legitimately have `loop: null` (no
prover, role-gated), and core nodes where the agent API was briefly
unavailable. R1 reported "34 empty rs" mostly because of edges 5+6
appearing in every snapshot.
Now we only count core-node snapshots with the entire `rs` object
missing/empty (a real API transient), and break the count down per
node so the operator can see where the gap was. R1 re-evaluated:
node2=2, node3=20, node4=12 — node3 at ~19% gap rate is worth
keeping an eye on across rounds.
Co-authored-by: Cursor <cursoragent@cursor.com>
R2 finished with `started epoch: 8 → ended epoch: 13` (5 transitions) but the assertion misreported "no epoch transition observed". Cause: `stakeAfter.epoch` and `stakeBefore.epoch` are JSON strings, and the bare `>` operator does lexicographic comparison — "13" > "8" returns false. Switched to BigInt() for a numeric compare. R2 also surfaced an operational finding worth recording (no code fix needed): the in-process oxigraph backend (nodes 3+4) caused 54-66% API gap rates by R2, up from 11-19% in R1, as the KC corpus grew. The peak-aggregation logic correctly extracted the right answer across the gaps; the warning is informational about the backend trade-off, not a defect in the node. Co-authored-by: Cursor <cursoragent@cursor.com>
RFC 04 Network State Registry v0.3 narrows the on-chain Profile
surface: multiaddrs no longer live on Profile (they will live in
per-RS-round attestation KCs minted by submitProofV2 in Phase 2).
Reverts the multiaddrs storage path introduced earlier in this PR
while preserving the relayCapable flag and surrounding scaffolding,
which Phase 2 still needs.
Surgical revert only — net -733 lines:
Contracts (evm-module)
ProfileLib.sol: drop multiaddrs[] field, MAX_MULTIADDRS /
MAX_MULTIADDR_LENGTH constants, three multiaddr error types.
Keep relayCapable.
ProfileStorage.sol: drop MultiaddrsUpdated event,
getMultiaddrs/setMultiaddrs methods. getProfile no longer
returns string[]. Keep RelayCapabilityUpdated +
getRelayCapable/setRelayCapable.
Profile.sol: drop updateMultiaddrs entry point + its validation.
Keep updateRelayCapable.
Profile.test.ts: drop multiaddrs describe block; keep relayCapable
suite (6 tests). All 27 unit tests pass.
ABIs regenerated from compiled contracts.
Chain adapter
chain-adapter.ts: drop optional getMultiaddrs/setMultiaddrs from
interface.
evm-adapter.ts: drop concrete impl + MAX_MULTIADDRS constants +
MultiaddrsUpdated event handling.
mock-adapter.ts: drop multiaddrsByIdentity map + impl + event.
Adapter tests refactored to cover only the relayCapable round-trip
and event surface. All 229 chain tests pass.
Daemon
dkg-agent.ts: drop filterPublishableMultiaddrs helper. Slim
publishRelayRegistry() to flag-only sync (no candidate filter,
no clamp, no diff, no setMultiaddrs).
relay-registry-publish.test.ts: deleted (its only target is gone).
lifecycle.ts: comment updated to match flag-only semantics.
cli/config.ts: relayCapable doc updated to point at RFC 04 v0.3 §5.2.
core/index.ts: revert isPublicLikeAddress / isLocalOrInternalHostname
re-exports (only consumer in dkg-agent.ts is gone).
All 431 agent tests pass.
Doc cross-link
messenger.ts: comment now points at 04_NETWORK_STATE_REGISTRY.md
+ 07_IN_PROCESS_PEER_RESOLVER.md.
Validation:
npx hardhat compile: clean
pnpm -F dkg-core dkg-chain dkg-agent dkg build: all green
Lint: clean across all touched files
Profile.test.ts: 27/27
chain tests: 229/229
agent tests: 431/431
This is Phase 0 of RFC 04 v0.3 §8 migration plan. Phase 1 (relayCapable
flag + adapter scaffolding + daemon plumbing) was already complete on
this branch. Phase 2 (submitProofV2, attestation KCs) lands separately.
Co-authored-by: Cursor <cursoragent@cursor.com>
…PR-1) Introduces the transport-pluggable boundary RFC 07 §5 commits to: - packages/core/src/network/network.ts — Network, NodeIdentity, Address, DialOpts, ProtocolHandler. The minimum surface every dial-path consumer needs (sync, chat, skill invoke, /api/connect, …) so the application stops importing libp2p directly outside one location. - packages/core/src/network/libp2p-network.ts — LibP2PNetwork is a thin facade over an existing DKGNode; no state is duplicated, lifecycle is shared. - packages/core/test/libp2p-network.test.ts — exercises the contract against two real libp2p instances (dial round-trip, getConnections, addKnownAddresses, unhandle, lifecycle idempotency). This is purely additive. DKGNode.libp2p direct accessor stays during the RFC 07 transition; no consumer is migrated in this PR. Subsequent PRs introduce PeerResolver (PR-2) and migrate Messenger / ProtocolRouter / /api/connect (PR-3 / PR-4) to consume Network, ending in a CI grep gate that disallows raw dialProtocol(peerId) outside peer-resolver.ts. Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §5 Co-authored-by: Cursor <cursoragent@cursor.com>
…Network (Codex PR #494) LibP2PNetwork's first cut silently dropped two behaviours that the existing pre-RFC-07 dial path (ProtocolRouter) relied on. The first consumer migrated to Network.dialProtocol() / Network.handle() would have regressed on relay-only links and on async handler rejections. Two fixes: 1. dialProtocol() and handle() now both pass `runOnLimitedConnection: true` like ProtocolRouter.send / ProtocolRouter.register already do. V10 edge↔core traffic regularly traverses circuit-relay limited connections (RFC 04); without this flag, libp2p refuses to use a limited-only connection for protocol streams. Codex called this out on dialProtocol; same fix applies to inbound handle(). 2. handle() now awaits the user-supplied handler in a try/catch and stream.abort()s on failure. The previous wrapper fire-and-forgot the promise, so an async handler rejection became an unhandledRejection AND left the inbound stream half-open until libp2p's own timeout fired. Mirrors the try/catch+abort pattern in ProtocolRouter.register exactly. New test (handler-rejects/1.0.0) is a regression guard: a handler that intentionally throws after a 50ms barrier; the dialer's drain must terminate in well under 5s. Without the fix it hangs to the 10s vitest test timeout. 8/8 LibP2PNetwork tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>
Introduces the single in-process service every dial path will consult
before dialProtocol. PR-2 migrates Messenger as the first consumer;
ProtocolRouter and /api/connect follow in PR-3 and PR-4.
New (packages/core):
- src/network/network-state-registry.ts — NetworkStateRegistry interface
+ StubNetworkStateRegistry. Returns empty until RFC 04 Phase 2 lands;
the interface is fixed so the future PR can swap the impl without
touching consumers.
- src/network/peer-resolver.ts — PeerResolver implementing RFC 07 §3.1
resolution order:
1. Active-connection cache (network.getConnections)
2. DHT lookup (network.findPeer with per-step timeout)
3. NetworkStateRegistry (stubbed)
4. AgentDirectoryLookup (agents-CG SPARQL fallback; minimal interface
defined here so packages/core has no dependency on packages/agent)
5. Bootstrap seeds (only when 1-4 produced nothing)
Each step that finds addresses calls network.addKnownAddresses so the
transport's address book is primed before the caller dials. Returns
deduplicated, freshest-first; never throws.
- test/peer-resolver.test.ts — 13 cases covering the resolution order,
failure tolerance, dedup, and the stubbed health-scoring surface.
Migrated (packages/agent):
- src/p2p/messenger.ts — drops Libp2pLike + DiscoveryClient deps;
consumes PeerResolver instead. The chat-only inline patching path
(ensureCircuitRelayAddress) is now subsumed by step 4 of the
resolver — same SPARQL lookup, same /p2p-circuit/p2p/<peerId>
shape, but available to every dial path that consults the resolver.
- test/p2p-messenger.test.ts — slimmed to verify the structural
property (resolver fires before router.send) and timeout forwarding;
resolution-order cases moved to peer-resolver.test.ts where they
belong.
- src/dkg-agent.ts — wires LibP2PNetwork + PeerResolver +
StubNetworkStateRegistry, exposes peerResolver as a public field
so PR-3 / PR-4 can reuse the same instance.
Tests:
- 648/648 pass in core (+13 new)
- 429/429 pass in agent
Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §3
Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #496: 1. Bootstrap seeds are addresses of seed peers (curators/relays), NOT addresses for the requested target peerId. Returning them as candidates would either fail loudly (libp2p rejects /p2p/<seed> vs requested target peerId mismatch) or quietly mislead the caller. Bootstrap is a libp2p-startup concern (`bootstrap({ list })` peerDiscovery in node.ts) — not a per-peer resolution concern. Removed step 5 entirely, dropped bootstrapSeeds from PeerResolverDeps, updated dkg-agent.ts construction site, dropped two bootstrap-related test cases, kept the "returns empty array when nothing resolves" case (now without seeds in the picture). 2. Registry step (step 3) was the only lookup path not in try/catch, contradicting the documented "never throws" contract. Once the real NetworkStateRegistry lands (DB / mutex / SWM-backed), a transient hiccup would abort the whole resolve() and silently skip steps 4+. Wrapped registry.lookup() + addKnownAddresses() in the same best-effort handler the DHT and agents-CG steps use, added a regression test (registry throws → falls through to step 4 cleanly). Tests: - core 13/13 in peer-resolver.test.ts (was 13, refactored 2 cases, added 1 case) - agent build clean Co-authored-by: Cursor <cursoragent@cursor.com>
…nger Codex review feedback on PR #496 (round 2): 1. PeerResolver step 1 (`network.getConnections(peerId)`) was outside any try/catch. The class doc-comment promises `resolve()` never throws, but `LibP2PNetwork.getConnections` calls `peerIdFromString(peerId)` under the hood and that throws on a malformed peerId. Callers downstream (ProtocolRouter, connectToPeerId) would see an unhandled exception instead of an empty result they could surface as PEER_NOT_FOUND / DIAL_FAILED. Fix: wrap the live-conn lookup in the same best-effort handler the later steps use. A bad peerId now falls through, every later step fails identically, the resolver returns []. Added a regression test that simulates `getConnections` throwing. 2. Messenger.sendToPeer was firing `resolver.resolve(peer)` with no options, so a caller passing `opts.timeoutMs: 1000` could block 5s+ on the resolver's default per-step DHT timeout before the router.send dial loop even started. Fix: when timeoutMs is provided, pass an AbortSignal + perStepTimeoutMs through to resolve(). PR-3 of the RFC 07 stack moves this resolver call into ProtocolRouter and removes the inline call entirely; until then this passthrough keeps PR #496 sound on its own merits. Tests: - core peer-resolver 14/14 (was 13, +1 for live-conn try/catch) - core + agent build clean Co-authored-by: Cursor <cursoragent@cursor.com>
…budget passthrough Codex review feedback on PR #496 (round 3): 1. perStepTimeoutMs silently lost when opts.signal is set. `LibP2PNetwork.findPeer` honoured the signal and ignored the timeoutMs, so any caller passing both would silently lose the per-step cap. The exported `ResolveOpts` contract said both were real. Fix: compose a step-local AbortSignal that combines BOTH inputs (`AbortSignal.any([AbortSignal.timeout(perStepMs), opts.signal])`) and pass that to `network.findPeer`. The contract now matches runtime: per-step cap fires at min(perStepMs, time-until-outer- abort). 2. Outer signal not checked between resolver steps. PR #499 round 2 added a single AbortSignal to bound the whole `connectToPeerId` path, but only step 2 (DHT) propagated it to `findPeer`. After a DHT timeout the registry + agents-CG steps still ran unbounded, defeating the deadline guarantee. Fix: check `opts.signal?.aborted` before each step. An aborted signal returns whatever's been accumulated so far (or [] if nothing yet). Added two regression tests: pre-aborted signal skips all steps, and mid-resolve abort prevents later steps from running. 3. Messenger.timeoutMs double-spent (round-2 regression). Round 1 fix added `opts.timeoutMs` passthrough to `Messenger → resolver.resolve()` so the resolver wouldn't run unbounded. But `router.send()` then ALSO consumed `opts.timeoutMs` with a fresh budget, so cold-peer sends could block ~2x the advertised timeout. Fix: revert the round-1 passthrough. The proper fix would share a single AbortSignal across resolver + router.send, but `ProtocolRouter` doesn't accept an external signal today (it builds its own internally). Plumbing that through is its own refactor — and PR-3 of the RFC 07 stack already moves this code path INTO ProtocolRouter (where the budget can be shared without leaking through public surfaces) and reduces Messenger to a pass-through. This entire Messenger code path is transient; reverting the round-1 fix avoids the double-spend without needing a router refactor that PR-3 supersedes anyway. The pre-PR Messenger callers (chat / sync) already used generous timeouts (≥30s), so the resolver's 5s default doesn't push them over. Documented the trade-off in the call site. Tests: - core peer-resolver 17/17 (was 14, +3 round-3 tests) - core + agent build clean Co-authored-by: Cursor <cursoragent@cursor.com>
…Codex PR #496) Codex review feedback on PR #496 (round 4): 1. NetworkStateRegistry.lookup was sync (returned Address[]). The v1 stub is in-memory and could be sync, but the doc explicitly said "the real RFC 04 implementation may need to hit a database / mutex / SWM channel". Freezing the contract as sync now would either force the real impl to block the event loop or require a public-API breaking change in @origintrail-official/dkg-core later. Fix: change `lookup` to return `Promise<Address[]>`. The v1 stub returns a resolved promise instantly (no observable cost). Keeps the upgrade path open from day one. 2. AgentDirectoryLookup.findRelayForPeer had no signal param, so step 4 of PeerResolver couldn't cancel an in-flight SPARQL query when the outer signal aborted. The pre-step `signal.aborted` check added in round 3 prevents NEW step-4 work from starting once the deadline elapses, but a query that's already running would block until SPARQL returned naturally — defeating the `connectToPeerId({ timeoutMs })` end-to-end deadline guarantee. Fix: add an optional `opts.signal` param to `AgentDirectoryLookup.findRelayForPeer`. Resolver passes `opts?.signal` through. The legacy DiscoveryClient wrapper in dkg-agent.ts intentionally ignores it for now (DiscoveryClient doesn't expose cancellation today; wiring it through SPARQL is a follow-up that touches DiscoveryClient internals). The contract is best-effort — implementations that can cancel get the benefit; those that can't keep working as before. Pinned by a new test "step 4 (Codex PR #496 round 4): forwards opts.signal to agentDirectory.findRelayForPeer". Tests: - core peer-resolver 18/18 (was 17, +1 round-4 test, 1 existing test updated for the new findRelayForPeer signature) - core + agent build clean - no other consumers of the changed interfaces (only PeerResolver + tests + dkg-agent.ts construction site) Co-authored-by: Cursor <cursoragent@cursor.com>
…odex PR #496 round 5) Codex review feedback: the previous revision dropped opts.signal in the only production AgentDirectoryLookup, leaving the resolver's documented cancellation guarantee unhonored. A slow SPARQL findAgentByPeerId could outlive the caller's deadline. DiscoveryClient itself doesn't (yet) accept an AbortSignal, but honoring the contract at the adapter boundary is enough: the adapter races the lookup against the abort signal, so when the signal fires the resolver gets an immediate `null` and can move on or surface a timeout. The underlying SPARQL fetch leaks (completes in the background and its result is discarded), bounded by the discovery client's own internal timeout. Acceptable trade-off because: - RFC 04 Phase 2 replaces this fallback path entirely - the alternative (full DiscoveryClient signal threading) is out of scope for this PR - the leak is bounded by an existing internal timeout, not unbounded The resolver pre-step `signal.aborted` check (added in round 3) still prevents NEW agent-directory work from starting once the deadline elapses; this round-5 race covers the case where the SPARQL is already in flight when the signal fires. Co-authored-by: Cursor <cursoragent@cursor.com>
…ds (Codex PR #496 round 6) The resolver's contract is "non-empty Address[] return ⇒ peerStore is primed for these addresses". Earlier rounds appended addrs to `accumulated` BEFORE awaiting `addKnownAddresses`, so a peerStore merge failure (malformed/stale multiaddr, transient peerStore error) left the caller seeing successful resolution while the bare peer-id dial silently missed every address. Fix: extract a `primeAndAppend()` helper used by steps 2-4 (DHT, registry, agents-CG) that awaits the merge first and only pushes the addr into `accumulated` on success. Step 1 (live-conn) doesn't need it — those addrs come straight from active connections, no priming required. Regression test: a network mock whose addKnownAddresses always throws returns [] instead of advertising "primed" addrs the dial would miss. 19/19 PeerResolver tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>
… (Codex PR #499 round 5) The race window: between the entry-time `if (opts?.signal?.aborted)` check and `signal.addEventListener('abort', ...)` inside the new Promise constructor, the signal could fire. Since the `abort` event is one-shot, our late listener would never see it and the Promise would hang for the full lookup duration — defeating the entire point of the race wrapper. Fix: re-check `signal.aborted` INSIDE the Promise constructor (atomic with the addEventListener call). If we lost the race, resolve null immediately. Same hoisting move pulls `opts.signal` into a stable `signal` const so the constructor closure doesn't have to assert non-null on each access. Co-authored-by: Cursor <cursoragent@cursor.com>
… PR-3) ProtocolRouter.send() now consults the PeerResolver before dialing, so every protocol that goes through the router (sync, skill invoke, OpenClaw channel, query-remote, storage-ack, verify-proposal, join-request, swm-sender-key, plus all of Messenger's chat traffic that already routes through here) hits the same resolution order that Messenger.sendToPeer got in PR-2. Changes: - packages/core/src/protocol-router.ts: introduces ProtocolRouterOptions with an optional peerResolver. send() invokes resolver.resolve(peerIdStr) best-effort before the dial loop. Resolver step 1 (live-conn cache) is sub-millisecond, so the cost is invisible when a connection already exists; the win is on cold dials, where the resolver primes the libp2p peerStore with whatever the resolution order finds (live-conn → DHT → RFC 04 registry stub → agents-CG → bootstrap). peerResolver is optional during the migration; tests that don't exercise resolution can omit it. PR-4 adds the CI grep gate that enforces the property structurally. - packages/core/src/index.ts: exports ProtocolRouterOptions. - packages/agent/src/dkg-agent.ts: reorders construction to build network → resolver → router → messenger so the same resolver is shared across both consumers. - packages/core/test/protocol-router-resolver.test.ts: 3 cases verifying the resolver is consulted exactly once per send with the correct peerId, that resolver failures don't block send, and that omitting the resolver preserves legacy behaviour. Tests: - 651/651 pass in core (+3 new) - 429/429 pass in agent (no regressions from the constructor reorder) Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §3.2 Co-authored-by: Cursor <cursoragent@cursor.com>
…er resolve Codex review feedback on PR #497: 1. ProtocolRouter.send was calling peerResolver.resolve() with no options, so the resolver ran with its default 5_000ms per-step DHT timeout regardless of the caller's `timeoutMs`. A caller passing `timeoutMs: 1000` could block 5s+ on resolution before the dial even started, blowing the timeout budget by 5x. Fix: pass the same AbortSignal + remaining time budget to the resolver. The dial loop signal is already constructed at the top of send(); reusing it ensures a single end-to-end deadline covers resolution + dial + read + retries. 2. Messenger.sendToPeer was doing its own peerResolver.resolve() call before delegating to router.send(), which (since PR-3) does the same lookup. Cold sends paid the DHT walk twice. Removed the duplicate from Messenger; the resolver dependency is now owned by ProtocolRouter alone, and Messenger is a thin pass-through to the router. Updated MessengerDeps + dkg-agent.ts construction site. Test rewrite: p2p-messenger.test.ts no longer mocks a resolver (Messenger doesn't hold one); the structural property "resolver primes peerStore before dialProtocol" is verified at the router layer in protocol-router-resolver.test.ts. Tests: - core 16/16 in peer-resolver + protocol-router-resolver - agent 3/3 in p2p-messenger - core + agent build clean Co-authored-by: Cursor <cursoragent@cursor.com>
…ex PR #497) Codex review feedback on PR #497 (round 3): > the regression test doesn't actually prove the new guarantee. It > warms the connection first (a.libp2p.dial(...) on line 40), so > router.send() can succeed even if the resolver is never consulted > before the first dial, and there is no assertion on dialProtocol > call order. Correct — pre-warming meant peerStore was already populated via the identify exchange, so even a no-op resolver wouldn't have caused send() to fail. The test couldn't catch the PR-448-class bug it was written to defend against. Rewrite strategy: 1. Keep the peers cold (no pre-dial). 2. Assert peerStore for B is empty BEFORE routerA.send(). 3. Wire the resolver to populate peerStore with B's real addr when called. 4. Send succeeds → only possible if the resolver ran before dialProtocol. (Otherwise libp2p has nothing to dial.) 5. Assert peerStore for B is populated AFTER send. 6. Assert resolver.resolve was called exactly once. The "send-succeeded-against-cold-peerStore" property is the structural proof of ordering — no dialProtocol instrumentation needed (which had teardown-race issues in an earlier attempt). Split out the existing AbortSignal + perStepTimeoutMs assertion into its own test so each test asserts one property; that test still pre-warms (it's testing arguments, not the priming side effect). Tests: - core protocol-router-resolver 4/4 (was 3, +1 for the cold-peer ordering rewrite; the "AbortSignal+perStepTimeoutMs" assertion split into its own test) Co-authored-by: Cursor <cursoragent@cursor.com>
…esolver (Codex PR #497 round 5) Codex review feedback: keeping peerResolver optional makes the cold- peer priming guarantee implicit. Any future `new ProtocolRouter(node)` on an outbound path would compile and silently skip priming, reintroducing PR-448's class of relay-route failures. Two mitigations are now in place: 1. CI grep gate (`scripts/audit-dial-protocol.mjs` from PR #499) bans raw `dialProtocol(peerId)` calls outside an allowlist, so any new outbound path is forced through ProtocolRouter. 2. (this commit) `send()` emits a one-time `console.warn` the first time it runs without a peerResolver, surfacing misconfiguration loudly at first cold dial rather than silently regressing. The full enforcement (`peerResolver` becoming required) is intentionally deferred — it would be a breaking change to test fixtures and edge callers that legitimately route over warm connections only. The warn+grep combo turns the silent-bypass risk into a build-time + first-call-time surface. Adds a test that asserts: (a) the warn fires, (b) only once across multiple sends, and (c) references RFC 07 / PR #448 in the message. Co-authored-by: Cursor <cursoragent@cursor.com>
This is the last structural piece of RFC 07 — after this lands, every peer-dial path in the daemon goes through PeerResolver, and the CI grep gate prevents anyone from adding a new dialProtocol( site that silently bypasses the resolver. Changes: - packages/agent/src/dkg-agent.ts: connectToPeerId no longer hand-rolls a DHT walk + peerStore merge; it delegates to peerResolver.resolve(), which runs the full RFC 07 §3.1 order (live conn → DHT → RFC 04 registry stub → agents-CG fallback → bootstrap seeds). The agents-CG fallback in particular is a new capability /api/connect didn't have: a peer whose DHT record is stale but whose relay is still advertised in the agents context graph is now reachable. Saves 80 LoC. Known regression vs PR #431: the legacy walk distinguished DHT_TIMEOUT (504) and DHT_UNAVAILABLE (503) from PEER_NOT_FOUND because the inline path saw the underlying error shape. The resolver is best-effort and swallows per-step errors (returns []), so both transient cases now collapse into 404. The 502 "addrs found but dial failed" distinction is preserved. If reviewers want the 503/504 codes back, the follow-up is a resolveWithDiagnostics method on PeerResolver — out of scope here. - scripts/audit-dial-protocol.mjs: walks every production .ts/.js source file under packages/, fails the build if dialProtocol( appears outside the explicit allowlist (LibP2PNetwork wrapper + ProtocolRouter). Reuses the bypass-resistant comment/string lexer from audit-create-random.mjs so dialProtocol( inside comments, strings, or regex literals doesn't false-positive. - scripts/audit-dial-protocol.test.mjs: 7 node:test cases for the audit's regex / lexer integration (split invocations, comments, strings, multi-hit, no false-positives on dial / dialMultiselect). - .github/workflows/ci.yml: wires the audit + its test into the build job's pre-install audits, alongside the existing audit-create-random gate. Verification: - `node scripts/audit-dial-protocol.mjs` — passes (2 allowlisted, no violations elsewhere) - `node --test scripts/audit-dial-protocol.test.mjs` — 7/7 pass - `pnpm --filter dkg-core test` — 651/651 - `pnpm --filter dkg-agent test` — 429/429 - `pnpm --filter dkg-agent exec vitest run test/e2e-dht-dial.test.ts` — 3/3 (validates connectToPeerId via PeerResolver against real libp2p nodes; new log line "Resolving X via PeerResolver..." → "Resolved X → 3 addr(s); dialling..." confirms the new path) - `pnpm -r build` — clean across all 14 packages Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §3.2 Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #499: 1. connectToPeerId was passing `timeoutMs` as the resolver's per-step budget AND reusing it for the final libp2p.dial(), so a slow DHT walk plus a slow dial could exceed the caller's deadline by ~2x. Fix: build one AbortSignal at the top, thread it through both the resolver call and the dial. The resolver already supports `signal` (added in PR-3 fixup). perStepTimeoutMs is now `max(0, remaining)` so a near-expired budget doesn't get re-extended for the next step. 2. The audit-dial-protocol regex only matched `.dialProtocol(`, leaving `?.dialProtocol(` and bracket-access forms (`["dialProtocol"](...)`, `?.["dialProtocol"](...)`, etc.) as easy bypasses. Since the script is supposed to enforce RFC 07 §3.2 structurally, those forms have to be caught too. Fix: split detection into two regexes: - MEMBER_ACCESS_RE catches `.dialProtocol(` and `?.dialProtocol(` after lexer-stripping comments+strings. - BRACKET_ACCESS_RE scans the ORIGINAL text for the bracket form (any quote style: ", ', or `), then filters to positions where the opening `[` survives in the stripped text (so commented-out and in-string occurrences are excluded). Codex's specific suggestion was to add regression tests alongside the existing newline/comment cases — done. Tests: - 15/15 in audit-dial-protocol.test.mjs (was 7, +8 new for the bracket/optional/comment-bracket/string-bracket/non-target-bracket cases) - audit-dial-protocol still OK against the production tree (2 allowlisted, no violations) - e2e-dht-dial 3/3 (validates connectToPeerId with single deadline) Note: PR #499's third Codex finding (PEER_NOT_FOUND unreachable when bootstrapPeers configured) is fixed by PR #496's fixup that drops the bootstrap-seeds step from the resolver. This branch picks that up via rebase. Co-authored-by: Cursor <cursoragent@cursor.com>
…499) Codex review feedback on PR #499 (round 2): 1. Bracket regex bypassable with comments inside the brackets. `libp2p[/*x*/"dialProtocol"](...)` and `libp2p["dialProtocol"/*x*/](...)` are valid JS and the round-1 regex required only whitespace between `[`, the string, and `]`, so both forms walked past the audit. Round 2 replaces the inline `\s*` with a `(?:\s | block-comment | line-comment)*` repetition so any combination of whitespace and inline comments is accepted. Added 5 regression tests covering: - comment between [ and string - comment between string and ] - comments on both sides - line-comment + newline interleaved - bracket form fully inside an outer block comment (still ignored) 2. Per-line dedup was wrong. Two distinct dialProtocol( calls on the same line collapsed into one hit, so an allowlisted file with `expectedHits: 1` could silently grow a second raw dial on the same line without failing CI. Round 2 dedupes by source INDEX instead, and added two regression tests: - two member-access calls on the same line → 2 hits - one member + one bracket on the same line → 2 hits 3. Doc-comment quoted block-comment delimiters in code examples, which terminated the JSDoc block prematurely (parse error). Spelled out the delimiters in prose to keep the file self-parsing. Tests: - 22/22 in audit-dial-protocol.test.mjs (was 15, +7 round-2 cases) - production tree audit still OK (2 allowlisted, no violations) Co-authored-by: Cursor <cursoragent@cursor.com>
… doc (Codex PR #499) Codex review feedback on PR #499 (round 3): 1. Audit still misses optional-CALL forms. `libp2p.dialProtocol?.(...)` and `libp2p["dialProtocol"]?.(...)` both execute the raw dial path while bypassing the round-2 detector. The `?.` between the property access and the open paren is a different position from the property-access optional chain the round-2 fix covered. Fix: extend MEMBER_ACCESS_RE and BRACKET_ACCESS_RE to optionally accept `?.` after the property/bracket and before the `(`. Added 4 regression tests: - foo.dialProtocol?.( - foo?.dialProtocol?.( - foo["dialProtocol"]?.( - foo?.["dialProtocol"]?.( All pass; production tree audit still OK (2 allowlisted, no violations). Audit suite now: 26/26 (was 22). 2. dkg-agent.ts docstring still mentioned "bootstrap seeds" as a resolver step. PR #496's fixup dropped that step entirely, so the docstring was now misleading future readers about a nonexistent fallback. Fix: removed "bootstrap seeds" from the resolution-order list, removed it from the PEER_NOT_FOUND error description, and added a short note explaining WHY the step was removed (cite Codex feedback on PR #496) so a future contributor doesn't try to re-add it. Bootstrap remains a libp2p-startup concern. Co-authored-by: Cursor <cursoragent@cursor.com>
…(Codex PR #499 round 5) Codex review feedback: PR #499's resolver migration collapsed all "resolver returned []" cases into PEER_NOT_FOUND. Since /api/connect upstream maps 404 to a terminal "wrong peer id" outcome and 504 to a retriable infrastructure error, this turned every DHT timeout / abort / network blip into a permanent input error from the UI's perspective — a public-API regression vs PR #431. Fix: - `connectToPeerId` now distinguishes "resolver returned [] AND signal aborted" → CONNECT_TIMEOUT (504, retriable) from "resolver returned [] cleanly" → PEER_NOT_FOUND (404, terminal). - `agent-chat.ts` route handler maps the new CONNECT_TIMEOUT code to HTTP 504 alongside the legacy DHT_TIMEOUT, so the UI's existing retry logic works without changes. The deeper 503 (DHT_UNAVAILABLE / PEER_ROUTING_UNAVAILABLE) distinction still requires a `resolveWithDiagnostics` API on PeerResolver and is left as the documented follow-up — see RFC 07 §3.3 and the updated `connectToPeerId` docstring. Co-authored-by: Cursor <cursoragent@cursor.com>
…TP semantics
The existing devnet-test.sh covers SWM/VM, publishing, gossipsub, sub-graphs,
conversation turns and free-CG flows across 27 sections (~2100 lines), but
none of them exercise the RFC 07 /api/connect resolver path or its HTTP
mapping. SECTION 28 closes that gap with seven scenarios:
28a INVALID_PEER_ID → HTTP 400 + code=INVALID_PEER_ID
28b SELF_DIAL → HTTP 400 + code=SELF_DIAL
28c PEER_NOT_FOUND
OR CONNECT_TIMEOUT → HTTP 404 OR HTTP 504
A fresh Ed25519 peerId is generated via libp2p so the format always
parses on the installed version. Both 404 (resolver completes
cleanly) and 504 (signal aborts, RFC 07 PR #499 round 5) are
acceptable; anything else is a regression in the route's switch.
28d Cold-peer dial via PeerResolver — disconnect Node3 from Node1's
perspective, then /api/connect by peerId only. Asserts the
resolver primes peerStore so dialProtocol succeeds without the
caller passing a multiaddr.
28e Idempotent fast-path — re-call /api/connect on the just-connected
peer, asserts <= 3s response (resolver step-1 live-conn check).
28f Empty body → HTTP 400 (route-level guard)
28g Audit gate — runs audit-dial-protocol.mjs locally so a regression
that re-introduces a raw libp2p.dialProtocol(peerId, ...) outside
the allowlist surfaces as a devnet-side smoke test, not just at
PR-merge time in CI.
Verified live against a 6-node devnet (4 core + 2 edge): all 7 PASS.
The §28c run produced HTTP 504 + CONNECT_TIMEOUT in 15s with a freshly-
generated ghost peerId — concrete proof that the round-5 transient-vs-
terminal split works end-to-end through the HTTP boundary, not just in
the unit test.
Co-authored-by: Cursor <cursoragent@cursor.com>
…art resilience (PR #499 follow-up) Builds on SECTION 28 with three additional RFC 07 coverage scenarios plus ergonomic wins for the test harness: §28h — Legacy /api/connect {multiaddr} form. Pins the non-resolver branch in agent-chat.ts so a regression that breaks the legacy direct-dial path while editing the resolver branch can't slip past CI. Multiaddr extracted from daemon log (filtered by node's own peerId to skip the bootstrap-relay line). §29 — Cross-node /api/connect matrix. Every node /api/connect's to every other node by peerId alone (N=6 → 30 pairs). Verifies the resolver wiring is uniformly installed across the whole cluster — catches "one node missing the resolver wiring" init-order bugs that §28d (Node1 → Node3 only) can't see. Honest about the mDNS warm-cache caveat in the section preamble; emits [SLOW] markers for any pair that takes > 2s as a perf-regression canary. §30 — Restart resilience. Kills Node1, restarts it with the same DKG_HOME, waits for /api/status to come back, then dials Node3's peerId from the freshly-restarted Node1. Verifies the resolver+API survive a full process restart end-to-end. Honest scope note in the preamble: on loopback devnet, mDNS reconverges fast enough that this is closer to a "restart-and-still-works" smoke test than a true cold-path test of the deeper resolver steps. The unit tests (peer-resolver.test.ts) cover the cold-path step walk in isolation; this section's unique value is end-to-end process restart. Destructive — gated by SKIP_RESTART=1 and runs LAST. Harness ergonomics: - SCRIPT_T0 + total-elapsed line in TEST SUMMARY (e.g. "Elapsed: 215s") so a "this used to take 3 minutes" regression is visible. - section_start / section_done helpers wired into §28-30 (per-section timing). Existing 27 sections aren't retrofitted — would need wrapping each in a function. - SKIP_RESTART=1 / SKIP_MATRIX=1 env vars for fast iteration. - RESTART_BOOT_TIMEOUT_S=60 default for §30d. Bash 3 compat: macOS ships bash 3.2, so §29 uses a sparse indexed array (PEERID_BY_PORT[$port]=...) instead of `declare -A` — same effect since port numbers are integers. Tested on 6-node devnet (PR #499 branch + this commit on top): §28a-h: 8/8 PASS §29 : 30/30 pairs reachable, 0 slow, 0 failed §30a-f: 6/6 PASS, restart took 5s, mesh restored to 6 peers Co-authored-by: Cursor <cursoragent@cursor.com>
… gate; pin §28c/§30 (Codex PR #499 round 5) Three Codex review items addressed: 1. dkg-agent.ts dial-phase abort detection (Codex line 4096) The shared AbortSignal now covers BOTH resolution and dial. If most of the budget went into resolve() and dial() then aborts on the same signal, we previously misclassified as DIAL_FAILED (502, transport failure). Now we detect signal.aborted / AbortError / ABORT_ERR in the dial-phase catch and rethrow as CONNECT_TIMEOUT (504, retriable) — preserving the timeout-vs-transport-failure distinction across the entire connect path. 2. audit-dial-protocol.mjs aliasing/destructuring bypass (Codex line 170) Earlier rounds caught CALL forms (member, optional-chain, bracket, optional-call). But aliasing patterns leave no `dialProtocol(` token at the call site: const fn = libp2p.dialProtocol.bind(libp2p); const { dialProtocol } = libp2p; Round 4 widens the detector to flag any occurrence of the bare `dialProtocol` IDENTIFIER token in stripped (comments/strings blanked) text — single regex `\bdialProtocol\b` covers every call/alias/destructure form. Bracket-string access (`["dialProtocol"]`) still needs the separate BRACKET_ACCESS_RE because the stripper blanks string contents. Allowlist updated for the broader detection: definition sites count too, so: - network.ts: 1 (interface declaration) - libp2p-network.ts: 2 (impl method definition + libp2p call) - protocol-router.ts: 1 (the resolver-consulting call) — unchanged 33/33 audit-dial-protocol.test.mjs pass (8 new tests for aliasing, destructuring, function-arg destructuring, bare reference, word-boundary protection of unrelated identifiers). 3. devnet-test.sh §28c (line 2202) + §30 (line 2502) §28c was accepting BOTH 404 and 504 for the ghost peer — a regression that turned every clean miss into 504 would have passed. On this 6-node devnet, the deterministic outcome is 504 (DHT walk keeps trying for the full timeoutMs budget for a non-existent peer ID). Pinned to 504 only; 404 case is covered by the unit test `peer-resolver.test.ts: returns empty array when nothing resolves`. §30e was downgrading CONNECT_TIMEOUT/PEER_NOT_FOUND to warn, meaning the test silently passed when post-restart connectivity was broken — defeating the section's purpose. Strict mode: retry /api/connect every 1s for up to 30s (RESTART_RETRY_BUDGET_S override), pass on first 200, FAIL hard if it never recovers. This absorbs the legitimate mDNS/DHT warmup window without going silent on real regressions. Co-authored-by: Cursor <cursoragent@cursor.com>
… PR-5)
Last of the RFC 07 stack. Replaces the upstream gossipsub msgIdFn
(which uses key+seqno for signed messages, sha256(data) for unsigned)
with a single content-deterministic constant the network commits to:
msgId(topic, payload, fromIdentityId)
:= sha256(topic ‖ payload ‖ fromIdentityId)
fromIdentityId = signed → from.toMultihash().bytes
unsigned → ∅ (empty)
Why now, behaviour-invisibly: v1 ships only one gossip backend
(LibP2PGossipBackend), so msgIds change but no consumer compares them
across implementations today. The point of locking in the constant
*before* a second backend exists (iroh-gossip, anything else RFC 07
might license) is that the wire-format change has to land coordinated
across the entire network — adding a different-but-incompatible msgId
later would force a synchronised mid-flight upgrade. Doing it now is
free; doing it later is operationally painful at scale.
Changes:
- packages/core/src/network/gossip-msg-id.ts: dkgGossipMsgId function;
total over the signed/unsigned Message union; returns 32-byte sha256.
- packages/core/src/network/index.ts + src/index.ts: re-exports.
- packages/core/src/node.ts: passes msgIdFn to gossipsub() construction.
- packages/core/test/gossip-msg-id.test.ts: 8 cases pinning the encoding,
covering signed/unsigned/UTF-8/dedup-by-content/no-leak-of-seqno-or-sig.
Tests:
- core 659/659 (+8 new)
- agent 429/429 (msgId change is content-equivalent for V10 today;
no regressions across the gossip-heavy e2e suite)
- audit-dial-protocol still OK (no new dialProtocol sites)
- pnpm -r build clean
Refs: dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md §5.4
Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #501 caught two bugs in the v0 encoding that would have shipped if merged as-is: Bug 1 — length collision ------------------------ Raw concatenation `topic ‖ payload ‖ from` is not injective on the input tuple. With the same `from`, ("ab","c") and ("a","bc") hash the same bytes and dedup as one message even though they're distinct publishes. Fix: u32_be length prefix per field. Distinct (topic, payload, from) tuples now hash to distinct inputs. Bug 2 — retries broken ---------------------- Pure content-derived msgIds dedup the same peer publishing the same payload twice. Gossipsub uses msgId for cache-based dedup; without seqno in the hash, a network blip + retry produces the same msgId and the retry is silently dropped. The upstream `msgIdFnStrictSign` includes seqno precisely for this reason. Fix: include the gossipsub sequenceNumber for signed messages (u64_be suffix). Unsigned messages use 0n (V10 doesn't ship unsigned today; the branch exists so the function is total over the Message union). Final encoding pinned by tests: msgId(topic, payload, fromIdentityId, seqno) := sha256( u32_be(len(topic)) ‖ topic ‖ u32_be(len(payload)) ‖ payload ‖ u32_be(len(fromIdentityId)) ‖ fromIdentityId ‖ u64_be(seqno) ) Tests: - core 11/11 in gossip-msg-id (was 8, +3 new for the codex-bug regressions: collision, retry, seqno-determinism) - core 662/662 full suite (no other test relied on the old encoding) Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #501 (round 2) identified a real rolling-upgrade hazard: Gossipsub puts msgIds into its IHAVE/IWANT control protocol, so during a rolling upgrade where some nodes still use the upstream default `msgIdFnStrictSign` and some use `dkgGossipMsgId`, the two halves of the mesh compute different IDs for the same payload and stop correlating cache entries. Push delivery still works (the payload bytes on the wire are unchanged); only the dedup cache fragments, producing extra IWANT/SERVE round-trips per message until the upgrade completes. The original RFC 07 §5.4 justification was "behaviour-invisible today, locks in the constant before a second backend exists" — the "behaviour-invisible" half was wrong. Decision: defer the actual wiring. v1 ships: - The `dkgGossipMsgId` function (length-framed, includes seqno). - 11/11 encoding tests pinning the constant. - Public re-exports so external code can reference the constant. …but does NOT flip `msgIdFn` in `node.ts` yet. The flip is staged with a coordinated mesh-wide upgrade, most likely combined with a gossipsub protocol-version bump so old/new nodes segregate into separate meshes during cutover instead of degrading a shared one. Probably bundled with the introduction of a second gossip backend (RFC 07 Phase 6), where the protocol-version bump is needed anyway. Code changes: - Removed the `msgIdFn: dkgGossipMsgId` line from `node.ts` and dropped the import. Replaced with a long inline NOTE pointing to the RFC and explaining why this is intentionally not wired. Spec changes (filed separately in dkgv10-spec repo): - Rewrote RFC 07 §5.4 to explain why wiring is deferred and what the cutover plan is. - Updated Phase 5: "post-v1, coordinated" rather than "optional, ~zero risk". Tests: - core gossip-msg-id 11/11 (unchanged — function still ships) - core build clean Co-authored-by: Cursor <cursoragent@cursor.com>
Codex review feedback on PR #501 (round 3): > This comment says the deferred wiring plan is tracked in > `dkgv10-spec/production_mainnet/07_IN_PROCESS_PEER_RESOLVER.md`, > but that path does not exist in this repo. That leaves future > follow-up with no discoverable anchor. True — the RFC 07 spec lives in a sibling repo (`dkgv10-spec`), not in this one. Pointing at a path that doesn't resolve from this repo's root would have led the next contributor on a wild goose chase. Fix: reorganise the comment so the in-repo cross-references come first (the function lives at packages/core/src/network/gossip-msg-id.ts with tests at packages/core/test/gossip-msg-id.test.ts — both discoverable from this repo today), and explicitly call out that the spec doc is in the sibling `dkgv10-spec` repo. The summary above the cross-references already covers the "why" so the wiring decision is reviewable purely in-place. Co-authored-by: Cursor <cursoragent@cursor.com>
…501) Codex review feedback on PR #501 (round 4): > For type: 'unsigned' this falls back to fromBytes = [] and seqno = > 0n, so two different unsigned publishers sending the same payload > on the same topic will produce the same msgId and one publish will > be falsely deduplicated. Because this helper is now exported > publicly, consumers can hit that collision even before core wires > it in. Real bug at the API surface level. The earlier draft handled unsigned by zero-filling the publisher and seqno fields, which preserves the upstream-default false-dedup property — fine if nothing ever calls it with unsigned messages, but a footgun for external consumers of the exported function. V10 configures gossipsub with the StrictSign default, so unsigned messages don't appear in the wild today. Throwing here makes the "unsigned not supported in this msgId scheme" stance explicit: - any code path that tries to publish an unsigned message fails loudly (easy to debug), - external consumers of the exported function can't accidentally hit the false-dedup case, - a future PR that wants to support unsigned has to deliberately extend the scheme with a per-message identity (nonce / hash prefix / etc.), pinned by tests. Implementation: - New `DkgGossipUnsignedMessageError` class (also exported) with a message that explains the constraint and points at the workarounds. - `dkgGossipMsgId` throws the error for `msg.type !== 'signed'` before building the hash input. - Updated tests: replaced the "unsigned: length-framed sha256 with seqno=0n and empty from" case with "unsigned: throws DkgGossipUnsignedMessageError"; dropped the now-impossible "signed and unsigned with same topic/payload differ" case. Tests: - core gossip-msg-id 10/10 (was 11; net -1 because the signed-vs-unsigned diff test became unreachable) Co-authored-by: Cursor <cursoragent@cursor.com>
…dex PR #501 round 5) Codex review feedback flagged that the round-4 signature accepted a libp2p `Message` directly, which made the "cross-backend dedup" framing aspirational rather than concrete. A future iroh-gossip backend would need to either duplicate the framing logic or shoehorn its messages through the libp2p shape. Split: - `dkgGossipMsgIdRaw({ topic, data, publisherIdBytes, sequenceNumber })` is the backend-agnostic primitive; framing + sha256 lives here once. - `dkgGossipMsgId(msg: libp2p.Message)` is now a thin adapter that unwraps libp2p shape into raw inputs and enforces signed-only. Tests: - Adds a FIXED VECTOR test that pins the exact 32-byte sha256 output for a known input, addressing Codex feedback that the existing `expected()` helper mirrors production logic and would mask encoding drift if both were mutated together. - Adds direct tests for `dkgGossipMsgIdRaw` (length-framing, seqno-in-hash, 32-byte digest). - Adds an adapter↔raw equivalence test proving the libp2p adapter delegates to the primitive on every signed message. Marks the public API `@experimental` in JSDoc — the function is intentionally unwired in v1 (see RFC 07 §5.4 + Phase 5 deferred cutover) and downstream consumers shouldn't rely on the in-process mesh routing through it yet. Co-authored-by: Cursor <cursoragent@cursor.com>
…ex PR #501 round 6) The libp2p adapter (`dkgGossipMsgId`) rejects unsigned messages because they have no publisher identity, but the raw primitive (`dkgGossipMsgIdRaw`) used to silently accept the equivalent `publisherIdBytes.length === 0` case — reopening the exact false-dedup hazard the adapter was built to close. Any future backend or consumer that forgot to plumb publisher-identity bytes through the raw primitive would have shipped a quiet correctness bug at the cross-backend boundary the primitive was meant to lock down. Add `DkgGossipMissingPublisherError` (exported from network/index + core/index) and throw it from `dkgGossipMsgIdRaw` on empty input. Updated the "returns 32-byte SHA256" test to use a non-empty publisher (the empty-publisher case is now an error, not a vector). Added a regression test pinning the throw. 17/17 gossip-msg-id tests pass. Co-authored-by: Cursor <cursoragent@cursor.com>
| return; | ||
| } | ||
|
|
||
| if (opts?.relayCapable === true) { |
There was a problem hiding this comment.
Leaving false/absent as a no-op makes the on-chain flag sticky. The daemon calls this at startup with relayCapable: config.relayCapable === true, so a core that once ran with relayCapable: true and then disables relay service keeps ProfileStorage.relayCapable == true forever unless an operator manually flips it. That stale registry entry means future resolver/attestation logic can continue treating an opted-out node as a relay candidate. Since this method is documented as syncing the intended relayCapable flag, please compare current with the desired value and call setRelayCapable(false) when config is explicitly false (or preserve undefined vs false at the lifecycle call if manual flips must remain untouched).
There was a problem hiding this comment.
Good catch — fixed in a4eb9df.
publishRelayRegistry now has explicit tri-state semantics:
true→ ensure on-chain flag is true (idempotent)false→ ensure on-chain flag is false (actively clears stale prior opt-in)undefined→ no-op (preserves manualdkg admin set-relay-capableflips)
The lifecycle call site (packages/cli/src/daemon/lifecycle.ts:772) was the bigger problem — it was coercing config.relayCapable === true, destroying the undefined-vs-false distinction the resolver method was trying to support. It now passes config.relayCapable directly.
Updated:
packages/agent/src/dkg-agent.ts— branch on bothtrueandfalse, anything else (including non-boolean misconfig) is treated as "no opinion"packages/cli/src/daemon/lifecycle.ts— drop=== truecoercion, pass throughpackages/cli/src/config.ts— docstring updated to document tri-statepackages/agent/test/publish-relay-registry.test.ts— 12 regression tests (fresh-flip, idempotent both directions, false-clears-stale, undefined-no-op, opts omitted, non-boolean ignored, no chain support, no profile, getIdentityId throws, no getRelayCapable, setRelayCapable rejects)
All 440 agent tests pass; both cli and agent packages typecheck clean.
…dex PR #506) The previous version only acted on `opts.relayCapable === true`, treating both `false` and `undefined` as a no-op. The lifecycle call site coerced `config.relayCapable` to `=== true`, so a node that had ever started with `relayCapable: true` would keep advertising relay capability on-chain forever — even after the operator removed the flag from config. Stale opt-ins meant future resolver / attestation logic could keep treating opted-out nodes as relay candidates. Tri-state semantics now (explicit per Codex review): - `true` → ensure on-chain flag is true (idempotent, flips if false) - `false` → ensure on-chain flag is false (idempotent, flips if true, actively clearing stale prior opt-in) - `undefined` → leave on-chain alone, preserving any manual `dkg admin set-relay-capable` flips Changes: - `publishRelayRegistry` (dkg-agent.ts) checks both true and false explicitly; rejects non-booleans as "no opinion". - Lifecycle now passes `config.relayCapable` directly without `=== true` coercion, so the agent sees the operator's true/false/undefined intent. - Config docstring updated to document the tri-state. - 12 new regression tests in `publish-relay-registry.test.ts` pinning every branch (fresh true, idempotent true, false clears stale, idempotent false, undefined no-op, opts omitted, non-boolean ignored, no chain support, no profile, getIdentityId throws, no getRelayCapable, setRelayCapable rejects). Co-authored-by: Cursor <cursoragent@cursor.com>
Lockstep bumps every workspace package from 10.0.0-rc.6 → 10.0.0-rc.7 so a `pnpm install dkg-v10@10.0.0-rc.7` resolves to a coherent set. `network/testnet.json#chainResetMarker` is intentionally NOT bumped — rc.7 ships node-only changes (no contract redeploy), so the existing `v10-rc6-pca-author-attestation-2026-05-10` marker is correct and on-chain state must NOT be wiped on first boot. CHANGELOG entry summarises rc.7 contents: - RFC 07 in-process PeerResolver (consolidated #494/#496/#497/#499/ #501/#506) — every libp2p dial flows through one resolution chain; `/api/connect` returns precise HTTP codes (504/502/400) with structured error bodies; CI grep gate prevents bypass. - Backend-agnostic dkgGossipMsgId primitive (#501) — unified gossip message-id derivation across real libp2p and in-memory test bus. - devnet-test.sh sections 28-31 (#501/#506/#507) — resolver path, cross-node matrix, cold-restart resilience, full invite/join end-to-end. Knobs: SKIP_RESTART, SKIP_MATRIX, SKIP_INVITE_FLOW. - PCA wiring at registration (#502, rebase of #423) — PCA agents can publish to curated CGs without 500 fallthroughs; non-existent PCA accounts → clean 404; publishPolicy exposed on ApiClient.createContextGraph for dkg-node-ui. - relayCapable tri-state on chain (#506) — true ensures, false actively clears, undefined leaves untouched (preserves manual `dkg admin set-relay-capable` flips). 12 regression tests added. - SECTION 31 invite/join e2e in devnet smoke suite (#507) — asserts DKG_CURATOR triple, accept+persist log path, denied-without-phantom semantics, and post-approval _meta mirroring. Co-authored-by: Cursor <cursoragent@cursor.com>
What this PR is
A consolidation of two stacked feature trains that landed correctly on their feature branches but never reached
main:feat/461-relay-registry.base) rather than cascading down tofeat/461-relay-registry. Net result: only PR core(network): add Network interface + LibP2PNetwork wrapper (RFC 07 PR-1) #494 made it ontofeat/461-relay-registry; the other four (core(network): add PeerResolver + migrate Messenger (RFC 07 PR-2) #496-501) ended up stranded on their dead-end intermediates.This PR points the tip of that work —
feat/rfc07-pr5-gossipsub-msgid— directly atmain, which is the only branch that contains every commit from the entire intended train.Supersedes / closes
What's bundled
Networkinterface +LibP2PNetworkwrapper (incl. Codex round-1 fix:runOnLimitedConnection+ handler async/abort)PeerResolver+ Messenger migration (incl. round-6 prime-then-append fix + abort-listener race fix)ProtocolRouterconsultsPeerResolver/api/connectmigration + CI grep gate (incl. round-5 fixes: dial-phase abort detection, audit aliasing/destructuring, devnet §28c/§30 strict mode)dkgGossipMsgId+dkgGossipMsgIdRaw(incl. round-6 empty-publisher rejection)Stats
mainTest state on tip
libp2p-network+protocol-router+protocol-router-resolver+peer-resolver+gossip-msg-idaudit-dial-protocol.test.mjspassaudit-dial-protocol.mjsruns clean against the codebasescripts/devnet-test.sh) §28-30 cover RFC 07 connectivity end-to-end (most recent run: 7 RFC 07 tests PASS, 30/30 cross-node matrix, post-restart resilience strict-mode PASS)Why a new PR rather than fix-forward on #471
The cleanest alternative was to merge
feat/rfc07-pr5-gossipsub-msgidintofeat/461-relay-registryand let #471 carry it. That involves a merge commit and likely conflict resolution against the pre-Codex-fix #494 head that's already on relay-registry. Opening this PR directly is one fewer step and produces a clean linear-ish history intomain.Post-merge cleanup
Once this is merged, the following branches can be deleted:
feat/461-relay-registryfeat/rfc07-pr1-network-ifacefeat/rfc07-pr2-peer-resolverfeat/rfc07-pr3-protocol-routerfeat/rfc07-pr4-connect-and-gatefeat/rfc07-pr5-gossipsub-msgid(the head of this PR)Made with Cursor