Commit 776f107

Merge pull request #587 from OriginTrail/soak/messenger-rc9-everything
rc.9: Universal Messenger + reliable SWM share fan-out (merge train, 22 PRs)
2 parents 6124bbf + 4a8b716 commit 776f107

86 files changed

Lines changed: 23168 additions & 1864 deletions

AGENTS.md

Lines changed: 7 additions & 0 deletions
@@ -116,12 +116,19 @@ For any other error (peer not found, timeout, network), retry once before bother
- **Don't loop** — if you send a message and the response comes back asking another question, surface it to the operator before auto-replying. Phase 1 is operator-in-the-loop by design; Phase 3 (autonomous bridge) is a future RFC.
- **Don't conflate chat with the `chat` sub-graph.** The `chat` sub-graph (captured by `capture-chat.mjs`) is the operator's conversation with you; `dkg_send_message` is *your* conversation with another agent. They're separate channels for now.

## Universal Messenger (v10.0.0-rc.9)
DKG's short peer-to-peer protocols (chat, skill request, query-remote, swm-sender-key, private-access, join-request, storage-ack, verify-proposal) all route through a single reliability substrate called the **Universal Messenger**. Architecture: [`docs/messenger.md`](./docs/messenger.md). Operator-facing surfaces: [`docs/messenger-operator.md`](./docs/messenger-operator.md). Migration recipe for a hypothetical 9th protocol: [`docs/messenger-add-protocol.md`](./docs/messenger-add-protocol.md).
**Convergence rule for agents working on this codebase**: route any new short-message protocol through `Messenger.sendReliable` and register handlers via `Messenger.register` — never `ProtocolRouter.send` / `ProtocolRouter.register` directly. The substrate gives you sender-side durable retry (SQLite outbox surviving daemon restart), receiver-side dedup keyed by `messageId`, sender-side response cache, stale-snapshot-safe retries, opportunistic flush on `connection:open`, DHT-walk-on-stall, and observability via `/api/slo`. Bypassing it loses every one of those properties. The migration recipe lives at `docs/messenger-add-protocol.md` and includes the worked example from PR-3 (chat).
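As a rough illustration of the convergence rule, the TypeScript sketch below models the sender side of a substrate-routed protocol: one send attempt, an outbox entry on failure, and a flush when the peer reconnects. Everything in it is an assumption made for illustration. `SketchMessenger`, `Transport`, `SendResult`, the method signatures, and the `/dkg/10.0.1/example` protocol ID are hypothetical, and an in-memory `Map` stands in for the SQLite outbox. The real surfaces live in `packages/agent/src/p2p/messenger.ts` and may differ.

```typescript
// Hypothetical shapes for illustration only; not the real Messenger API.
type SendResult = { delivered: boolean; queued: boolean; messageId: string };
type Transport = (peerId: string, protocolId: string, messageId: string, payload: string) => boolean;

interface OutboxEntry {
  peerId: string;
  protocolId: string;
  messageId: string;
  payload: string;
  attempts: number;
}

class SketchMessenger {
  // In-memory stand-in for the SQLite-persisted outbox.
  private outbox = new Map<string, OutboxEntry>();
  private nextId = 0;
  private transport: Transport;

  constructor(transport: Transport) {
    this.transport = transport;
  }

  // Try the wire once; on failure, queue the message for a later retry.
  sendReliable(peerId: string, protocolId: string, payload: string): SendResult {
    const messageId = `msg-${++this.nextId}`;
    if (this.transport(peerId, protocolId, messageId, payload)) {
      return { delivered: true, queued: false, messageId };
    }
    this.outbox.set(messageId, { peerId, protocolId, messageId, payload, attempts: 1 });
    return { delivered: false, queued: true, messageId };
  }

  // Opportunistic flush, analogous to the `connection:open` hook.
  processOutboxOnConnect(peerId: string): number {
    let drained = 0;
    for (const entry of [...this.outbox.values()]) {
      if (entry.peerId !== peerId) continue;
      entry.attempts += 1;
      // Retries reuse the original messageId so the receiver can dedup.
      if (this.transport(entry.peerId, entry.protocolId, entry.messageId, entry.payload)) {
        this.outbox.delete(entry.messageId);
        drained += 1;
      }
    }
    return drained;
  }

  pending(): number {
    return this.outbox.size;
  }
}

// Usage: a hypothetical ninth protocol routed through the sketch.
const PROTOCOL_EXAMPLE = "/dkg/10.0.1/example"; // hypothetical protocol ID
let peerOnline = false;
const seenIds: string[] = [];
const messenger = new SketchMessenger((_peer, _proto, messageId) => {
  if (!peerOnline) return false; // simulate an unreachable peer
  seenIds.push(messageId);
  return true;
});

const first = messenger.sendReliable("peer-a", PROTOCOL_EXAMPLE, "hello");
peerOnline = true;
const drained = messenger.processOutboxOnConnect("peer-a");
```

The point of the sketch is that the caller never touches the transport directly: a failed send comes back as `queued: true` rather than an exception, and the retry carries the same `messageId` as the first attempt.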
## Things to NOT do

- **Don't fabricate URIs.** Every URI in `mentions` must come from `dkg_search` or be freshly minted via the look-before-mint protocol.
- **Don't skip turns to "save tokens".** One annotation call per turn is cheap (a few hundred ms). Coverage wins.
- **Don't publish to VM via MCP.** That's `dkg_request_vm_publish` (marker for human review), not `/api/shared-memory/publish`. The agent is never the gating actor for on-chain commitment.
- **Don't normalise slugs in your `dkg_search` query.** Pass the unnormalised label so the daemon's fuzzy match has the most signal; only normalise when comparing for reuse-vs-mint.
- **Don't call `ProtocolRouter.send` directly for new short-message protocols.** Use `Messenger.sendReliable` (see Universal Messenger section above).

## Cheat sheet

CHANGELOG.md

Lines changed: 37 additions & 0 deletions
@@ -4,6 +4,43 @@ All notable changes to the DKG V9 node are documented here. The format is based

## [Unreleased]

## [10.0.0-rc.9] - 2026-05-17
**Universal Messenger substrate**: every short peer-to-peer DKG protocol (chat, skill request, query-remote, swm-sender-key, private-access, join-request, storage-ack, verify-proposal) now routes through a single reliability layer with at-least-once delivery, receiver-side dedup keyed by `messageId`, sender-side response cache (256 KiB inline; mark-only beyond), SQLite-persisted retry outbox surviving daemon restart, opportunistic flush on `connection:open`, DHT-walk-on-stall recovery, and per-protocol latency observability via `/api/slo`. Architecture in [`docs/messenger.md`](./docs/messenger.md); operator surfaces in [`docs/messenger-operator.md`](./docs/messenger-operator.md); migration recipe in [`docs/messenger-add-protocol.md`](./docs/messenger-add-protocol.md).
**Wire-format break**: all 8 substrate-routed protocols moved from `/dkg/10.0.0/*` to `/dkg/10.0.1/*`. Both daemons in a pair must be on rc.9 for any of these protocols to negotiate between them; mixed-pair deploys (one node rc.8, one rc.9) surface as `delivered: false, queued: true` outbox entries that drain once both sides upgrade. No backward-compatibility codepath ships — hard cutover keeps the substrate's correctness proofs simple. `/dkg/10.0.0/verify-approval` is the sole `/dkg/10.0.0/*` survivor (not a substrate caller; left bare). See the upgrade order in `docs/messenger-operator.md` § "Upgrade from rc.8 to rc.9".
**V12 + V13 SQLite migrations** run automatically on first rc.9 boot. V12 adds `message_idempotency` + `protocol_outbox` tables (additive — chat continues to write to its V11 column until V13 cuts over). V13 drops the V11 `idx_chat_msgid` partial unique index in favour of the substrate-owned dedup. `chat_messages.message_id` column is **preserved nullable** for hot-rollback safety: rc.8 finds a column it recognises if you have to downgrade. In-flight rc.8 chat-outbox entries should be drained before upgrade (let the daemon idle for one tick cycle, typically 30s); new sends post-upgrade route via the substrate outbox.
### Added — Universal Messenger substrate
- **PR-1 (#542): substrate primitives** (`packages/core/src/proto/reliable-envelope.ts`, `packages/core/src/messenger-types.ts`, `packages/core/src/protocol-outbox.ts`, `packages/node-ui/src/db.ts`, `docs/messenger.md`): introduces the `ReliableEnvelope` Protobuf wire wrapper (`{ messageId, version, tsMs, payload }`), the `MessageIdempotencyStore` + `ProtocolOutboxStore` ports, the generic `ProtocolOutbox` retry helper (5s → 15s → 30s → 60s → 5m → 30m → 2h ladder; per-key inflight guard; stale-snapshot guard lifted from rc.8 #538), and SQLite-backed `SqliteMessageIdempotencyStore` + `SqliteProtocolOutboxStore` against a V12 schema migration. 256 KiB inline response cache budget; oversize responses stored mark-only and surface `RESPONSE_GONE` to duplicate receivers.
- **PR-2 (#543): Messenger evolution + lifecycle wiring** (`packages/agent/src/p2p/messenger.ts`, `packages/agent/src/dkg-agent.ts`, `packages/cli/src/daemon/lifecycle.ts`): the `Messenger` class gains `sendReliable` + `register` substrate surfaces wrapping envelope encode + sender/receiver idempotency + outbox enqueue on recoverable failure; legacy `sendToPeer` path preserved bitwise-compatible for any `/dkg/10.0.0/*` caller. `DKGAgent` wires a `messengerOutboxTimer` periodic tick and piggy-backs `messenger.processOutboxOnConnect` onto its existing `connection:open` handler. `lifecycle.ts` instantiates the SQLite stores against the shared `DashboardDB` and routes `ackTransportFactory.sendP2P` through the Messenger (semantics-identical until `/storage-ack` migrates in PR-11).
- **PR-3 (#544): pilot migration — chat + skill onto `/dkg/10.0.1/message`** (`packages/core/src/constants.ts`, `packages/agent/src/messaging.ts`, `packages/agent/src/dkg-agent.ts`, `packages/node-ui/src/db.ts`, `docs/messenger-add-protocol.md`): `PROTOCOL_MESSAGE` prefix bumped from `/dkg/10.0.0/message` to `/dkg/10.0.1/message`. `MessageHandler.sendChat` / `sendSkillRequest` route through `messenger.sendReliable`; the in-process `MessageOutbox` chat-specific queue + its periodic tick + opportunistic-flush + stale-snapshot guard are **deleted** in favour of the substrate's generic `ProtocolOutbox`. V13 SQLite migration drops the V11 `idx_chat_msgid` partial unique index (receiver-side dedup now owned by the substrate's `message_idempotency` table). MCP `dkg_send_message` queued/attempts/nextAttemptAtMs operator surface preserved end-to-end — sourced from the substrate outbox.
- **PR-4 (#545): `ProtocolRouter.send` parallelPaths option** (`packages/core/src/protocol-router.ts`, `docs/messenger.md`): `send()` accepts a `SendOptions { parallelPaths?: number, timeoutMs?: number }` object (the legacy `timeoutMs: number` arg is still accepted for backward compat). When `parallelPaths > 1`, opens N concurrent `newStream` attempts across enumerated live connections via `Promise.any`; first success wins, losers aborted. Safe only on `/dkg/10.0.1/*` where receiver dedup is mandatory. Defaults to 1 for app-level fan-out protocols (storage-ack / verify-proposal); chat opts in at 2.
- **PR-5 (#546): DHT-walk-on-outbox-stall recovery** (`packages/agent/src/p2p/messenger.ts`, `packages/agent/src/dkg-agent.ts`, `docs/messenger.md`): outbox entries that reach `OUTBOX_STALL_THRESHOLD = 5` attempts of "no valid addresses for peer" trigger a time-bounded (`DHT_WALK_TIMEOUT_MS = 10s`), rate-limited (`DHT_WALK_RATE_LIMIT_MS = 5min/peer`) `libp2p.peerRouting.findPeer()` to refresh the receiver's addresses in the peerStore. Runs in the background so retries continue uninterrupted; the next outbox tick re-enumerates against the refreshed addresses.
- **PR-7 (#548): `--relay-preferred` CLI flag + `preferredRelays` config + relay-setup playbook** (`packages/cli/src/cli.ts`, `packages/cli/src/config.ts`, `packages/cli/src/daemon/lifecycle.ts`, `packages/cli/README.md`, `docs/messenger-operator.md`): operators can prioritise relays they control via `dkg start --relay-preferred /ip4/.../p2p/...` (repeatable) or by writing `preferredRelays: string[]` into `~/.dkg/config.json`. The new `mergePreferredRelays` helper parses both sources, dedupes (first-seen order), and prepends to the network relay list — public testnet relays remain as fallback. Operator playbook for standing up your own relay infrastructure ships in `packages/cli/README.md` § "Operator relays". PR-6 (gossip peer-hints) cancelled per Gate B; DHT walk + inbound-from-receiver are sufficient.
- **PR-8 (#550): migrate `/swm-sender-key` + `/private-access` onto the substrate** (`packages/core/src/constants.ts`, `packages/agent/src/dkg-agent.ts`, `packages/publisher/src/access-client.ts`, `packages/publisher/test/_helpers/substrate.ts`, `docs/messenger.md`): both protocols bumped to `/dkg/10.0.1/*` and routed through `messenger.register` / `messenger.sendReliable`. `AccessClient` is refactored to accept a minimal `AccessSendSurface` interface (defined locally in `access-client.ts`) instead of importing `Messenger` directly, avoiding a `publisher → agent` circular dependency. A test-only `publisher/test/_helpers/substrate.ts` shim provides substrate semantics for publisher tests without the agent package dependency.
- **PR-9 (#551): migrate `/query-remote` onto the substrate with `RESPONSE_GONE` retry recipe** (`packages/core/src/constants.ts`, `packages/agent/src/dkg-agent.ts`, `docs/messenger.md`): protocol bumped to `/dkg/10.0.1/query-remote`. New `sendQueryReliable()` helper wraps `messenger.sendReliable` and re-issues with a fresh `messageId` if the previous attempt returns `RESPONSE_GONE` (cap 2 attempts). SPARQL queries are app-layer idempotent, so the fresh-`messageId` re-issue is safe; the cap prevents infinite loops if every response is over the 256 KiB cache budget.
- **PR-10 (#554): migrate `/join-request` onto the substrate; delete `JoinApprovalRetryQueue`** (`packages/core/src/constants.ts`, `packages/agent/src/dkg-agent.ts`, `packages/agent/src/join-approval-retry-queue.ts`, `docs/messenger.md`): all three `messenger.sendToPeer` call sites for `/dkg/10.0.0/join-request` (private notification, curator-targeted forward, broadcast) replaced with `messenger.sendReliable`; protocol bumped to `/dkg/10.0.1/join-request`. The in-memory `JoinApprovalRetryQueue` + its 30s timer + on-connect handler + processor methods are **deleted** — substrate's SQLite outbox now owns retry persistence across restart. Net `-554 / +136` LOC. `listPendingJoinApprovalRetries` stubbed to return `[]` (operator diagnostic re-built atop substrate outbox in PR-12).
- **PR-11 (#555): migrate `/storage-ack` + `/verify-proposal` onto the substrate** (`packages/core/src/constants.ts`, `packages/agent/src/dkg-agent.ts`, `packages/cli/src/daemon/lifecycle.ts`, `docs/messenger.md`): both protocols bumped to `/dkg/10.0.1/*` and routed via `messenger.register` / `messenger.sendReliable`. `ACKCollector` + `VerifyCollector` quorum logic is unchanged; only the transport rewires. Three `sendP2P` wirings swap `sendToPeer → sendReliable` with `queued`-as-per-peer-throw so the collectors' existing retry loops keep their semantics. `parallelPaths` stays at 1 to avoid 9x fan-out amplification on top of the existing app-layer quorum. `/dkg/10.0.0/verify-approval` intentionally stays bare (not a substrate caller).
- **PR-12 (#557): per-message latency histogram + `/api/slo` endpoint + soak script extension** (`packages/agent/src/p2p/messenger.ts`, `packages/agent/src/dkg-agent.ts`, `packages/cli/src/daemon/routes/agent-chat.ts`, `scripts/libp2p-soak-test.sh`, `docs/messenger.md`, `docs/messenger-operator.md`): in-memory per-protocol histogram of `sendReliable` invoke → `delivered: true` latency across queue + retries (1000-sample sliding window via `DEFAULT_SLO_WINDOW_SAMPLES`). Exposes `{ samples, p50Ms, p95Ms, p99Ms, delivered, queued }` per protocol via `GET /api/slo` (localhost-only by default; same `Authorization: Bearer` requirement as every other `/api/*` route). Soak script extended with per-cycle `/api/slo` snapshot (`slo.jsonl` + human-readable summary line in `main.log`) covering all 8 protocols. Source of truth for the ship-gate SLO measurement.
- **PR-13: docs consolidation pass** (`docs/messenger.md`, `docs/messenger-operator.md`, `AGENTS.md`, `CHANGELOG.md`): sequence diagrams (topology + happy-path + recovery + multi-path) polished into `docs/messenger.md`; per-protocol coverage table marked all-rc.9-shipped; SLO reading guide folded into `docs/messenger-operator.md`; `AGENTS.md` gets a Universal Messenger paragraph + convergence rule (route new short-message protocols via `Messenger.sendReliable`, never `ProtocolRouter.send` directly); this CHANGELOG entry.
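The retry ladder named in the PR-1 entry (5s → 15s → 30s → 60s → 5m → 30m → 2h) can be sketched as a lookup table. This is a minimal sketch and not the `ProtocolOutbox` implementation: `RETRY_LADDER_MS`, `nextRetryDelayMs`, and `nextAttemptAtMs` are hypothetical names, and repeating the final 2h rung once the ladder is exhausted is an assumption the changelog does not confirm.

```typescript
// Ladder from the PR-1 entry: 5s, 15s, 30s, 60s, 5m, 30m, 2h (in milliseconds).
const RETRY_LADDER_MS: readonly number[] = [
  5_000,
  15_000,
  30_000,
  60_000,
  5 * 60_000,
  30 * 60_000,
  2 * 60 * 60_000,
];

// Delay before the next retry, given how many attempts have already failed.
// Assumption: once the ladder is exhausted, the last rung (2h) repeats.
function nextRetryDelayMs(failedAttempts: number): number {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError("failedAttempts must be a positive integer");
  }
  const rung = Math.min(failedAttempts, RETRY_LADDER_MS.length) - 1;
  const delay = RETRY_LADDER_MS[rung];
  if (delay === undefined) {
    throw new Error("retry ladder is empty");
  }
  return delay;
}

// Absolute timestamp of the next attempt (hypothetical helper, loosely
// mirroring the `nextAttemptAtMs` field surfaced to operators).
function nextAttemptAtMs(nowMs: number, failedAttempts: number): number {
  return nowMs + nextRetryDelayMs(failedAttempts);
}
```

Under these assumptions a message that has failed twice waits 15 seconds before its third attempt, and anything past the seventh failure is retried every two hours.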
### Cancelled
- **PR-6 (gossip peer-hints, conditional)**: skipped per Gate B decision — DHT walk + inbound-from-receiver are sufficient for the rc.9 SLO target. If a post-ship soak surfaces a reliability tail that DHT walk doesn't close, PR-6 lands as a fast follow-up under the original gossip-hints design (signed topic, 5min publish cadence, 10min replay window).
### Carry-over from rc.8 (preserved by the substrate)
The substrate explicitly preserves five rc.8 invariants surfaced by PRs #533 / #534 / #536 / #537 / #538:

1. **Receiver-side dedup contract** (rc.8 #534): the generic `Messenger.register` wrapper performs a `(peer, protocol, message_id, 'in')` lookup before invoking the handler, so each message is applied exactly once even when it is delivered more than once.
2. **Targeted `ON CONFLICT`** (rc.8 #534): `message_idempotency` applies `INSERT ... ON CONFLICT DO NOTHING` only to the dedup case, so other constraint violations still surface instead of being silently swallowed.
3. **Stale-snapshot guard** (rc.8 #538): `ProtocolOutbox.processOutboxOnConnect` and `processOutboxTick` call `hasEntry()` between `tryBeginAttempt()` and the wire send. Contract test pinned in `messenger-substrate.test.ts`.
4. **Peer ID normalisation** (rc.8 #533): every libp2p-boundary code path uses `peerId.toString()` consistently for outbox / idempotency / diagnostics lookups.
5. **Connection reuse logic** (rc.8 #537): PR-4 multi-path enumeration walks `getConnections()` directly rather than keying on peer ID.
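The receiver-side dedup contract in item 1 can be sketched as a wrapper that checks a `(peer, protocol, message_id, 'in')` key before running the handler. The sketch is illustrative only and rests on stated assumptions: `SketchReceiver`, `Reply`, and `CachedResponse` are hypothetical names, a `Map` stands in for the `message_idempotency` table, and string length approximates response size in bytes. It also assumes the node that ran the handler is the one caching the response, which the entries above imply but do not spell out.

```typescript
// 256 KiB inline budget, per the PR-1 entry.
const INLINE_CACHE_BUDGET_BYTES = 256 * 1024;

type Handler = (payload: string) => string;
type CachedResponse = { kind: "inline"; response: string } | { kind: "mark-only" };
type Reply = { status: "OK"; response: string } | { status: "RESPONSE_GONE" };

class SketchReceiver {
  private handlers = new Map<string, Handler>();
  // In-memory stand-in for the `message_idempotency` table.
  private seen = new Map<string, CachedResponse>();

  register(protocolId: string, handler: Handler): void {
    this.handlers.set(protocolId, handler);
  }

  receive(peerId: string, protocolId: string, messageId: string, payload: string): Reply {
    const handler = this.handlers.get(protocolId);
    if (!handler) throw new Error(`no handler registered for ${protocolId}`);

    // Dedup key mirrors the (peer, protocol, message_id, 'in') lookup.
    const key = JSON.stringify([peerId, protocolId, messageId, "in"]);
    const cached = this.seen.get(key);
    if (cached) {
      // Duplicate delivery: never re-run the handler.
      return cached.kind === "inline"
        ? { status: "OK", response: cached.response }
        : { status: "RESPONSE_GONE" };
    }

    const response = handler(payload);
    // Approximation: string length stands in for the encoded byte size.
    const fitsInline = response.length <= INLINE_CACHE_BUDGET_BYTES;
    this.seen.set(key, fitsInline ? { kind: "inline", response } : { kind: "mark-only" });
    return { status: "OK", response };
  }
}

// Usage: the same messageId arrives twice; the handler runs once.
let handlerRuns = 0;
const receiver = new SketchReceiver();
receiver.register("/dkg/10.0.1/message", (payload) => {
  handlerRuns += 1;
  return `ack:${payload}`;
});
const firstReply = receiver.receive("peer-a", "/dkg/10.0.1/message", "m-1", "hi");
const retryReply = receiver.receive("peer-a", "/dkg/10.0.1/message", "m-1", "hi");
```

A duplicate whose response fit the inline budget gets the cached response back. One whose response was stored mark-only gets `RESPONSE_GONE`, which is the case the `/query-remote` retry recipe in PR-9 handles by re-issuing with a fresh `messageId`.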
---

## [10.0.0-rc.8] - 2026-05-15
