Durable federation relay for room-service cross-site events#410
Durable federation relay for room-service cross-site events#410hmchangw wants to merge 2 commits into
Conversation
|
Warning Review limit reached
Next review available in: 52 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
📝 WalkthroughWalkthroughThe PR adds a ROOMS-stream federation relay for cross-site room-service events, updates room-worker to consume and forward relay events, refreshes client-facing docs and implementation notes, and adds a research document on dependency instability. ChangesRoom Federation Relay
Dependency Instability Research
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
docs/research/dependency-instability-impact.md (2)
54-60: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd blank lines around table.
Surround the table with blank lines to satisfy markdownlint MD058.
| Cross-site event | Publisher | Origin context | Implicit outbox? | |---|---|---|---| + | message persist / thread-subscription | `message-worker` | consumes `MESSAGES_CANONICAL` (JS) | ✅ yes (Nak → redeliver) |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/research/dependency-instability-impact.md` around lines 54 - 60, The Markdown table in dependency-instability-impact.md needs blank lines before and after it to satisfy MD058. Update the surrounding prose near the cross-site event table so the table is separated from adjacent text by empty lines, keeping the existing table content unchanged.
52-52: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueFix heading hierarchy.
The
h4heading "Federation publisher map" follows anh2without an interveningh3. Change toh3to satisfy markdownlint MD001.- #### Federation publisher map (who has an outbox) + ### Federation publisher map (who has an outbox)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/research/dependency-instability-impact.md` at line 52, The heading hierarchy is inconsistent because the “Federation publisher map” section is using a lower-level heading directly under an h2 without an intervening h3. Update that heading in the markdown so it uses h3 instead of h4, keeping the surrounding section structure in the dependency-instability document aligned with markdownlint MD001.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@room-service/handler.go`:
- Around line 751-752: The new federation-target failure paths in the
room-service handler are returning bare errors, so update the affected handlers
to wrap `buildFederationTarget` failures with handler-specific context instead
of returning `err` directly. Add descriptive context that identifies the
federation target/action that failed for each path, following the existing
pattern in the room handler methods that cover `subscription_read`,
`thread_read`, `room_restricted`, `mute-toggled`, and `favorite-toggled`, so
logs clearly show which target caused the error.
- Around line 1977-1986: The room-restricted InboxEvent envelope is generating a
second timestamp that can drift from the payload and origin write time. In the
room service handler, update the federation flow around buildFederationTarget
and the InboxEvent creation for room_restricted so the outer envelope uses
req.Timestamp instead of calling time.Now().UTC().UnixMilli(). Keep the shared
timestamp convention consistent with RoomRestrictedInboxPayload.Timestamp and
the high-water-mark guard logic.
In `@room-worker/handler_test.go`:
- Around line 5361-5365: The test setup in the RoomFederationEvent fixture is
ignoring json.Marshal errors, which can hide broken test data and mislead
failure classification. In the handler_test.go setup around the model.InboxEvent
and model.RoomFederationEvent marshals, capture both errors and assert them with
require.NoError so fixture creation fails loudly. Use the existing test helpers
and keep the marshaling logic intact while removing the silent discard of
errors.
In `@room-worker/handler.go`:
- Around line 324-333: Reject federation targets missing eventType or dedupId
before calling h.publish in handler.go. Update the validation in the evt.Targets
loop to treat empty t.EventType and t.DedupID as invalid alongside the existing
DestSiteID and Envelope checks, and log the skip with enough context to identify
the bad target. This keeps subject.InboxExternal and h.publish from receiving
malformed inputs that would bypass the durable NATS path or route to the wrong
subject.
In `@room-worker/integration_test.go`:
- Around line 2031-2065: Create the destination INBOX stream before exercising
the federation publish path. In the integration test around the
`stream.Rooms(siteID)` setup and `processFederation` publish closure, also
create the `stream.Inbox(destSiteID)` JetStream stream so
`subject.InboxExternal(destSiteID, ...)` has a matching destination. Keep the
existing `js.CreateOrUpdateStream` pattern and ensure the stream is bound to the
inbox subject family used by the `publish`/`js.PublishMsg` path.
In `@room-worker/main.go`:
- Around line 179-190: Fail fast on invalid worker configuration by validating
cfg.MaxWorkers before startConsumer uses it. Add an early check in the main
startup/config path so MAX_WORKERS must be greater than zero, and return a clear
error instead of continuing into PullMaxMessages or make(chan struct{},
cfg.MaxWorkers). Use the existing cfg.MaxWorkers and startConsumer path to
locate the fix.
---
Nitpick comments:
In `@docs/research/dependency-instability-impact.md`:
- Around line 54-60: The Markdown table in dependency-instability-impact.md
needs blank lines before and after it to satisfy MD058. Update the surrounding
prose near the cross-site event table so the table is separated from adjacent
text by empty lines, keeping the existing table content unchanged.
- Line 52: The heading hierarchy is inconsistent because the “Federation
publisher map” section is using a lower-level heading directly under an h2
without an intervening h3. Update that heading in the markdown so it uses h3
instead of h4, keeping the surrounding section structure in the
dependency-instability document aligned with markdownlint MD001.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7f556c96-12cf-426d-a915-eb150107d0fa
📒 Files selected for processing (14)
docs/client-api.mddocs/research/dependency-instability-impact.mddocs/superpowers/plans/2026-06-28-room-federation-relay.mdpkg/model/event.gopkg/model/model_test.gopkg/subject/subject.gopkg/subject/subject_test.goroom-service/handler.goroom-service/handler_test.goroom-worker/consumer_config_test.goroom-worker/handler.goroom-worker/handler_test.goroom-worker/integration_test.goroom-worker/main.go
52c8a74 to
08e9aff
Compare
Research the failure-mode impact, project/release stability, and operational-reliability data for the four core infra dependencies (NATS/JetStream, MongoDB, Cassandra, Valkey), identifying the request/reply-originated cross-site federation publish as the top durability exposure, plus the implementation plan executed in the following commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj
08e9aff to
1cdb6c0
Compare
…te events
Six room-service request/reply handlers (role_updated, mute/favorite
toggled, subscription_read, thread_read, room_restricted) federated
cross-site events by publishing an InboxEvent inline straight to a remote
site's INBOX across a supercluster gateway. On failure the error returned
to the client *after* the local Mongo write committed, so local and remote
diverged with no durable retry.
Replace this with a durable "federation relay": each handler keeps its
synchronous Mongo write and reply but publishes one RoomFederationEvent to
the local ROOMS stream; room-worker forwards each wrapped InboxEvent to the
destination INBOX with at-least-once retry — the source stream is the
outbox. The producer publish is local-cluster only, so a remote outage can
never block the user's RPC, and a destination-site outage delays the event
(retry-forever with escalating backoff) rather than dropping it.
- pkg/model: RoomFederationEvent + FederationTarget envelope types.
- pkg/subject: RoomCanonicalFederation builder
(chat.room.canonical.{siteID}.federation).
- room-service: federate + buildFederationTarget helpers; six handlers
converted. Wire format is byte-identical to the prior direct publishes,
so inbox-worker is unchanged.
- room-worker: processFederation forwards each target (transient error ->
Nak/redeliver, malformed -> Ack-poison), validating destSiteID/eventType/
envelope/dedupId at the boundary, each attempt bounded by a 3s fail-fast
timeout. It runs on a dedicated durable consumer + worker pool, isolated
from the membership consumer (filtered to create/member.add/member.remove/
room.rename), so an unreachable destination backs up only the federation
lane, never local membership processing. The federation lane retries a
failed forward forever with escalating backoff (5s -> 5m, MaxDeliver=-1),
so a long destination outage delays — never drops — the event. Fails fast
on non-positive MAX_WORKERS.
- docs/client-api.md: cross-site federation note for all six RPCs.
- Tests: forwarder, the two consumer configs, all six handlers (relay
envelope + byte-identical wrapped InboxEvent), a model round-trip, and an
end-to-end JetStream integration round-trip.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj
1cdb6c0 to
2dbac20
Compare
Summary
This branch began as a research task on the impact of dependency instability (NATS/JetStream, MongoDB, Cassandra, Valkey) and then implements a fix for the top exposure that research surfaced.
The problem: six
room-servicerequest/reply handlers federated cross-site events by publishing amodel.InboxEventinline straight to a remote site's INBOX (chat.inbox.{dest}.external.{type}). That publish crosses a NATS supercluster gateway; if it fails, the error returns to the client after the local Mongo write already committed — so local and remote silently diverge with no durable retry.The fix — a durable "federation relay": each handler keeps its synchronous Mongo write and response, but now builds the same
InboxEventbytes, wraps them in aRoomFederationEvent, and publishes one envelope to the localROOMSstream (chat.room.canonical.{siteID}.federation).room-workerforwards each wrapped event to its destination INBOX with at-least-once retry — the source stream is the outbox. No new stream, no new service.What changed
pkg/model—RoomFederationEvent+FederationTargetenvelope types.pkg/subject—RoomCanonicalFederation(siteID)builder.room-service—federate+buildFederationTargethelpers; six handlers converted:updateRole,muteToggle,favoriteToggle,messageRead,messageThreadRead,roomRestricted.room-worker—processFederationforwards each target tochat.inbox.{dest}.external.{type}(transient error → Nak/redeliver, malformed → Ack-poison), validatingdestSiteID/eventType/envelope/dedupIdat the boundary with a 3s per-attempt fail-fast timeout. Runs on its own durable consumer + worker pool (room-worker-federation), isolated from the membership consumer.docs/client-api.md— cross-site federation note for all six RPCs.Behavior under a destination-site outage (the design goal)
ROOMSretention)The producer publishes only to the local
ROOMSstream, so a remote outage never blocks or errors the user's RPC. The federation consumer retries a failed forward forever with escalating backoff (5s → 15s → 1m → 5m,MaxDeliver=-1), so a long outage delays the event rather than dropping it.Design notes
ROOMS+room-worker— no new stream or service. The.federationsubject is not matched bynotification-worker's exact...event.member.mutedfilter.FilterSubjects=create/member.add/member.remove/room.rename, defaultMaxDeliver=5) and the federation consumer (.federation,MaxDeliver=-1+ backoff) have separate worker pools, so an unreachable destination backs up only the federation lane — never local membership processing (member add/remove/create/rename).room-servicestill marshals the sameInboxEvent;room-workerforwards those exact bytes, so the destinationinbox-workerhandlers are unchanged.$ltguards (lastSeenAt,muteUpdatedAt,favoriteUpdatedAt,rolesUpdatedAt,visibilityUpdatedAt); the stableDedupIDplus those guards make a re-forward (including a timed-out-but-actually-delivered publish) a no-op.user_status_updatedis deliberately left untouched (best-effort by design, owned byuser-service).Verification
make test(full suite, race detector): PASSmake lint: 0 issues;make sastgosec: PASS; no store-interface or mock changes (make generateis a no-op)buildFederationConsumerConfigand exercises the round-trip, so CI validates that nats-server acceptsMaxDeliver=-1+BackOff.Ops notes
room-workerdurable'sFilterSubjectsis narrowed on deploy (supported in-place on nats-server 2.10+); the newroom-worker-federationdurable is self-created at startup. No stream/IaC change.ROOMSstream survives a node loss and retains messages for at least the longest tolerated destination outage — confirmROOMS(andMESSAGES_CANONICAL/INBOX) are provisioned R3 + file storage with adequateMaxAgein IaC.Also included (working docs)
docs/research/dependency-instability-impact.md— the dependency-instability research report.docs/superpowers/plans/2026-06-28-room-federation-relay.md— the implementation plan this branch executed. Happy to drop the plan doc if you'd prefer it not ship.Test plan
ROOMS(andMESSAGES_CANONICAL/INBOX) are R3 + file storage with adequate retention in IaCroom-workerdurable filter update androom-worker-federationdurable creation succeed against the running nats-server🤖 Generated with Claude Code
https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj
Summary by CodeRabbit