Durable federation relay for room-service cross-site events by hmchangw · Pull Request #410 · hmchangw/chat

hmchangw · 2026-06-28T13:02:28Z

Summary

This branch began as research into the impact of dependency instability (NATS/JetStream, MongoDB, Cassandra, Valkey), and it now implements a fix for the top exposure that research surfaced.

The problem: six room-service request/reply handlers federated cross-site events by publishing a model.InboxEvent inline straight to a remote site's INBOX (chat.inbox.{dest}.external.{type}). That publish crosses a NATS supercluster gateway; if it fails, the error returns to the client after the local Mongo write has already committed, so local and remote silently diverge with no durable retry.

The fix — a durable "federation relay": each handler keeps its synchronous Mongo write and response, but now builds the same InboxEvent bytes, wraps them in a RoomFederationEvent, and publishes one envelope to the local ROOMS stream (chat.room.canonical.{siteID}.federation). room-worker forwards each wrapped event to its destination INBOX with at-least-once retry — the source stream is the outbox. No new stream, no new service.

What changed

pkg/model — RoomFederationEvent + FederationTarget envelope types.
pkg/subject — RoomCanonicalFederation(siteID) builder.
room-service — federate + buildFederationTarget helpers; six handlers converted: updateRole, muteToggle, favoriteToggle, messageRead, messageThreadRead, roomRestricted.
room-worker — processFederation forwards each target to chat.inbox.{dest}.external.{type} (transient error → Nak/redeliver, malformed → Ack-poison), validating destSiteID/eventType/envelope/dedupId at the boundary with a 3s per-attempt fail-fast timeout. Runs on its own durable consumer + worker pool (room-worker-federation), isolated from the membership consumer.
docs/client-api.md — cross-site federation note for all six RPCs.
Tests — forwarder, both consumer configs, all six handlers, a model round-trip, and an end-to-end JetStream integration round-trip.

Behavior under a destination-site outage (the design goal)

| Remote unreachable for | Before (inline publish) | This PR |
|---|---|---|
| 10 s | client error, event lost | RPC succeeds; forward retried, delivered in seconds |
| 3 min | client error, event lost | RPC succeeds; delivered when remote recovers |
| 1 hour | client error, event lost | RPC succeeds; delivered when remote recovers |
| 1 day | client error, event lost | RPC succeeds; delivered when remote recovers (bounded by `ROOMS` retention) |

The producer publishes only to the local ROOMS stream, so a remote outage never blocks or errors the user's RPC. The federation consumer retries a failed forward forever with escalating backoff (5s → 15s → 1m → 5m, MaxDeliver=-1), so a long outage delays the event rather than dropping it.

Design notes

Reuses ROOMS + room-worker — no new stream or service. The .federation subject is not matched by notification-worker's exact ...event.member.muted filter.
Two isolated lanes on one stream: the membership consumer (FilterSubjects = create/member.add/member.remove/room.rename, default MaxDeliver=5) and the federation consumer (.federation, MaxDeliver=-1 + backoff) have separate worker pools, so an unreachable destination backs up only the federation lane — never local membership processing (member add/remove/create/rename).
Wire format is byte-identical — room-service still marshals the same InboxEvent; room-worker forwards those exact bytes, so the destination inbox-worker handlers are unchanged.
Redelivery is safe — every destination handler is idempotent via high-water-mark $lt guards (lastSeenAt, muteUpdatedAt, favoriteUpdatedAt, rolesUpdatedAt, visibilityUpdatedAt); the stable DedupID plus those guards make a re-forward (including a timed-out-but-actually-delivered publish) a no-op.
Behavior changes (intentional): (1) federation is now asynchronous — a cross-site gateway hiccup no longer fails the user's local RPC; (2) the five toggle/read/role events now forward with a stable dedup ID (previously empty), an idempotency improvement with no change to the event bytes.
user_status_updated is deliberately left untouched (best-effort by design, owned by user-service).
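The redelivery-safety argument in the notes above rests on the high-water-mark guards. As an in-memory illustration of that idea (the real handlers apply it as a `$lt` filter in a Mongo update; the `subscription` struct and `applyMute` function here are hypothetical):

```go
package main

import "fmt"

// subscription is a toy stand-in for the destination-side document.
type subscription struct {
	Muted         bool
	MuteUpdatedAt int64 // unix millis of the last applied mute event
}

// applyMute applies the event only if the stored high-water mark is strictly
// older than the event timestamp, mirroring a filter of the form
// {muteUpdatedAt: {$lt: ts}}. It reports whether anything changed.
func applyMute(s *subscription, muted bool, ts int64) bool {
	if s.MuteUpdatedAt >= ts {
		return false // duplicate or stale event: no-op
	}
	s.Muted = muted
	s.MuteUpdatedAt = ts
	return true
}

func main() {
	s := &subscription{}
	fmt.Println(applyMute(s, true, 1000))  // first delivery applies
	fmt.Println(applyMute(s, true, 1000))  // re-forward of the same event is a no-op
	fmt.Println(applyMute(s, false, 900))  // older event arriving late is ignored
	fmt.Println(applyMute(s, false, 1100)) // newer event applies
}
```

Because the comparison is strict, a publish that timed out at the forwarder but was in fact delivered can be forwarded again without changing the destination state.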

Verification

make test (full suite, race detector): PASS
make lint: 0 issues; make sast gosec: PASS; no store-interface or mock changes (make generate is a no-op)
The room-worker integration test creates the federation consumer via buildFederationConsumerConfig and exercises the round-trip, so CI validates that nats-server accepts MaxDeliver=-1 + BackOff.

Ops notes

The existing room-worker durable's FilterSubjects is narrowed on deploy (supported in-place on nats-server 2.10+); the new room-worker-federation durable is self-created at startup. No stream/IaC change.
The relay's durability assumes the source ROOMS stream survives a node loss and retains messages for at least the longest tolerated destination outage — confirm ROOMS (and MESSAGES_CANONICAL/INBOX) are provisioned R3 + file storage with adequate MaxAge in IaC.

Also included (working docs)

docs/research/dependency-instability-impact.md — the dependency-instability research report.
docs/superpowers/plans/2026-06-28-room-federation-relay.md — the implementation plan this branch executed. Happy to drop the plan doc if you'd prefer it not ship.

Test plan

CI green (unit, lint, sast, and the room-worker integration test)
Confirm ROOMS (and MESSAGES_CANONICAL/INBOX) are R3 + file storage with adequate retention in IaC
On deploy, confirm the room-worker durable filter update and room-worker-federation durable creation succeed against the running nats-server

🤖 Generated with Claude Code

https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj

Summary by CodeRabbit

New Features
- Introduced a durable, asynchronous federation relay for selected room-service cross-site events (role updates, message/thread reads, mute/favorite toggles, and room restrictions).
Bug Fixes
- Improved cross-site delivery reliability by forwarding via JetStream-backed relay with at-least-once retry semantics and deduplication.
Documentation
- Updated client API documentation with clarified federation coordination behavior.
- Added new research note on dependency instability impacts and a relay implementation plan.
Tests
- Added/updated unit and end-to-end tests to validate relay wrapping, forwarding, error handling, and round-trip correctness.

coderabbitai · 2026-06-28T13:02:35Z


Review details

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d188a1c-a478-4dfb-b362-2ccc347e6708

📥 Commits

Reviewing files that changed from the base of the PR and between 1cdb6c0 and 2dbac20.

📒 Files selected for processing (12)

docs/client-api.md
pkg/model/event.go
pkg/model/model_test.go
pkg/subject/subject.go
pkg/subject/subject_test.go
room-service/handler.go
room-service/handler_test.go
room-worker/consumer_config_test.go
room-worker/handler.go
room-worker/handler_test.go
room-worker/integration_test.go
room-worker/main.go

📝 Walkthrough

Walkthrough

The PR adds a ROOMS-stream federation relay for cross-site room-service events, updates room-worker to consume and forward relay events, refreshes client-facing docs and implementation notes, and adds a research document on dependency instability.

Changes

Room Federation Relay

| Layer / File(s) | Summary |
|---|---|
| Federation model and subject `pkg/model/event.go`, `pkg/model/model_test.go`, `pkg/subject/subject.go`, `pkg/subject/subject_test.go` | `FederationTarget` and `RoomFederationEvent` are added, with round-trip JSON coverage and a new `RoomCanonicalFederation(siteID)` subject builder. |
| room-service federation publish path `room-service/handler.go`, `room-service/handler_test.go` | `buildFederationTarget` and `Handler.federate` are added, and cross-site handlers are switched to publish `RoomFederationEvent` envelopes instead of direct external inbox publishes. Tests are updated to decode relay envelopes and embedded inbox payloads. |
| room-worker federation lane `room-worker/handler.go`, `room-worker/main.go`, `room-worker/consumer_config_test.go`, `room-worker/handler_test.go`, `room-worker/integration_test.go` | `HandleJetStreamMsg` now dispatches `.federation` subjects to `processFederation`, which forwards each `FederationTarget` to the destination inbox. `main.go` splits membership and federation consumers, and tests cover config, forwarding, error handling, and an embedded JetStream round trip. |
| Client API and implementation plan `docs/client-api.md`, `docs/superpowers/plans/2026-06-28-room-federation-relay.md` | The client API docs now describe ROOMS-stream federation for the affected RPCs, and the implementation plan documents the relay design and rollout tasks. |
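The split between the membership and federation lanes comes down to which subjects each consumer filters on. The sketch below classifies a subject into a lane. It assumes the membership subjects share the `chat.room.canonical.{siteID}.` prefix with the federation subject and end in the filter names listed in the PR description; that layout and the `laneFor` helper are hypothetical, since only the federation subject is spelled out in full.

```go
package main

import (
	"fmt"
	"strings"
)

// membershipSuffixes are the filter names from the PR description; the full
// subject layout around them is an assumption.
var membershipSuffixes = []string{"create", "member.add", "member.remove", "room.rename"}

// laneFor reports which consumer lane would receive a subject, or "" when
// neither filter matches.
func laneFor(siteID, subject string) string {
	prefix := "chat.room.canonical." + siteID + "."
	if !strings.HasPrefix(subject, prefix) {
		return ""
	}
	suffix := strings.TrimPrefix(subject, prefix)
	if suffix == "federation" {
		return "federation"
	}
	for _, s := range membershipSuffixes {
		if suffix == s {
			return "membership"
		}
	}
	return ""
}

func main() {
	for _, subj := range []string{
		"chat.room.canonical.site-a.federation",
		"chat.room.canonical.site-a.member.add",
		"chat.room.canonical.site-a.something.else",
	} {
		fmt.Printf("%s -> %q\n", subj, laneFor("site-a", subj))
	}
}
```

Since no subject matches both filters, a backlog on the federation lane cannot consume deliveries or worker slots meant for membership processing.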

Dependency Instability Research

| Layer / File(s) | Summary |
|---|---|
| Dependency instability research document `docs/research/dependency-instability-impact.md` | Adds a research note covering failure modes, release stability, operational reliability, recommendations, caveats, and sources for NATS/JetStream, MongoDB, Cassandra, and Valkey. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hmchangw/chat#168: Both PRs touch room-worker consumer configuration; this PR extends it with a separate federation lane and updated filter subjects.
hmchangw/chat#217: This PR’s federation relay path covers subscription_mute_toggled, which is directly related to the mute-toggle event flow introduced there.
hmchangw/chat#342: Both PRs modify room-service/handler.go’s messageRead flow and its related cross-site read event handling.

Suggested labels

ready

Suggested reviewers

Joey0538
mliu33

🐇 Hops through ROOMS, the relay is awake,
One envelope sent for each site to take.
DedupIDs twinkle, the inboxes hum,
And federated bunnies keep messages glum-free? No—fun!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.95% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a durable federation relay for room-service cross-site events. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

docs/research/dependency-instability-impact.md (2)

54-60: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add blank lines around table.

Surround the table with blank lines to satisfy markdownlint MD058.

  | Cross-site event | Publisher | Origin context | Implicit outbox? |
  |---|---|---|---|
+
  | message persist / thread-subscription | `message-worker` | consumes `MESSAGES_CANONICAL` (JS) | ✅ yes (Nak → redeliver) |

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/research/dependency-instability-impact.md` around lines 54 - 60, The
Markdown table in dependency-instability-impact.md needs blank lines before and
after it to satisfy MD058. Update the surrounding prose near the cross-site
event table so the table is separated from adjacent text by empty lines, keeping
the existing table content unchanged.

52-52: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Fix heading hierarchy.

The h4 heading "Federation publisher map" follows an h2 without an intervening h3. Change to h3 to satisfy markdownlint MD001.

- #### Federation publisher map (who has an outbox)
+ ### Federation publisher map (who has an outbox)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/research/dependency-instability-impact.md` at line 52, The heading
hierarchy is inconsistent because the “Federation publisher map” section is
using a lower-level heading directly under an h2 without an intervening h3.
Update that heading in the markdown so it uses h3 instead of h4, keeping the
surrounding section structure in the dependency-instability document aligned
with markdownlint MD001.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@room-service/handler.go`:
- Around line 751-752: The new federation-target failure paths in the
room-service handler are returning bare errors, so update the affected handlers
to wrap `buildFederationTarget` failures with handler-specific context instead
of returning `err` directly. Add descriptive context that identifies the
federation target/action that failed for each path, following the existing
pattern in the room handler methods that cover `subscription_read`,
`thread_read`, `room_restricted`, `mute-toggled`, and `favorite-toggled`, so
logs clearly show which target caused the error.
- Around line 1977-1986: The room-restricted InboxEvent envelope is generating a
second timestamp that can drift from the payload and origin write time. In the
room service handler, update the federation flow around buildFederationTarget
and the InboxEvent creation for room_restricted so the outer envelope uses
req.Timestamp instead of calling time.Now().UTC().UnixMilli(). Keep the shared
timestamp convention consistent with RoomRestrictedInboxPayload.Timestamp and
the high-water-mark guard logic.

In `@room-worker/handler_test.go`:
- Around line 5361-5365: The test setup in the RoomFederationEvent fixture is
ignoring json.Marshal errors, which can hide broken test data and mislead
failure classification. In the handler_test.go setup around the model.InboxEvent
and model.RoomFederationEvent marshals, capture both errors and assert them with
require.NoError so fixture creation fails loudly. Use the existing test helpers
and keep the marshaling logic intact while removing the silent discard of
errors.

In `@room-worker/handler.go`:
- Around line 324-333: Reject federation targets missing eventType or dedupId
before calling h.publish in handler.go. Update the validation in the evt.Targets
loop to treat empty t.EventType and t.DedupID as invalid alongside the existing
DestSiteID and Envelope checks, and log the skip with enough context to identify
the bad target. This keeps subject.InboxExternal and h.publish from receiving
malformed inputs that would bypass the durable NATS path or route to the wrong
subject.

In `@room-worker/integration_test.go`:
- Around line 2031-2065: Create the destination INBOX stream before exercising
the federation publish path. In the integration test around the
`stream.Rooms(siteID)` setup and `processFederation` publish closure, also
create the `stream.Inbox(destSiteID)` JetStream stream so
`subject.InboxExternal(destSiteID, ...)` has a matching destination. Keep the
existing `js.CreateOrUpdateStream` pattern and ensure the stream is bound to the
inbox subject family used by the `publish`/`js.PublishMsg` path.

In `@room-worker/main.go`:
- Around line 179-190: Fail fast on invalid worker configuration by validating
cfg.MaxWorkers before startConsumer uses it. Add an early check in the main
startup/config path so MAX_WORKERS must be greater than zero, and return a clear
error instead of continuing into PullMaxMessages or make(chan struct{},
cfg.MaxWorkers). Use the existing cfg.MaxWorkers and startConsumer path to
locate the fix.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f556c96-12cf-426d-a915-eb150107d0fa

📥 Commits

Reviewing files that changed from the base of the PR and between ea1db13 and f885310.

📒 Files selected for processing (14)

docs/client-api.md
docs/research/dependency-instability-impact.md
docs/superpowers/plans/2026-06-28-room-federation-relay.md
pkg/model/event.go
pkg/model/model_test.go
pkg/subject/subject.go
pkg/subject/subject_test.go
room-service/handler.go
room-service/handler_test.go
room-worker/consumer_config_test.go
room-worker/handler.go
room-worker/handler_test.go
room-worker/integration_test.go
room-worker/main.go

Research the failure-mode impact, project/release stability, and operational-reliability data for the four core infra dependencies (NATS/JetStream, MongoDB, Cassandra, Valkey), identifying the request/reply-originated cross-site federation publish as the top durability exposure, plus the implementation plan executed in the following commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj

…te events

Six room-service request/reply handlers (role_updated, mute/favorite toggled, subscription_read, thread_read, room_restricted) federated cross-site events by publishing an InboxEvent inline straight to a remote site's INBOX across a supercluster gateway. On failure the error returned to the client *after* the local Mongo write committed, so local and remote diverged with no durable retry.

Replace this with a durable "federation relay": each handler keeps its synchronous Mongo write and reply but publishes one RoomFederationEvent to the local ROOMS stream; room-worker forwards each wrapped InboxEvent to the destination INBOX with at-least-once retry, so the source stream is the outbox. The producer publish is local-cluster only, so a remote outage can never block the user's RPC, and a destination-site outage delays the event (retry-forever with escalating backoff) rather than dropping it.

- pkg/model: RoomFederationEvent + FederationTarget envelope types.
- pkg/subject: RoomCanonicalFederation builder (chat.room.canonical.{siteID}.federation).
- room-service: federate + buildFederationTarget helpers; six handlers converted. Wire format is byte-identical to the prior direct publishes, so inbox-worker is unchanged.
- room-worker: processFederation forwards each target (transient error -> Nak/redeliver, malformed -> Ack-poison), validating destSiteID/eventType/envelope/dedupId at the boundary, each attempt bounded by a 3s fail-fast timeout. It runs on a dedicated durable consumer + worker pool, isolated from the membership consumer (filtered to create/member.add/member.remove/room.rename), so an unreachable destination backs up only the federation lane, never local membership processing. The federation lane retries a failed forward forever with escalating backoff (5s -> 5m, MaxDeliver=-1), so a long destination outage delays, never drops, the event. Fails fast on non-positive MAX_WORKERS.
- docs/client-api.md: cross-site federation note for all six RPCs.
- Tests: forwarder, the two consumer configs, all six handlers (relay envelope + byte-identical wrapped InboxEvent), a model round-trip, and an end-to-end JetStream integration round-trip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj

hmchangw changed the title ~~docs: add dependency-instability impact research report~~ Durable federation relay for room-service cross-site events Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026


Comment thread room-service/handler.go Outdated

Comment thread room-service/handler.go

Comment thread room-worker/handler_test.go Outdated

Comment thread room-worker/handler.go

Comment thread room-worker/integration_test.go

Comment thread room-worker/main.go

hmchangw force-pushed the claude/dependency-instability-impact-bn7y1h branch 4 times, most recently from 52c8a74 to 08e9aff Compare June 30, 2026 00:54

hmchangw force-pushed the claude/dependency-instability-impact-bn7y1h branch from 08e9aff to 1cdb6c0 Compare June 30, 2026 01:46

hmchangw force-pushed the claude/dependency-instability-impact-bn7y1h branch from 1cdb6c0 to 2dbac20 Compare June 30, 2026 01:54

hmchangw wants to merge 2 commits into main from claude/dependency-instability-impact-bn7y1h
