Durable federation relay for room-service cross-site events #410

Open
hmchangw wants to merge 2 commits into main from claude/dependency-instability-impact-bn7y1h

Conversation

@hmchangw

@hmchangw hmchangw commented Jun 28, 2026

Owner

Summary

This branch began as a research task on the impact of dependency instability (NATS/JetStream, MongoDB, Cassandra, Valkey) and now implements a fix for the top exposure that research surfaced.

The problem: six room-service request/reply handlers federated cross-site events by publishing a model.InboxEvent inline straight to a remote site's INBOX (chat.inbox.{dest}.external.{type}). That publish crosses a NATS supercluster gateway; if it fails, the error returns to the client after the local Mongo write already committed — so local and remote silently diverge with no durable retry.

The fix — a durable "federation relay": each handler keeps its synchronous Mongo write and response, but now builds the same InboxEvent bytes, wraps them in a RoomFederationEvent, and publishes one envelope to the local ROOMS stream (chat.room.canonical.{siteID}.federation). room-worker forwards each wrapped event to its destination INBOX with at-least-once retry — the source stream is the outbox. No new stream, no new service.
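
To make the envelope concrete, here is a minimal sketch of the wrap step. The field names follow the ones this PR mentions (destSiteID, eventType, envelope, dedupId), but the JSON tags, the `targets` key, the `[]byte` envelope type, and the helper names `wrap`, `unwrap`, and `survivesRelay` are assumptions for illustration, not the code in pkg/model or room-service.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// FederationTarget is one destination for an already-marshaled InboxEvent.
// JSON tags are assumed; only the field meanings come from the PR text.
type FederationTarget struct {
	DestSiteID string `json:"destSiteID"`
	EventType  string `json:"eventType"`
	Envelope   []byte `json:"envelope"` // InboxEvent bytes, carried opaquely (base64 in JSON)
	DedupID    string `json:"dedupId"`
}

// RoomFederationEvent is the single envelope published to the local ROOMS stream.
type RoomFederationEvent struct {
	Targets []FederationTarget `json:"targets"`
}

// roomCanonicalFederation mirrors the subject shape chat.room.canonical.{siteID}.federation.
func roomCanonicalFederation(siteID string) string {
	return "chat.room.canonical." + siteID + ".federation"
}

// inboxExternal mirrors the destination shape chat.inbox.{dest}.external.{type}.
func inboxExternal(destSiteID, eventType string) string {
	return "chat.inbox." + destSiteID + ".external." + eventType
}

// wrap builds the relay envelope a handler would publish locally.
func wrap(targets ...FederationTarget) ([]byte, error) {
	return json.Marshal(RoomFederationEvent{Targets: targets})
}

// unwrap decodes the relay envelope on the forwarding side.
func unwrap(data []byte) (RoomFederationEvent, error) {
	var evt RoomFederationEvent
	err := json.Unmarshal(data, &evt)
	return evt, err
}

// survivesRelay reports whether the InboxEvent bytes come back unchanged
// after one wrap/unwrap round trip.
func survivesRelay(envelope []byte) bool {
	data, err := wrap(FederationTarget{
		DestSiteID: "site-b",
		EventType:  "role_updated",
		Envelope:   envelope,
		DedupID:    "d1",
	})
	if err != nil {
		return false
	}
	evt, err := unwrap(data)
	if err != nil || len(evt.Targets) != 1 {
		return false
	}
	return bytes.Equal(evt.Targets[0].Envelope, envelope)
}

func main() {
	inbox := []byte(`{"type": "role_updated", "note": "<kept & unescaped>"}`)
	fmt.Println(roomCanonicalFederation("site-a"))
	fmt.Println(inboxExternal("site-b", "role_updated"))
	fmt.Println(survivesRelay(inbox))
}
```

Carrying the envelope as `[]byte` is one way to keep the wrapped bytes exactly as marshaled; a `json.RawMessage` field would be compacted and HTML-escaped by `json.Marshal`, which could alter them.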

What changed

  • pkg/model: RoomFederationEvent + FederationTarget envelope types.
  • pkg/subject: RoomCanonicalFederation(siteID) builder.
  • room-service: federate + buildFederationTarget helpers; six handlers converted: updateRole, muteToggle, favoriteToggle, messageRead, messageThreadRead, roomRestricted.
  • room-worker: processFederation forwards each target to chat.inbox.{dest}.external.{type} (transient error → Nak/redeliver, malformed → Ack-poison), validating destSiteID/eventType/envelope/dedupId at the boundary with a 3s per-attempt fail-fast timeout. Runs on its own durable consumer + worker pool (room-worker-federation), isolated from the membership consumer.
  • docs/client-api.md — cross-site federation note for all six RPCs.
  • Tests — forwarder, both consumer configs, all six handlers, a model round-trip, and an end-to-end JetStream integration round-trip.
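
The forwarding rules in the room-worker bullet can be sketched as follows. This is an illustration under assumptions, not the actual processFederation: the `target` and `event` types, the `publishFunc` signature, the string outcomes, and the choice to skip (rather than poison) a single invalid target are all hypothetical.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// target and event are hypothetical stand-ins for FederationTarget and RoomFederationEvent.
type target struct {
	DestSiteID string `json:"destSiteID"`
	EventType  string `json:"eventType"`
	Envelope   []byte `json:"envelope"`
	DedupID    string `json:"dedupId"`
}

type event struct {
	Targets []target `json:"targets"`
}

// publishFunc is an assumed abstraction over the JetStream publish call.
type publishFunc func(ctx context.Context, subject string, data []byte, dedupID string) error

// forwardTimeout bounds each forward attempt, matching the 3s fail-fast timeout.
const forwardTimeout = 3 * time.Second

// validate rejects targets missing any of the four boundary-checked fields.
func validate(t target) error {
	if t.DestSiteID == "" || t.EventType == "" || t.DedupID == "" || len(t.Envelope) == 0 {
		return errors.New("federation target missing destSiteID, eventType, envelope, or dedupId")
	}
	return nil
}

// forward returns "ack-poison" for an undecodable message, "nak" when any
// publish fails (so the whole message is redelivered), and "ack" otherwise.
func forward(ctx context.Context, raw []byte, publish publishFunc) string {
	var evt event
	if err := json.Unmarshal(raw, &evt); err != nil {
		return "ack-poison" // retrying can never fix malformed bytes
	}
	for _, t := range evt.Targets {
		if err := validate(t); err != nil {
			continue // assumption: skip the bad target, keep forwarding the rest
		}
		attemptCtx, cancel := context.WithTimeout(ctx, forwardTimeout)
		err := publish(attemptCtx, "chat.inbox."+t.DestSiteID+".external."+t.EventType, t.Envelope, t.DedupID)
		cancel()
		if err != nil {
			return "nak" // transient failure: let the stream redeliver
		}
	}
	return "ack"
}

// demo runs forward against a fake publisher that either succeeds or fails.
func demo(raw string, failPublish bool) string {
	publish := func(ctx context.Context, subject string, data []byte, dedupID string) error {
		if failPublish {
			return errors.New("gateway unreachable")
		}
		return nil
	}
	return forward(context.Background(), []byte(raw), publish)
}

func main() {
	valid := `{"targets":[{"destSiteID":"site-b","eventType":"role_updated","envelope":"e30=","dedupId":"d1"}]}`
	fmt.Println(demo(valid, false), demo(valid, true), demo("not json", false))
}
```

A Nak redelivers the whole envelope, so targets that already succeeded are published again; the stable dedup ID and the idempotent destination handlers are what make that repeat harmless.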

Behavior under a destination-site outage (the design goal)

| Remote unreachable for | Before (inline publish) | This PR |
|---|---|---|
| 10 s | client error, event lost | RPC succeeds; forward retried, delivered in seconds |
| 3 min | client error, event lost | RPC succeeds; delivered when remote recovers |
| 1 hour | client error, event lost | RPC succeeds; delivered when remote recovers |
| 1 day | client error, event lost | RPC succeeds; delivered when remote recovers (bounded by ROOMS retention) |

The producer publishes only to the local ROOMS stream, so a remote outage never blocks or errors the user's RPC. The federation consumer retries a failed forward forever with escalating backoff (5s → 15s → 1m → 5m, MaxDeliver=-1), so a long outage delays the event rather than dropping it.
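
As a rough illustration of what that schedule means for the outage durations in the table, the sketch below models the 5s → 15s → 1m → 5m steps as a plain lookup. It assumes the delay stays at the last step (5m) once the list is exhausted, and it says nothing about how the real consumer applies the schedule; `redeliveryDelay` and `redeliveriesWithin` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// federationBackoff is the escalating schedule described above.
var federationBackoff = []time.Duration{
	5 * time.Second,
	15 * time.Second,
	time.Minute,
	5 * time.Minute,
}

// redeliveryDelay returns the wait before the n-th redelivery (1-based).
// Assumption: past the end of the schedule the last step keeps repeating,
// which is what an unbounded retry (MaxDeliver=-1) needs.
func redeliveryDelay(n int) time.Duration {
	if n < 1 {
		n = 1
	}
	if n > len(federationBackoff) {
		return federationBackoff[len(federationBackoff)-1]
	}
	return federationBackoff[n-1]
}

// redeliveriesWithin counts how many redeliveries fit inside an outage window.
func redeliveriesWithin(outage time.Duration) int {
	var elapsed time.Duration
	n := 0
	for {
		d := redeliveryDelay(n + 1)
		if elapsed+d > outage {
			return n
		}
		elapsed += d
		n++
	}
}

func main() {
	for _, outage := range []time.Duration{10 * time.Second, 3 * time.Minute, time.Hour, 24 * time.Hour} {
		fmt.Printf("%v outage: about %d redeliveries\n", outage, redeliveriesWithin(outage))
	}
}
```

Under these assumptions a one-day outage costs a few hundred redelivery attempts per pending event, which is why the long final step matters more than the short early ones.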

Design notes

  • Reuses ROOMS + room-worker — no new stream or service. The .federation subject is not matched by notification-worker's exact ...event.member.muted filter.
  • Two isolated lanes on one stream: the membership consumer (FilterSubjects = create/member.add/member.remove/room.rename, default MaxDeliver=5) and the federation consumer (.federation, MaxDeliver=-1 + backoff) have separate worker pools, so an unreachable destination backs up only the federation lane — never local membership processing (member add/remove/create/rename).
  • Wire format is byte-identical: room-service still marshals the same InboxEvent; room-worker forwards those exact bytes, so the destination inbox-worker handlers are unchanged.
  • Redelivery is safe — every destination handler is idempotent via high-water-mark $lt guards (lastSeenAt, muteUpdatedAt, favoriteUpdatedAt, rolesUpdatedAt, visibilityUpdatedAt); the stable DedupID plus those guards make a re-forward (including a timed-out-but-actually-delivered publish) a no-op.
  • Behavior changes (intentional): (1) federation is now asynchronous — a cross-site gateway hiccup no longer fails the user's local RPC; (2) the five toggle/read/role events now forward with a stable dedup ID (previously empty), an idempotency improvement with no change to the event bytes.
  • user_status_updated is deliberately left untouched (best-effort by design, owned by user-service).
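
The redelivery-safety argument rests on the high-water-mark guard, which the following in-memory sketch tries to illustrate. The `subscription` struct and `applyMute` are hypothetical stand-ins for a conditional Mongo update filtered on muteUpdatedAt with `$lt`; only the field name comes from this PR.

```go
package main

import "fmt"

// subscription is a hypothetical in-memory stand-in for the stored document.
type subscription struct {
	Muted         bool
	MuteUpdatedAt int64 // high-water mark, milliseconds
}

// applyMute applies the event only when the stored mark is strictly older
// than the event timestamp, mirroring a filter like {muteUpdatedAt: {$lt: ts}}.
// It reports whether anything changed.
func applyMute(s *subscription, muted bool, ts int64) bool {
	if s.MuteUpdatedAt >= ts {
		return false // duplicate or stale delivery: no-op
	}
	s.Muted = muted
	s.MuteUpdatedAt = ts
	return true
}

func main() {
	s := &subscription{}
	fmt.Println(applyMute(s, true, 100)) // first delivery
	fmt.Println(applyMute(s, true, 100)) // re-forward of the same event
	fmt.Println(applyMute(s, false, 90)) // older event arriving late
	fmt.Println(s.Muted, s.MuteUpdatedAt)
}
```

Because the comparison is strict, a re-forward carrying the same timestamp changes nothing, which covers the timed-out-but-actually-delivered case.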

Verification

  • make test (full suite, race detector): PASS
  • make lint: 0 issues; make sast gosec: PASS; no store-interface or mock changes (make generate is a no-op)
  • The room-worker integration test creates the federation consumer via buildFederationConsumerConfig and exercises the round-trip, so CI validates that nats-server accepts MaxDeliver=-1 + BackOff.

Ops notes

  • The existing room-worker durable's FilterSubjects is narrowed on deploy (supported in-place on nats-server 2.10+); the new room-worker-federation durable is self-created at startup. No stream/IaC change.
  • The relay's durability assumes the source ROOMS stream survives a node loss and retains messages for at least the longest tolerated destination outage — confirm ROOMS (and MESSAGES_CANONICAL/INBOX) are provisioned R3 + file storage with adequate MaxAge in IaC.

Also included (working docs)

  • docs/research/dependency-instability-impact.md — the dependency-instability research report.
  • docs/superpowers/plans/2026-06-28-room-federation-relay.md — the implementation plan this branch executed. Happy to drop the plan doc if you'd prefer it not ship.

Test plan

  • CI green (unit, lint, sast, and the room-worker integration test)
  • Confirm ROOMS (and MESSAGES_CANONICAL/INBOX) are R3 + file storage with adequate retention in IaC
  • On deploy, confirm the room-worker durable filter update and room-worker-federation durable creation succeed against the running nats-server

🤖 Generated with Claude Code

https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj

Summary by CodeRabbit

  • New Features
    • Introduced a durable, asynchronous federation relay for selected room-service cross-site events (role updates, message/thread reads, mute/favorite toggles, and room restrictions).
  • Bug Fixes
    • Improved cross-site delivery reliability by forwarding via JetStream-backed relay with at-least-once retry semantics and deduplication.
  • Documentation
    • Updated client API documentation with clarified federation coordination behavior.
    • Added new research note on dependency instability impacts and a relay implementation plan.
  • Tests
    • Added/updated unit and end-to-end tests to validate relay wrapping, forwarding, error handling, and round-trip correctness.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Warning

Review limit reached

@hmchangw, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 52 minutes


Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d188a1c-a478-4dfb-b362-2ccc347e6708

📥 Commits

Reviewing files that changed from the base of the PR and between 1cdb6c0 and 2dbac20.

📒 Files selected for processing (12)
  • docs/client-api.md
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/subject/subject.go
  • pkg/subject/subject_test.go
  • room-service/handler.go
  • room-service/handler_test.go
  • room-worker/consumer_config_test.go
  • room-worker/handler.go
  • room-worker/handler_test.go
  • room-worker/integration_test.go
  • room-worker/main.go
📝 Walkthrough

Walkthrough

The PR adds a ROOMS-stream federation relay for cross-site room-service events, updates room-worker to consume and forward relay events, refreshes client-facing docs and implementation notes, and adds a research document on dependency instability.

Changes

Room Federation Relay

| Layer / File(s) | Summary |
|---|---|
| Federation model and subject: pkg/model/event.go, pkg/model/model_test.go, pkg/subject/subject.go, pkg/subject/subject_test.go | FederationTarget and RoomFederationEvent are added, with round-trip JSON coverage and a new RoomCanonicalFederation(siteID) subject builder. |
| room-service federation publish path: room-service/handler.go, room-service/handler_test.go | buildFederationTarget and Handler.federate are added, and cross-site handlers are switched to publish RoomFederationEvent envelopes instead of direct external inbox publishes. Tests are updated to decode relay envelopes and embedded inbox payloads. |
| room-worker federation lane: room-worker/handler.go, room-worker/main.go, room-worker/consumer_config_test.go, room-worker/handler_test.go, room-worker/integration_test.go | HandleJetStreamMsg now dispatches .federation subjects to processFederation, which forwards each FederationTarget to the destination inbox. main.go splits membership and federation consumers, and tests cover config, forwarding, error handling, and an embedded JetStream round trip. |
| Client API and implementation plan: docs/client-api.md, docs/superpowers/plans/2026-06-28-room-federation-relay.md | The client API docs now describe ROOMS-stream federation for the affected RPCs, and the implementation plan documents the relay design and rollout tasks. |

Dependency Instability Research

| Layer / File(s) | Summary |
|---|---|
| Dependency instability research document: docs/research/dependency-instability-impact.md | Adds a research note covering failure modes, release stability, operational reliability, recommendations, caveats, and sources for NATS/JetStream, MongoDB, Cassandra, and Valkey. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hmchangw/chat#168: Both PRs touch room-worker consumer configuration; this PR extends it with a separate federation lane and updated filter subjects.
  • hmchangw/chat#217: This PR’s federation relay path covers subscription_mute_toggled, which is directly related to the mute-toggle event flow introduced there.
  • hmchangw/chat#342: Both PRs modify room-service/handler.go’s messageRead flow and its related cross-site read event handling.

Suggested labels

ready

Suggested reviewers

  • Joey0538
  • mliu33

🐇 Hops through ROOMS, the relay is awake,
One envelope sent for each site to take.
DedupIDs twinkle, the inboxes hum,
And federated bunnies keep messages glum-free? No—fun!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.95% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a durable federation relay for room-service cross-site events. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/dependency-instability-impact-bn7y1h


@hmchangw hmchangw changed the title from "docs: add dependency-instability impact research report" to "Durable federation relay for room-service cross-site events" Jun 28, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
docs/research/dependency-instability-impact.md (2)

54-60: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add blank lines around table.

Surround the table with blank lines to satisfy markdownlint MD058.

  | Cross-site event | Publisher | Origin context | Implicit outbox? |
  |---|---|---|---|
+
  | message persist / thread-subscription | `message-worker` | consumes `MESSAGES_CANONICAL` (JS) | ✅ yes (Nak → redeliver) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/research/dependency-instability-impact.md` around lines 54 - 60, The
Markdown table in dependency-instability-impact.md needs blank lines before and
after it to satisfy MD058. Update the surrounding prose near the cross-site
event table so the table is separated from adjacent text by empty lines, keeping
the existing table content unchanged.

52-52: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Fix heading hierarchy.

The h4 heading "Federation publisher map" follows an h2 without an intervening h3. Change to h3 to satisfy markdownlint MD001.

- #### Federation publisher map (who has an outbox)
+ ### Federation publisher map (who has an outbox)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/research/dependency-instability-impact.md` at line 52, The heading
hierarchy is inconsistent because the “Federation publisher map” section is
using a lower-level heading directly under an h2 without an intervening h3.
Update that heading in the markdown so it uses h3 instead of h4, keeping the
surrounding section structure in the dependency-instability document aligned
with markdownlint MD001.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@room-service/handler.go`:
- Around line 751-752: The new federation-target failure paths in the
room-service handler are returning bare errors, so update the affected handlers
to wrap `buildFederationTarget` failures with handler-specific context instead
of returning `err` directly. Add descriptive context that identifies the
federation target/action that failed for each path, following the existing
pattern in the room handler methods that cover `subscription_read`,
`thread_read`, `room_restricted`, `mute-toggled`, and `favorite-toggled`, so
logs clearly show which target caused the error.
- Around line 1977-1986: The room-restricted InboxEvent envelope is generating a
second timestamp that can drift from the payload and origin write time. In the
room service handler, update the federation flow around buildFederationTarget
and the InboxEvent creation for room_restricted so the outer envelope uses
req.Timestamp instead of calling time.Now().UTC().UnixMilli(). Keep the shared
timestamp convention consistent with RoomRestrictedInboxPayload.Timestamp and
the high-water-mark guard logic.

In `@room-worker/handler_test.go`:
- Around line 5361-5365: The test setup in the RoomFederationEvent fixture is
ignoring json.Marshal errors, which can hide broken test data and mislead
failure classification. In the handler_test.go setup around the model.InboxEvent
and model.RoomFederationEvent marshals, capture both errors and assert them with
require.NoError so fixture creation fails loudly. Use the existing test helpers
and keep the marshaling logic intact while removing the silent discard of
errors.

In `@room-worker/handler.go`:
- Around line 324-333: Reject federation targets missing eventType or dedupId
before calling h.publish in handler.go. Update the validation in the evt.Targets
loop to treat empty t.EventType and t.DedupID as invalid alongside the existing
DestSiteID and Envelope checks, and log the skip with enough context to identify
the bad target. This keeps subject.InboxExternal and h.publish from receiving
malformed inputs that would bypass the durable NATS path or route to the wrong
subject.

In `@room-worker/integration_test.go`:
- Around line 2031-2065: Create the destination INBOX stream before exercising
the federation publish path. In the integration test around the
`stream.Rooms(siteID)` setup and `processFederation` publish closure, also
create the `stream.Inbox(destSiteID)` JetStream stream so
`subject.InboxExternal(destSiteID, ...)` has a matching destination. Keep the
existing `js.CreateOrUpdateStream` pattern and ensure the stream is bound to the
inbox subject family used by the `publish`/`js.PublishMsg` path.

In `@room-worker/main.go`:
- Around line 179-190: Fail fast on invalid worker configuration by validating
cfg.MaxWorkers before startConsumer uses it. Add an early check in the main
startup/config path so MAX_WORKERS must be greater than zero, and return a clear
error instead of continuing into PullMaxMessages or make(chan struct{},
cfg.MaxWorkers). Use the existing cfg.MaxWorkers and startConsumer path to
locate the fix.

---

Nitpick comments:
In `@docs/research/dependency-instability-impact.md`:
- Around line 54-60: The Markdown table in dependency-instability-impact.md
needs blank lines before and after it to satisfy MD058. Update the surrounding
prose near the cross-site event table so the table is separated from adjacent
text by empty lines, keeping the existing table content unchanged.
- Line 52: The heading hierarchy is inconsistent because the “Federation
publisher map” section is using a lower-level heading directly under an h2
without an intervening h3. Update that heading in the markdown so it uses h3
instead of h4, keeping the surrounding section structure in the
dependency-instability document aligned with markdownlint MD001.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f556c96-12cf-426d-a915-eb150107d0fa

📥 Commits

Reviewing files that changed from the base of the PR and between ea1db13 and f885310.

📒 Files selected for processing (14)
  • docs/client-api.md
  • docs/research/dependency-instability-impact.md
  • docs/superpowers/plans/2026-06-28-room-federation-relay.md
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/subject/subject.go
  • pkg/subject/subject_test.go
  • room-service/handler.go
  • room-service/handler_test.go
  • room-worker/consumer_config_test.go
  • room-worker/handler.go
  • room-worker/handler_test.go
  • room-worker/integration_test.go
  • room-worker/main.go

Comment thread room-service/handler.go Outdated
Comment thread room-service/handler.go
Comment thread room-worker/handler_test.go Outdated
Comment thread room-worker/handler.go
Comment thread room-worker/integration_test.go
Comment thread room-worker/main.go
@hmchangw hmchangw force-pushed the claude/dependency-instability-impact-bn7y1h branch 4 times, most recently from 52c8a74 to 08e9aff Compare June 30, 2026 00:54
Research the failure-mode impact, project/release stability, and
operational-reliability data for the four core infra dependencies
(NATS/JetStream, MongoDB, Cassandra, Valkey), identifying the
request/reply-originated cross-site federation publish as the top
durability exposure, plus the implementation plan executed in the
following commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj
@hmchangw hmchangw force-pushed the claude/dependency-instability-impact-bn7y1h branch from 08e9aff to 1cdb6c0 Compare June 30, 2026 01:46
…te events

Six room-service request/reply handlers (role_updated, mute/favorite
toggled, subscription_read, thread_read, room_restricted) federated
cross-site events by publishing an InboxEvent inline straight to a remote
site's INBOX across a supercluster gateway. On failure the error returned
to the client *after* the local Mongo write committed, so local and remote
diverged with no durable retry.

Replace this with a durable "federation relay": each handler keeps its
synchronous Mongo write and reply but publishes one RoomFederationEvent to
the local ROOMS stream; room-worker forwards each wrapped InboxEvent to the
destination INBOX with at-least-once retry — the source stream is the
outbox. The producer publish is local-cluster only, so a remote outage can
never block the user's RPC, and a destination-site outage delays the event
(retry-forever with escalating backoff) rather than dropping it.

- pkg/model: RoomFederationEvent + FederationTarget envelope types.
- pkg/subject: RoomCanonicalFederation builder
  (chat.room.canonical.{siteID}.federation).
- room-service: federate + buildFederationTarget helpers; six handlers
  converted. Wire format is byte-identical to the prior direct publishes,
  so inbox-worker is unchanged.
- room-worker: processFederation forwards each target (transient error ->
  Nak/redeliver, malformed -> Ack-poison), validating destSiteID/eventType/
  envelope/dedupId at the boundary, each attempt bounded by a 3s fail-fast
  timeout. It runs on a dedicated durable consumer + worker pool, isolated
  from the membership consumer (filtered to create/member.add/member.remove/
  room.rename), so an unreachable destination backs up only the federation
  lane, never local membership processing. The federation lane retries a
  failed forward forever with escalating backoff (5s -> 5m, MaxDeliver=-1),
  so a long destination outage delays — never drops — the event. Fails fast
  on non-positive MAX_WORKERS.
- docs/client-api.md: cross-site federation note for all six RPCs.
- Tests: forwarder, the two consumer configs, all six handlers (relay
  envelope + byte-identical wrapped InboxEvent), a model round-trip, and an
  end-to-end JetStream integration round-trip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WcNmcyHTmyokFh9vYm3brj
@hmchangw hmchangw force-pushed the claude/dependency-instability-impact-bn7y1h branch from 1cdb6c0 to 2dbac20 Compare June 30, 2026 01:54