message-worker: cut per-reply DB work on the thread-reply hot path#424

Open
hmchangw wants to merge 1 commit into main from claude/mongodb-cpu-thread-maxrps-v4kx3b

Conversation

@hmchangw hmchangw commented Jun 30, 2026

Owner

Context

This PR came out of investigating high MongoDB CPU under the thread max-rps load test. It removes work that message-worker does on every subsequent thread reply, in three focused reductions on the thread-reply hot path.

What changed

1. thread_subscriptions writes: 3 → 1 per reply

  • Drop the parent-author subscription re-upsert on subsequent replies — it's created once, on the first reply — along with its owner-site FindUserByID read.
  • Fold the replier's own lastSeenAt $max into the replier subscription upsert via a new UpsertThreadSubscriptionAdvancingLastSeen ($setOnInsert + $max in one write, non-conflicting because lastSeenAt is owned solely by $max).
  • The standalone AdvanceThreadSubscriptionLastSeen now runs only on paths that write no replier sub (migration, self-reply, system message).

2. Thread-room resolution: 1 round trip (both first and subsequent replies)

  • Replace the create-first pattern (CreateThreadRoom → duplicate-key → GetThreadRoomByParentMessageID) with an upserting EnsureThreadRoom (FindOneAndUpdate with $setOnInsert + ReturnDocument:After).
  • One round trip, no failed unique-index insert on the hot path; the caller distinguishes first-vs-subsequent by comparing the returned _id to the candidate's. A rare concurrent-first-reply dup-key is resolved with a single read.
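The ensure-then-compare pattern in change 2 might look like the following in-memory sketch. The `threadRoom` and `roomStore` types and the `isFirstReply` helper are hypothetical, and the sketch assumes each caller generates a fresh candidate ID; the actual store performs this as one FindOneAndUpdate against Mongo.

```go
package main

import (
	"fmt"
	"sync"
)

// threadRoom is a hypothetical, trimmed-down thread room document.
type threadRoom struct {
	ID              string
	ParentMessageID string
}

// roomStore stands in for the collection; the map key plays the role of the
// unique index on the parent message ID.
type roomStore struct {
	mu    sync.Mutex
	rooms map[string]threadRoom
}

// ensureThreadRoom models an upsert with $setOnInsert and ReturnDocument:After:
// insert the candidate if no room exists for the parent, then return whatever
// document is stored.
func (s *roomStore) ensureThreadRoom(candidate threadRoom) threadRoom {
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.rooms[candidate.ParentMessageID]; ok {
		return existing
	}
	s.rooms[candidate.ParentMessageID] = candidate
	return candidate
}

// isFirstReply reports whether this call created the room: the stored _id
// equals the candidate's only when the candidate was the one inserted.
func isFirstReply(candidate, stored threadRoom) bool {
	return stored.ID == candidate.ID
}

func main() {
	s := &roomStore{rooms: map[string]threadRoom{}}
	first := threadRoom{ID: "room-a", ParentMessageID: "msg-1"}
	second := threadRoom{ID: "room-b", ParentMessageID: "msg-1"}

	got1 := s.ensureThreadRoom(first)
	got2 := s.ensureThreadRoom(second)
	fmt.Println(isFirstReply(first, got1), isFirstReply(second, got2)) // prints "true false"
}
```

Because the mutex serializes callers here, the sketch does not reproduce the concurrent-first-reply duplicate-key case, which the PR resolves with a single follow-up read.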

3. Stop re-stamping the parent's thread_room_id on every subsequent reply

  • It's immutable and already stamped once, on the first reply (handleFirstThreadReply). Removes one Cassandra write per subsequent reply.

Trade-offs (intentional)

Narrow crash windows, not worth a per-reply write to self-heal: if a first reply creates the room but crashes before writing the parent subscription / parent stamp, those stay unset for that thread. Same class of trade-off across all three changes.

Important finding — this does not move max RPS on its own

Profiling during the load test showed the thread max-rps ceiling (~500–600, with a growing consumer backlog and E1/E2 P95 creeping over SLO) is bounded by worker concurrency, not DB CPU: MAX_WORKERS and the Mongo connection pool both default to 100, while every resource sits at ≤70%. Drain rate ≈ MAX_WORKERS / per-message latency ≈ 100 / 200 ms ≈ 500/s, so at ~600/s offered the backlog grows by roughly 100 messages per second.

These changes lower per-reply Mongo/Cassandra load, which raises the DB ceiling once concurrency is unblocked — but the lever that actually moves the number is raising MAX_WORKERS and maxPoolSize together (they must move together, or the pool re-caps at 100). That's a config change, tracked separately.

Scope & testing

  • Contained to message-worker (a MESSAGES_CANONICAL JetStream consumer — not a chat.user.* client-facing handler, so no docs/client-api.md change).
  • make test (full suite, race) green; make lint clean; make sast-gosec clean; integration build compiles (EnsureThreadRoom + combined-upsert integration tests added).
  • Rebased onto main (linear history, no merge commit).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved thread reply handling to ensure thread rooms are created/reused and to update read state more precisely when replies arrive.
  • Bug Fixes

    • Prevented subscription “last seen” timestamps from regressing or being redundantly advanced.
    • Improved reliability for thread replies when the parent message is missing and when threads are migrated.
    • Adjusted subsequent-reply subscription updates to avoid unnecessary parent-related writes.
  • Tests

    • Updated unit and Mongo integration tests to reflect the new thread-room ensuring and combined “upsert + advance” last-seen behavior.

coderabbitai Bot commented Jun 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb24b771-e2bd-44a1-a7fc-bdc3fd1606c8

📥 Commits

Reviewing files that changed from the base of the PR and between 9ce5c8c and 98e8b52.

📒 Files selected for processing (6)
  • message-worker/handler.go
  • message-worker/handler_test.go
  • message-worker/integration_test.go
  • message-worker/mock_store_test.go
  • message-worker/store.go
  • message-worker/store_mongo.go
✅ Files skipped from review due to trivial changes (1)
  • message-worker/mock_store_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • message-worker/store.go
  • message-worker/integration_test.go
  • message-worker/store_mongo.go
  • message-worker/handler_test.go

📝 Walkthrough

The PR replaces split thread-room lookup/create operations with EnsureThreadRoom, adds UpsertThreadSubscriptionAdvancingLastSeen, and updates thread-reply handling to track whether lastSeenAt was already advanced. Tests and mocks were updated to match the new writes and return values.

Changes

Thread room ensure and advancing subscription upsert

| Layer / File(s) | Summary |
| --- | --- |
| ThreadStore interface and Mongo writes: message-worker/store.go, message-worker/store_mongo.go, message-worker/mock_store_test.go, message-worker/integration_test.go | The thread-store contract changes to EnsureThreadRoom and UpsertThreadSubscriptionAdvancingLastSeen, and the Mongo implementation and mocks follow the new ensure/upsert behavior. Integration tests cover room idempotency, last-message updates, and monotonic lastSeenAt advancement. |
| Thread reply flow uses replierLastSeenAdvanced: message-worker/handler.go | processMessage, handleThreadRoomAndSubscriptions, and handleSubsequentThreadReply now propagate replierLastSeenAdvanced, use EnsureThreadRoom, and only run the standalone last-seen advance when the combined subscription write did not already advance it. |

Handler and reply-flow tests

| Layer / File(s) | Summary |
| --- | --- |
| Handler and reply-flow tests: message-worker/handler_test.go | The handler tests switch to EnsureThreadRoom, expect UpsertThreadSubscriptionAdvancingLastSeen on the hot path, and update return-arity and suppression assertions for subsequent replies and migration paths. |

Estimated code review effort: 4 (Complex) | ~45 minutes

Possibly related PRs

  • hmchangw/chat#398: Shares the same thread-reply path and AdvanceThreadSubscriptionLastSeen behavior that this PR folds into the new combined upsert.
  • hmchangw/chat#95: Touches handleThreadRoomAndSubscriptions and subsequent-reply subscription handling in the same area of code.
  • hmchangw/chat#107: Modifies nearby thread-reply subscription logic and handler tests in message-worker.

Suggested labels: ready

Suggested reviewers: mliu33

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: reducing database work on the thread-reply hot path in message-worker. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@message-worker/handler.go`:
- Around line 267-276: The CreateThreadRoom flow in handleFirstThreadReply
handling currently loses the redelivery self-heal path for a missing parent
subscription. Update the errThreadRoomExists branch in handler.go so
redeliveries still repair the parent-author subscription before returning from
handleSubsequentThreadReply, or otherwise reintroduce an idempotent parent
upsert in the first-reply/retry path. Make sure the recovery logic is preserved
for the thread reply handling around handleFirstThreadReply and
handleSubsequentThreadReply so a partially created thread can heal on retry.

In `@message-worker/integration_test.go`:
- Around line 1948-1980: The subtests in the integration test are sharing the
same persisted subscription state, so later cases depend on mutations from
earlier ones. Make each t.Run case self-contained by seeding the document within
that subtest (or splitting them into separate top-level tests) before calling
UpsertThreadSubscriptionAdvancingLastSeen and asserting on read(), so the checks
for lastSeenAt and _id in the subscription test do not rely on execution order.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa1768c1-67b3-4649-84c1-50d6965e7c37

📥 Commits

Reviewing files that changed from the base of the PR and between aa8ca03 and 4605648.

📒 Files selected for processing (6)
  • message-worker/handler.go
  • message-worker/handler_test.go
  • message-worker/integration_test.go
  • message-worker/mock_store_test.go
  • message-worker/store.go
  • message-worker/store_mongo.go

Comment thread message-worker/handler.go Outdated
Comment on lines +267 to +276
  err := h.threadStore.CreateThreadRoom(ctx, &threadRoom)
  switch {
  case err == nil:
-     return threadRoom.ID, h.handleFirstThreadReply(ctx, msg, eventSiteID, threadRoom.ID, replier, now, isMigration)
+     // First reply is rare (once per thread); it advances the replier's lastSeenAt via the
+     // standalone $max in the caller, so it reports replierLastSeenAdvanced=false.
+     return threadRoom.ID, false, h.handleFirstThreadReply(ctx, msg, eventSiteID, threadRoom.ID, replier, now, isMigration)
  case errors.Is(err, errThreadRoomExists):
      return h.handleSubsequentThreadReply(ctx, msg, eventSiteID, replier, now, isMigration)
  default:
-     return "", fmt.Errorf("create thread room: %w", err)
+     return "", false, fmt.Errorf("create thread room: %w", err)


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

This drops the only retry path that repairs a partially created parent subscription.

If CreateThreadRoom succeeds and any later step in handleFirstThreadReply fails, redelivery comes back through the errThreadRoomExists branch into handleSubsequentThreadReply. This code now permanently skips the parent-author upsert, so the room can recover and persist replies while the parent's thread_subscriptions row stays missing forever. Please keep a self-heal path for the parent subscription on redelivery, even if the steady-state hot path stays at one write.

Also applies to: 354-361
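One possible shape for such a self-heal path, offered only as a sketch and not as what the PR implements, is to gate the idempotent parent-author upsert on a redelivery signal so that first deliveries keep the single-write path. The `shouldRepairParentSub` helper and its `numDelivered` parameter are hypothetical; the sketch assumes the consumer exposes a per-message delivery counter that is 1 on first delivery.

```go
package main

import "fmt"

// shouldRepairParentSub decides whether a reply that found an existing thread
// room should also re-run the idempotent parent-author subscription upsert.
// numDelivered is assumed to be the consumer's per-message delivery counter
// (1 on the first delivery, higher on redelivery).
func shouldRepairParentSub(numDelivered uint64) bool {
	return numDelivered > 1
}

func main() {
	for _, n := range []uint64{1, 2, 5} {
		fmt.Printf("delivery %d: repair parent subscription = %t\n", n, shouldRepairParentSub(n))
	}
}
```

This only heals a thread when the message whose first attempt failed is itself redelivered; a thread left partially created after retries are exhausted would still need a different repair mechanism.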

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@message-worker/handler.go` around lines 267 - 276, The CreateThreadRoom flow
in handleFirstThreadReply handling currently loses the redelivery self-heal path
for a missing parent subscription. Update the errThreadRoomExists branch in
handler.go so redeliveries still repair the parent-author subscription before
returning from handleSubsequentThreadReply, or otherwise reintroduce an
idempotent parent upsert in the first-reply/retry path. Make sure the recovery
logic is preserved for the thread reply handling around handleFirstThreadReply
and handleSubsequentThreadReply so a partially created thread can heal on retry.

Comment thread message-worker/integration_test.go
This commit makes three per-reply reductions on message-worker's thread-reply path,
found while investigating high MongoDB CPU under the thread max-rps load test. Each
removes work that runs on every subsequent reply:

1. thread_subscriptions writes 3 -> 1. Drop the parent-author subscription re-upsert
   on subsequent replies (it is created once, on the first reply) along with its
   owner-site lookup, and fold the replier's own lastSeenAt $max into the replier
   subscription upsert via a new UpsertThreadSubscriptionAdvancingLastSeen
   ($setOnInsert + $max in one write). The standalone AdvanceThreadSubscriptionLastSeen
   now runs only on paths that write no replier sub (migration, self-reply, system msg).

2. Resolve the thread room in one round trip. Replace the create-first pattern
   (CreateThreadRoom -> dup-key -> GetThreadRoomByParentMessageID) with an upserting
   EnsureThreadRoom (FindOneAndUpdate $setOnInsert, ReturnDocument:After). One round
   trip for both first and subsequent replies, with no failed unique-index insert; the
   caller distinguishes first vs subsequent by the returned _id.

3. Stop re-stamping the parent's thread_room_id on every subsequent reply. It is
   immutable and stamped once, on the first reply.

Trade-offs (narrow crash windows, not worth a per-reply write to self-heal): a first
reply that creates the room but crashes before writing the parent subscription / parent
stamp leaves those unset for that thread.

Note: profiling showed the thread max-rps ceiling is bounded by worker concurrency
(MAX_WORKERS and the Mongo connection pool, both defaulting to 100) at <=70% resource
utilization — not by these ops. These changes lower per-reply Mongo/Cassandra load,
raising the DB ceiling once concurrency is unblocked; they do not by themselves move
max RPS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VzvswJ23JQB1nYyskjBpi9
@hmchangw hmchangw force-pushed the claude/mongodb-cpu-thread-maxrps-v4kx3b branch from 193b5cf to 98e8b52 Compare July 1, 2026 05:24
@hmchangw hmchangw changed the title message-worker: cut per-thread-reply thread_subscriptions writes from 3 to 1 message-worker: cut per-reply DB work on the thread-reply hot path Jul 1, 2026

2 participants