message-worker: cut per-reply DB work on the thread-reply hot path by hmchangw · Pull Request #424 · hmchangw/chat

hmchangw · 2026-06-30T06:44:16Z

Context

Investigating high MongoDB CPU under the thread max-rps load test. This PR removes work that message-worker does on every subsequent thread reply. It's three focused reductions on the thread-reply hot path.

What changed

1. thread_subscriptions writes: 3 → 1 per reply

- Drop the parent-author subscription re-upsert on subsequent replies (it's created once, on the first reply), along with its owner-site FindUserByID read.
- Fold the replier's own lastSeenAt $max into the replier subscription upsert via a new UpsertThreadSubscriptionAdvancingLastSeen ($setOnInsert + $max in one write, non-conflicting because lastSeenAt is owned solely by $max).
- The standalone AdvanceThreadSubscriptionLastSeen now runs only on paths that write no replier sub (migration, self-reply, system message).

2. Thread-room resolution: 1 round trip (both first and subsequent replies)

- Replace the create-first pattern (CreateThreadRoom → duplicate-key → GetThreadRoomByParentMessageID) with an upserting EnsureThreadRoom (FindOneAndUpdate with $setOnInsert + ReturnDocument:After).
- One round trip, no failed unique-index insert on the hot path; the caller distinguishes first-vs-subsequent by comparing the returned _id to the candidate's. A rare concurrent-first-reply dup-key is resolved with a single read.
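
The first-vs-subsequent check can be sketched with an in-memory stand-in for the upsert. The `room` type, `ensureRoom`, and `isFirstReply` are hypothetical names, and the sketch assumes each call builds a candidate with a freshly generated ID; the real EnsureThreadRoom performs the equivalent logic in one FindOneAndUpdate.

```go
package main

import "fmt"

// room is a hypothetical, trimmed-down thread room document.
type room struct {
	ID              string
	ParentMessageID string
}

// ensureRoom models an upsert keyed on the parent message ID that returns
// the stored document: the candidate if it was just inserted, otherwise
// the room that already existed.
func ensureRoom(byParent map[string]room, candidate room) room {
	if existing, ok := byParent[candidate.ParentMessageID]; ok {
		return existing
	}
	byParent[candidate.ParentMessageID] = candidate
	return candidate
}

// isFirstReply reports whether this call created the room: the stored
// document carries the candidate's ID only when the insert branch ran.
func isFirstReply(candidate, stored room) bool {
	return stored.ID == candidate.ID
}

func main() {
	rooms := map[string]room{}
	a := room{ID: "room-a", ParentMessageID: "msg-1"}
	b := room{ID: "room-b", ParentMessageID: "msg-1"}

	fmt.Println(isFirstReply(a, ensureRoom(rooms, a))) // prints "true"
	fmt.Println(isFirstReply(b, ensureRoom(rooms, b))) // prints "false"
}
```

Because the existing document is returned unchanged, a subsequent reply learns the real room ID in the same call that would have created it.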

3. Stop re-stamping the parent's thread_room_id on every subsequent reply

It's immutable and already stamped once, on the first reply (handleFirstThreadReply). Removes one Cassandra write per subsequent reply.

Trade-offs (intentional)

Narrow crash windows, not worth a per-reply write to self-heal: if a first reply creates the room but crashes before writing the parent subscription / parent stamp, those stay unset for that thread. Same class of trade-off across all three changes.

Important finding — this does not move max RPS on its own

Profiling during the load test showed the thread max-rps ceiling (~500–600, with a growing consumer backlog and E1/E2 P95 creeping over SLO) is bounded by worker concurrency, not DB CPU: MAX_WORKERS and the Mongo connection pool both default to 100, while every resource sits at ≤70% utilization. Drain rate ≈ MAX_WORKERS / per-message latency ≈ 100 / 200 ms ≈ 500/s, so the backlog grows at ~600/s offered.

These changes lower per-reply Mongo/Cassandra load, which raises the DB ceiling once concurrency is unblocked — but the lever that actually moves the number is raising MAX_WORKERS and maxPoolSize together (they must move together, or the pool re-caps at 100). That's a config change, tracked separately.
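
Why the two settings have to move together can be sketched as a simple minimum. This is a simplified model that assumes each in-flight message holds at most one Mongo connection at a time; `effectiveConcurrency` is a hypothetical helper for illustration.

```go
package main

import "fmt"

// effectiveConcurrency models how many messages can be doing Mongo work at
// once: a worker without a free pooled connection has to wait, so the
// smaller of the two limits wins.
func effectiveConcurrency(maxWorkers, maxPoolSize int) int {
	if maxPoolSize < maxWorkers {
		return maxPoolSize
	}
	return maxWorkers
}

func main() {
	fmt.Println(effectiveConcurrency(100, 100)) // defaults: 100
	fmt.Println(effectiveConcurrency(300, 100)) // workers raised alone: still 100
	fmt.Println(effectiveConcurrency(300, 300)) // both raised: 300
}
```

Under this model, raising MAX_WORKERS alone leaves the extra workers queued behind the connection pool.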

Scope & testing

Contained to message-worker (a MESSAGES_CANONICAL JetStream consumer — not a chat.user.* client-facing handler, so no docs/client-api.md change).
make test (full suite, race) green; make lint clean; make sast-gosec clean; integration build compiles (EnsureThreadRoom + combined-upsert integration tests added).
Rebased onto main (linear history, no merge commit).

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Improved thread reply handling to ensure thread rooms are created/reused and to update read state more precisely when replies arrive.
Bug Fixes
- Prevented subscription “last seen” timestamps from regressing or being redundantly advanced.
- Improved reliability for thread replies when the parent message is missing and when threads are migrated.
- Adjusted subsequent-reply subscription updates to avoid unnecessary parent-related writes.
Tests
- Updated unit and Mongo integration tests to reflect the new thread-room ensuring and combined “upsert + advance” last-seen behavior.

coderabbitai · 2026-06-30T06:44:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb24b771-e2bd-44a1-a7fc-bdc3fd1606c8

📥 Commits

Reviewing files that changed from the base of the PR and between 9ce5c8c and 98e8b52.

📒 Files selected for processing (6)

message-worker/handler.go
message-worker/handler_test.go
message-worker/integration_test.go
message-worker/mock_store_test.go
message-worker/store.go
message-worker/store_mongo.go

✅ Files skipped from review due to trivial changes (1)

message-worker/mock_store_test.go

🚧 Files skipped from review as they are similar to previous changes (4)

message-worker/store.go
message-worker/integration_test.go
message-worker/store_mongo.go
message-worker/handler_test.go

📝 Walkthrough

Walkthrough

The PR replaces split thread-room lookup/create operations with EnsureThreadRoom, adds UpsertThreadSubscriptionAdvancingLastSeen, and updates thread-reply handling to track whether lastSeenAt was already advanced. Tests and mocks were updated to match the new writes and return values.

Changes

Thread room ensure and advancing subscription upsert

Layer / File(s)	Summary
ThreadStore interface and Mongo writes `message-worker/store.go`, `message-worker/store_mongo.go`, `message-worker/mock_store_test.go`, `message-worker/integration_test.go`	The thread-store contract changes to `EnsureThreadRoom` and `UpsertThreadSubscriptionAdvancingLastSeen`, and the Mongo implementation and mocks follow the new ensure/upsert behavior. Integration tests cover room idempotency, last-message updates, and monotonic `lastSeenAt` advancement.
Thread reply flow uses replierLastSeenAdvanced `message-worker/handler.go`	`processMessage`, `handleThreadRoomAndSubscriptions`, and `handleSubsequentThreadReply` now propagate `replierLastSeenAdvanced`, use `EnsureThreadRoom`, and only run the standalone last-seen advance when the combined subscription write did not already advance it.

Handler and reply-flow tests

Layer / File(s)	Summary
Handler and reply-flow tests `message-worker/handler_test.go`	The handler tests switch to `EnsureThreadRoom`, expect `UpsertThreadSubscriptionAdvancingLastSeen` on the hot path, and update return-arity and suppression assertions for subsequent replies and migration paths.

Estimated code review effort: 4 (Complex) | ~45 minutes

Possibly related PRs

hmchangw/chat#398: Shares the same thread-reply path and AdvanceThreadSubscriptionLastSeen behavior that this PR folds into the new combined upsert.
hmchangw/chat#95: Touches handleThreadRoomAndSubscriptions and subsequent-reply subscription handling in the same area of code.
hmchangw/chat#107: Modifies nearby thread-reply subscription logic and handler tests in message-worker.

Suggested labels: ready

Suggested reviewers: mliu33

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: reducing database work on the thread-reply hot path in message-worker.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@message-worker/handler.go`:
- Around line 267-276: The CreateThreadRoom flow in handleFirstThreadReply
handling currently loses the redelivery self-heal path for a missing parent
subscription. Update the errThreadRoomExists branch in handler.go so
redeliveries still repair the parent-author subscription before returning from
handleSubsequentThreadReply, or otherwise reintroduce an idempotent parent
upsert in the first-reply/retry path. Make sure the recovery logic is preserved
for the thread reply handling around handleFirstThreadReply and
handleSubsequentThreadReply so a partially created thread can heal on retry.

In `@message-worker/integration_test.go`:
- Around line 1948-1980: The subtests in the integration test are sharing the
same persisted subscription state, so later cases depend on mutations from
earlier ones. Make each t.Run case self-contained by seeding the document within
that subtest (or splitting them into separate top-level tests) before calling
UpsertThreadSubscriptionAdvancingLastSeen and asserting on read(), so the checks
for lastSeenAt and _id in the subscription test do not rely on execution order.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa1768c1-67b3-4649-84c1-50d6965e7c37

📥 Commits

Reviewing files that changed from the base of the PR and between aa8ca03 and 4605648.

📒 Files selected for processing (6)

message-worker/handler.go
message-worker/handler_test.go
message-worker/integration_test.go
message-worker/mock_store_test.go
message-worker/store.go
message-worker/store_mongo.go

coderabbitai · 2026-06-30T06:49:40Z

 	err := h.threadStore.CreateThreadRoom(ctx, &threadRoom)
 	switch {
 	case err == nil:
-		return threadRoom.ID, h.handleFirstThreadReply(ctx, msg, eventSiteID, threadRoom.ID, replier, now, isMigration)
+		// First reply is rare (once per thread); it advances the replier's lastSeenAt via the
+		// standalone $max in the caller, so it reports replierLastSeenAdvanced=false.
+		return threadRoom.ID, false, h.handleFirstThreadReply(ctx, msg, eventSiteID, threadRoom.ID, replier, now, isMigration)
 	case errors.Is(err, errThreadRoomExists):
 		return h.handleSubsequentThreadReply(ctx, msg, eventSiteID, replier, now, isMigration)
 	default:
-		return "", fmt.Errorf("create thread room: %w", err)
+		return "", false, fmt.Errorf("create thread room: %w", err)


🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

This drops the only retry path that repairs a partially created parent subscription.

If CreateThreadRoom succeeds and any later step in handleFirstThreadReply fails, redelivery comes back through the errThreadRoomExists branch into handleSubsequentThreadReply. This code now permanently skips the parent-author upsert, so the room can recover and persist replies while the parent's thread_subscriptions row stays missing forever. Please keep a self-heal path for the parent subscription on redelivery, even if the steady-state hot path stays at one write.

Also applies to: 354-361


While investigating high MongoDB CPU under the thread max-rps load test: three per-reply reductions on message-worker's thread-reply path, each removing work that runs on every subsequent reply.

1. thread_subscriptions writes 3 -> 1. Drop the parent-author subscription re-upsert on subsequent replies (it is created once, on the first reply) along with its owner-site lookup, and fold the replier's own lastSeenAt $max into the replier subscription upsert via a new UpsertThreadSubscriptionAdvancingLastSeen ($setOnInsert + $max in one write). The standalone AdvanceThreadSubscriptionLastSeen now runs only on paths that write no replier sub (migration, self-reply, system msg).

2. Resolve the thread room in one round trip. Replace the create-first pattern (CreateThreadRoom -> dup-key -> GetThreadRoomByParentMessageID) with an upserting EnsureThreadRoom (FindOneAndUpdate $setOnInsert, ReturnDocument:After). One round trip for both first and subsequent replies, with no failed unique-index insert; the caller distinguishes first vs subsequent by the returned _id.

3. Stop re-stamping the parent's thread_room_id on every subsequent reply. It is immutable and stamped once, on the first reply.

Trade-offs (narrow crash windows, not worth a per-reply write to self-heal): a first reply that creates the room but crashes before writing the parent subscription / parent stamp leaves those unset for that thread.

Note: profiling showed the thread max-rps ceiling is bounded by worker concurrency (MAX_WORKERS and the Mongo connection pool, both defaulting to 100) at <=70% resource utilization, not by these ops. These changes lower per-reply Mongo/Cassandra load, raising the DB ceiling once concurrency is unblocked; they do not by themselves move max RPS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VzvswJ23JQB1nYyskjBpi9

coderabbitai Bot reviewed Jun 30, 2026


hmchangw force-pushed the claude/mongodb-cpu-thread-maxrps-v4kx3b branch from 193b5cf to 98e8b52 Compare July 1, 2026 05:24

hmchangw changed the title ~~message-worker: cut per-thread-reply thread_subscriptions writes from 3 to 1~~ message-worker: cut per-reply DB work on the thread-reply hot path Jul 1, 2026

hmchangw wants to merge 1 commit into main from claude/mongodb-cpu-thread-maxrps-v4kx3b


2 participants
