review(branch): performance, observability

claude · claude · commit 20f12c51c7d7 · 2026-06-23T09:41:03.000Z
diff --git a/docs/reviews/branch-claude-oplog-transformer-collections-2026-06-23-2.md b/docs/reviews/branch-claude-oplog-transformer-collections-2026-06-23-2.md
@@ -128,3 +128,32 @@ Strengths (no action): inbox-worker `handler_test.go` SubID-adoption (single-vs-
 - `nitpick` — `users.go:113-135`: user-status fan-out cannot loop/amplify — inbox-worker `handleUserStatusUpdated` writes Mongo (LWW), does not re-publish. Confirmed safe.
 
 Verified-correct: `Classify` ordering (poison/skip precede `isFinal`); `IsFinalDelivery` guards `maxDeliver<=0`; empty `parentMessage._id`/`username` + decode failures correctly Term (not Nak-storm); delete events get the shorter `DeleteMaxDeliver` cap; unknown-user `member_added` correctly Naks; no nil-deref/unchecked assertion; `mustMarshal` panics only on fixed-shape structs; no leaked secrets/tokens/bodies.
+
+---
+
+# Lens: Performance
+
+Scope is a one-time migration replay pump (single replica), so avoiding Nak-storms and silent stalls matters more than raw throughput.
+
+- `medium` — `users.go:113-135` (`publishUserStatus`): payload marshaled once (good), but `h.nowMillis()` is called per loop iteration for `evt.Timestamp`, and on any per-dest publish error the function returns → Naks the **whole** event → re-fans to *all* sites on redelivery, re-publishing to destinations that already succeeded. With N sites that's N duplicate publishes per transient blip. Idempotent (status LWW), so write-amplification not correctness; hoist `nowMillis()` and consider a per-dest/fan-out metric.
+- `medium` — `threadsubs.go:81,100`: per-event `FindThreadRoom` + `FindUserID` = two uncached Mongo reads per thread-sub event, **every redelivery included**. Both are indexed point-reads (`thread_rooms.parentMessageId`, `users.account`), so cheap individually, but a thread-sub arriving before its room/user re-runs *both* on every Nak cycle → dominant Mongo load under a large replay. The resolver-cache (account→userId, parentMessage→threadRoom) is the documented mitigation and was **explicitly deferred by the user** — flagging as a known opportunity only. The empty-FK poison guards correctly prevent invalid rows from Nak-storming.
+- `medium` — `targetstore.go:64-88`: `FindThreadRoom`/`FindUserID` fetch whole documents (dup of the projection finding) — wasted bytes/decode on the hottest read path.
+- `low` — `main.go:169`: sequential `cons.Consume()` caps throughput at one in-flight event. Acceptable given single-replica intent, ordering simplicity, and the resolver-miss Nak/retry model; flagging the trade-off, not a defect.
+- `low` — `subscriptions.go:308`: `mustMarshal` per event (up to 5 per insert via `publishSubscriptionState`); fixed-shape small structs, negligible at migration scale.
+- `nitpick` — `inbox-worker` legacy index `DropOne` runs every startup (idempotent, one extra admin command per boot); `subscription_deleted` `DeleteOne(_id)` is an optimal primary-key delete.
+
+No goroutine leaks, no blocking calls in the consume callback beyond intended I/O; shutdown ordering (`cc.Stop()` → drain → Mongo) correct.
+
+---
+
+# Lens: Observability
+
+Strong baseline: slog-JSON everywhere, `request_id` stamped once at `processOne` entry (`natsutil.StampRequestID`) and threaded via ctx into every disposition log + the inbox publish, nil-safe metrics with op+collection dispositions, `TermExhausted` metered instead of silent JetStream drops. No `fmt.Print`/`log.Print`; no err-logged-and-returned double-logs in the new handlers.
+
+- `medium` — `subscriptions.go:80-91`: subscription hard-delete is **observability-silent**. The delete branch publishes `subscription_deleted` and returns nil → `onProcessed("delete", subsColl)`, indistinguishable from any other processed delete. A mass-leave storm or a bug turning every delete into a downstream no-op is invisible. Add a reason label or dedicated delete counter.
+- `medium` — `subscriptions.go:98`, `users.go:56`, `threadsubs.go:58`: `onSkipped(ctx, ev.Op+"_skip")` builds Prometheus labels from connector-controlled `ev.Op`. A malformed/garbage `op` becomes a new time series (unbounded-cardinality risk). Map to a fixed enum before labeling. Low real-world risk (fixed connector), but a latent cardinality vector.
+- `low` — `users.go:113-135`: user-status fan-out not separately observable — no per-dest or fan-out-total counter; the Nak log (`main.go:248`) carries only `eventId`, not the failing dest, so a single-site routing gap is hard to localize.
+- `low` — `handler.go:80` + mappers: no OTel child span on the per-collection handlers (only `migration source.findByID` spans). Trace waterfalls can't attribute latency to FK resolution vs publish vs decode. Not a regression; a gap for a multi-hop migration path.
+- `nitpick` — `sourcelookup.go:35`: shared tracer renamed to `"migration"` while `oplog-transformer/historyclient.go:44` still uses `"oplog-transformer"` — two instrumentation-scope names coexist in one binary. Intentional/harmless; note for dashboard authors.
+
+PII/secrets: clean. No log emits source bodies, `statusText` content, or tokens; usernames appear only in error strings (`account %q`), acceptable. Request-ID propagation correct across both services, including the new `handleSubscriptionDeleted`. Pre-existing gap: `inbox-worker/handler.go:119` `unknown event type` warn omits request_id (not introduced here; the new `subscription_deleted` case sits above it).