Skip to content

chore(ops): one-off thread subscription leak reconciler#342

Open
yujiawei wants to merge 1 commit into
mainfrom
chore/thread-subscription-leak-reconcile
Open

chore(ops): one-off thread subscription leak reconciler#342
yujiawei wants to merge 1 commit into
mainfrom
chore/thread-subscription-leak-reconcile

Conversation

@yujiawei

@yujiawei yujiawei commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What

Adds a one-off, idempotent, dry-run-first maintenance command that removes
stale community-topic (thread) subscriptions left in the IM layer for members
who are no longer eligible.

Event-driven fixes only handle removals that happen going forward; they don't
reconcile subscriptions that leaked before the fix shipped. This command closes
that gap as a one-time backfill.

Scope

  • Scans group_member for members with is_deleted=1 OR blacklist status.
  • For each (group_no, uid), reuses the existing
    queryThreadShortIDsForCleanup helper to enumerate all non-deleted threads
    of that group and removes the user's subscription on each thread channel.
  • Only removes IM subscriptions — it does not delete thread_member /
    thread_setting rows or touch pins / conversation extensions. (This is
    deliberately narrower than the kick/leave cleanup path.)

Safety

  • Dry-run by default: without --apply it only counts the
    (uid, thread) pairs that would be removed; it never calls the IM API.
  • Idempotent: subscriber removal is a no-op for non-existent
    subscriptions, so re-runs are safe.
  • Rate limiting: --batch-size and --interval bound the load on the IM
    service for large groups.
  • Failures are recorded, not fatal: a failed call is logged and collected
    into the report; the run continues and exits non-zero if any failures
    occurred.

Usage

# dry-run (default): only report counts
./app -config configs/tsdd.yaml reconcile-thread-subs

# apply, with rate limiting
./app -config configs/tsdd.yaml reconcile-thread-subs --apply --interval 200ms --batch-size 50

See docs/thread-subscription-reconcile-runbook.md for the full runbook
(dry-run vs apply, parameters, exit codes, rollback/impact notes).

Tests

New unit tests cover scanning, dry-run-does-not-write, idempotency, batching,
and failure-isolation. go build ./..., go vet ./..., and the group package
tests all pass.

Rollback

No rollback needed — the tool only removes IM subscriptions and deletes no
data. Affected users are already kicked/left/blacklisted and should not be
receiving those thread messages anyway.

Part of #YUJ-4186

历史上被踢/退群(is_deleted=1)或被拉黑(status=blacklist)的成员,入群时挂进
WuKongIM 的子区频道订阅可能从未被对称摘除,事件驱动修复只覆盖未来移除、不处理
存量泄漏。新增一次性对账工具扫这两类成员,复用 queryThreadShortIDsForCleanup
枚举该群所有非 deleted 子区,逐个 IMRemoveSubscriber 摘订阅。

- 只摘订阅、不删 DB 数据(区别于被踢/退群路径的 removeUserFromGroupThreadsCleanup)
- 默认 dry-run(只统计),加 --apply 才执行;支持 --batch-size / --interval 限速
- 幂等(IMRemoveSubscriber 对不存在订阅 no-op),失败只记录不中断
- 作为主二进制子命令 reconcile-thread-subs 接入 main.go
- 单测覆盖扫描 / dry-run 不写 / 幂等 / 分批 / 失败不中断
- 附运维手册 docs/thread-subscription-reconcile-runbook.md

Part of #YUJ-4186
@yujiawei yujiawei requested a review from a team as a code owner June 11, 2026 02:58
@github-actions github-actions Bot added the size/XL PR size: XL label Jun 11, 2026

@mochashanyao mochashanyao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Octo-Q · automated review]

Verdict: Approve — no blocking findings; notes below (data-flow traced).


PR#342 Review Report — chore(ops): one-off thread subscription leak reconciler

Reviewer: Octo-Q (automated review)
PR: #342
Head SHA: 7c6feec485c16d31e5d8c5b85eb9aee350ae0ca8
Files: 4 changed (+588 / −1) — main.go, modules/group/thread_subscription_reconcile.go, modules/group/thread_subscription_reconcile_test.go, docs/thread-subscription-reconcile-runbook.md


1. Verification Summary

Item Status Evidence
Dry-run default safety Apply defaults false; Run() returns before any removeFn call when !opts.Apply (thread_subscription_reconcile.go:207-209)
Idempotency IMRemoveSubscriber is no-op for non-existent subs; TestReconcile_Idempotent verifies re-run stability
ChannelID format parity groupNo + "____" + shortID matches thread.BuildChannelID (thread/service.go:800-801, separator "____" at thread/const.go:17) and removeUserFromGroupThreadsCleanup (thread_cleanup.go:76)
ChannelType consistency common.ChannelTypeCommunityTopic.Uint8() — same as thread_cleanup.go:80, event.go:629, service.go:1658
SQL leak criteria correctness is_deleted=1 OR status=? with int(common.GroupMemberStatusBlacklist) — matches criteria used in db.go:421 and api.go:3133
queryThreadShortIDsForCleanup reuse Same function used by removeUserFromGroupThreadsCleanup (thread_cleanup.go:63); filters status!=3 (excludes deleted threads, includes active+archived)
main.go subcommand routing flag.Args() tried first, os.Args[1] fallback preserved; dash-stripping normalizes reconcile-thread-subsreconcilethreadsubs
Backward compat (api/config startup) When no subcommand, flag.Args() empty → falls to os.Args[1] path, identical to pre-PR behavior
Failure isolation Per-batch continue on error (thread_subscription_reconcile.go:232-240); TestReconcile_FailureRecordedNotAborted verifies
Rate limiting time.Sleep(opts.Interval) before each IM call; --batch-size splits large UID sets
Exit codes 0=clean, 1=scan error (err != nil), 2=partial failures (len(report.Failures) > 0) — matches runbook
Test coverage 7 tests: scan, dry-run, idempotent, dedup, failure-isolation, empty, batching

2. Findings

F1 — P2: Full-table scan on group_member without pagination

File: thread_subscription_reconcile.go:134-142
Diff-scope: new (this PR introduces the query)

_, err := r.ctx.DB().Select("group_no", "uid").
    From("group_member").
    Where("is_deleted=1 OR status=?", int(common.GroupMemberStatusBlacklist)).
    OrderAsc("group_no").
    Load(&rows)

This loads all leaked member rows across all groups into memory in a single query. On a production deployment with millions of group_member rows, this could cause a slow query and memory spike.

Mitigation already present: This is a one-off maintenance tool, not a hot path. The OrderAsc("group_no") helps if an index exists. The runbook recommends dry-run first.

Suggestion: Consider adding a note in the runbook about expected DB load for large deployments, or add an optional --limit flag for staged execution. Not blocking.

F2 — P2: Non-deterministic processing order (Go map iteration)

File: thread_subscription_reconcile.go:175 (for groupNo, uids := range byGroup)
Diff-scope: new

buildPlan iterates over map[string][]string returned by scanLeakedMembers. Go map iteration order is randomized, so the order of groups in the report and in IM-call execution varies between runs.

Impact: Cosmetic for the report; functionally irrelevant since groups are independent. Not blocking.

F3 — P2/nit: Inline channelID separator

File: thread_subscription_reconcile.go:215channelID := p.groupNo + "____" + shortID
Diff-scope: new

The "____" literal is hardcoded rather than referencing thread.ChannelIDSeparator. However, this follows the established pattern in thread_cleanup.go:76 and api.go:3074 (both in the group package, both inline). Consistent with local convention. Not blocking.

3. Suggestions

  1. Runbook addition (F1): Add a line in the runbook's "推荐执行顺序" section about expected query duration / row count for large deployments, so ops can plan accordingly.
  2. Optional: Consider --group-no filter flag to scope the reconciler to a specific group for targeted runs (useful when dry-run reveals one problematic group).

4. Additional Findings

None. The PR is narrowly scoped to its stated purpose. No unintended side effects on the API server path, no new HTTP endpoints, no changes to existing event-driven cleanup paths.

5. Data-Flow Trace

Consumed Data Upstream Source Reaches Consumer?
byGroup map (leaked members) SQL: SELECT group_no, uid FROM group_member WHERE is_deleted=1 OR status=? ✅ Rows loaded → Go-level dedup → map returned to buildPlan
shortIDs (threads per group) queryThreadShortIDsForCleanup → SQL: SELECT short_id FROM thread WHERE group_no=? AND status!=3 ✅ Returns non-deleted thread short_ids; empty slice → group skipped
channelID groupNo + "____" + shortID ✅ Matches thread.BuildChannelID format; consumed by removeFn
SubscriberRemoveReq Constructed with ChannelID, ChannelType=ChannelTypeCommunityTopic, Subscribers=chunk ✅ Same struct shape as thread_cleanup.go:78-82, event.go:629, service.go:1658
removeFn default ctx.IMRemoveSubscriber(...) ✅ Called only when opts.Apply=true; dry-run short-circuits at Run():207-209
chunk (batched UIDs) p.uids[start:end] slicing ✅ Correctly bounded; end capped at len(p.uids)

No data-flow gaps found. Every consumed value traces back to a verified upstream source.

6. Blind-Spot Checklist (R5)

C1 — Double-path parity: N/A. This is a one-off reconciler, not a paired add/remove operation. No symmetric path to maintain.

C2 — Control-flow ordering / nesting reuse: Clear. removeFn has a single call-site (the nested loop in Run). No risk of double-application or ordering issues. No regex/sanitize to test with non-canonical input.

C3 — Authorization boundary ≠ capability boundary: Clear. This is a CLI subcommand requiring server/container access. No HTTP endpoint exposed. --apply flag gates writes (default dry-run). No privilege escalation vector.


[Octo-Q] verdict: APPROVE — No P0/P1 findings. Three P2/nit observations (full-table scan, map iteration order, inline separator) are all acceptable for a one-off maintenance tool with dry-run safety, idempotency, rate limiting, and comprehensive test coverage. Well-engineered PR.

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is relevant to octo-server and implements a scoped one-off maintenance command for existing group/thread IM subscription state.

💬 Non-blocking

  • 🟡 Warning: modules/group/thread_subscription_reconcile.go:63-64 and :241 report PairsRemoved as “actual removed” pairs, but IMRemoveSubscriber is idempotent/no-op for missing subscribers, so this is really “successfully requested pairs.” Consider renaming/reporting it that way to avoid overclaiming in ops output.
  • 🔵 Suggestion: modules/group/thread_subscription_reconcile.go:225-226 sleeps before the first IM call, although the flag is documented as an interval between calls. This is safe, but moving the sleep after each call, or gating it on report.IMCalls > 0, would match operator expectations.
  • 🔵 Suggestion: modules/group/thread_subscription_reconcile.go:171-190 builds plans by ranging over a map, so group processing and failure ordering are nondeterministic. Sorting group keys would make dry-run/apply reports easier to compare.
  • 🟡 Warning: I could not verify the new group tests in this local checkout because testutil.NewTestServer panics on existing shared test DB migration state (unknown migration in database, e.g. 20210926000001_robot_legacy01.sql). This appears environmental, not caused by this PR.

✅ Highlights

  • Dry-run default is correctly enforced before any removeFn call (modules/group/thread_subscription_reconcile.go:203-205).
  • The cleanup scope intentionally reuses queryThreadShortIDsForCleanup, preserving the existing active+archived, non-deleted thread behavior (modules/group/thread_subscription_reconcile.go:176).
  • Failure isolation is implemented as described: individual IM removal failures are logged, recorded, and do not abort later batches (modules/group/thread_subscription_reconcile.go:229-239).
  • Tests cover the key behaviors: scanning, dry-run no-write, idempotent rerun, batching, and failure isolation.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #342 (octo-server)

Verdict: APPROVED

A one-off, idempotent, dry-run-first maintenance command (reconcile-thread-subs) that backfills the historical thread-subscription leak. The core logic is correct and well-tested, build/vet/all 7 new unit tests pass, and the design choices (only removing IM subscriptions, never deleting DB rows; dry-run by default; failure-isolation) are sound and appropriately conservative. No correctness, security, data-loss, or build-breaking issues — nothing blocks merge. The notes below are P2 robustness/observability and documentation-accuracy items plus a few nits; none requires a change before merge.

What was verified ✅

  • Scan predicate is correct. scanLeakedMembers uses WHERE is_deleted=1 OR status=2 (thread_subscription_reconcile.go:137-141). GroupMemberStatusBlacklist == 2 / Normal == 1 are the only two statuses in octo-lib common/constant.go, so the predicate exactly and exclusively covers the two documented leak classes (kicked/left + blacklisted). The OR string is the sole argument to a single Where(), so there is no SQL precedence hazard.
  • Reuse is correct. It correctly reuses queryThreadShortIDsForCleanup (returns status!=3, i.e. active + archived, excludes deleted) and builds the channel ID as {groupNo}____{shortID}, matching the existing kick/leave path (thread_cleanup.go:77, service.go:1887).
  • Dry-run never writes. Run returns before the apply loop when Apply=false (thread_subscription_reconcile.go:203-206); IMCalls/PairsRemoved stay 0. Covered by TestReconcile_DryRunDoesNotWrite.
  • Batching & failure-isolation are correct. batch-size <= 0 falls back to 100 (prevents an infinite start += 0 loop); a failed removal is logged + recorded + continued without aborting the run; the failure record defensively copies the chunk (append([]string(nil), chunk...)).
  • main.go flag refactor is backward-compatible. All existing invocations still route correctly: bare ./app and ./app -config x.yaml fall through flag.Args() (empty) to the os.Args branch → runAPI; ./app -config x.yaml reconcile-thread-subs --applyflag.Args()=["reconcile-thread-subs", ...] → the new subcommand. go build ./... and go vet ./modules/group/ are clean.

Findings (all non-blocking)

P2 — Report can undercount when a per-group thread query fails (observability)

thread_subscription_reconcile.go:172-191GroupsAffected and LeakedMembers are incremented before the per-group queryThreadShortIDsForCleanup call. If that query fails (continue at :183) or returns zero threads (continue at :186), those leaked members still count toward LeakedMembers/GroupsAffected but contribute nothing to ThreadsScanned/PairsPlanned. So a run with thread-query failures can show e.g. LeakedMembers=5000, PairsPlanned=200, which an operator could misread as "almost nothing to do." This is recoverable (failures appear in report.Failures, the tool is re-runnable), but the aggregate counters can mislead.
Suggestion: track a GroupsSkippedThreadQueryFailed counter and surface it in String(), or add a one-line note (in the field comment or runbook) telling operators to reconcile LeakedMembers vs PairsPlanned against the Failures list.

P2 — Runbook exit-code table is imprecise; dry-run can exit 2

docs/thread-subscription-reconcile-runbook.md:56-58 — The table presents exit 1 as "扫描阶段出错(如 DB 查询失败)" and exit 2 as "摘除阶段有失败项". In the actual code:

  • Only the top-level scanLeakedMembers failure propagates as an error → exit 1 (main.go exit-1 branch).
  • A per-group thread-query DB failure is recorded into report.Failures and yields exit 2 (main.go exit-2 branch) — even in dry-run, because buildPlan runs before the dry-run early return.

So "a DB query failed" maps to exit 1 or exit 2 depending on which query failed, and exit 2 is not purely an apply-stage outcome. The binary behavior is fine; only the doc is ambiguous for ops automation that branches on exit codes.
Suggestion: exit 1 = the initial member scan failed (aborts before planning); exit 2 = one or more per-group operations failed (thread query OR IMRemoveSubscriber), run completed — and note dry-run can also exit 2. (The full report.String() already disambiguates: thread-query failures render with an empty channel= and an err=query threads: ... prefix.)

P2 — Unbounded in-memory scan

thread_subscription_reconcile.go:136-161scanLeakedMembers loads every (group_no, uid) matching is_deleted=1 OR status=blacklist across all groups into memory at once (rows slice + dedup maps + plans), with no LIMIT/pagination. Since this tool exists precisely because leakage accumulated over the whole bug lifetime, on a large deployment this set can be very large, and the recommended invocation is inside a memory-limited container (docker exec). Per-record footprint is small (two short strings), so this is a scalability/robustness concern, not a guaranteed OOM.
Suggestion: iterate group-by-group (SELECT DISTINCT group_no, then scan each group's leaked members) so only one group's working set is resident; at minimum, note an expected-rows ceiling / "run off-peak" in the runbook.

P2 — Idempotency promise rests on an unasserted WuKongIM behavior

thread_subscription_reconcile.go:31 + runbook lines 20/64 — The "safe to re-run / safe to run early" guidance depends on IMRemoveSubscriber being a server-side no-op for a non-subscribed uid. The idempotency unit test uses a mock that always returns nil, so it proves the tool re-issues identical calls, not that WuKongIM tolerates the redundant remove. If WuKongIM ever errored on a missing subscriber, re-runs would populate report.Failures and exit 2, contradicting the "harmless re-run" claim. (Mitigation: the identical call shape is already relied on by the shipped kick/exit path — thread_cleanup.go:78, event.go:629/800, service.go:1658 — so this is established codebase behavior, just undocumented here.)
Suggestion: cite the WuKongIM no-op-on-missing-subscriber guarantee in the comment/runbook, or soften the wording to "failures on already-removed subscribers are recorded but non-fatal."

P2 — Test gaps (build-passing, but coverage holes)

  • thread_subscription_reconcile_test.go:141-153TestReconcile_DedupesMemberAcrossDeletedAndBlacklist inserts a single row, so the seen-map dedup branch (scanLeakedMembers:155-158) never fires; the test would pass even if the dedup code were deleted. To actually exercise it, insert two rows for the same (groupNo, uid).
  • --interval (the production rate-limit knob the runbook recommends) has no test — a regression that drops/misplaces the sleep would silently hammer WuKongIM and go uncaught. A robust check: dry-run with Interval set still makes zero removeFn calls (avoid wall-clock assertions, which are flaky in CI).
  • No multi-group test: cross-group counter summation and per-group plan isolation (group A's uids not leaking into group B's channels) are unverified (map iteration order is nondeterministic). Logic looks correct by inspection, but it is untested.
  • The per-group thread-query-failure path (Failures + exit 2) and the top-level scan-failure path (exit 1) — the two exit codes the runbook documents — are both untested. Consider a queryFn seam analogous to removeFn, or a drop-table test asserting the err=query threads: shape with empty ChannelID.

Nits

  • thread_subscription_reconcile.go:146-159 — the cross-row dedup map is effectively dead code given the group_no+uid unique index on group_member (defensive, harmless).
  • thread_subscription_reconcile.go:218-228 — the rate-limit time.Sleep fires before the first IM call, wasting exactly one interval globally; conventional shape is sleep-between (skip when IMCalls==0).
  • main.go:90-107./app -config=x.yaml (equals-form, no subcommand) puts -config=x.yaml into flag.Args(), so serverType becomes a non-matching value and the process exits 0 doing nothing. This is a pre-existing quirk of the bare-arg dispatch (the space-form -config x.yaml works), not introduced by this PR, but worth being aware of.
  • main.go:105-108 — an unknown subcommand / stray arg exits 0 silently (pre-existing). A small default: branch logging "unknown subcommand" would improve operability.
  • Apply-mode report prints two extra lines (实际摘除订阅对, IM 调用次数) that the runbook's sample (DRY-RUN only) never illustrates; an apply-mode sample block would make the runbook complete.

Summary

Solid, conservative, well-scoped one-off tool. The reconciliation logic, reuse of the existing cleanup helper, dry-run gating, and failure-isolation are all correct and verified. Remaining items are observability/doc-accuracy refinements and test-coverage strengthening — none blocks merge. Recommend addressing the report-undercount note and the runbook exit-code clarification as a small follow-up, since this is an ops tool whose primary interfaces are its report output and exit codes.

@yujiawei yujiawei self-assigned this Jun 11, 2026
@lml2468 lml2468 added stage:review Review phase - QA/security/code review review:running:qa qa-engineer review in progress review:running:security security-engineer review in progress review:running:code code-reviewer review in progress labels Jun 27, 2026
@lml2468

lml2468 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

QA Verdict: PASS-WITH-NITS

Scope reviewed: 4 files, +588/-1 (one-off reconciler subcommand + runbook + unit tests).

Coverage assessment

Strong coverage on the acceptance-critical invariants:

  • TestReconcile_DryRunDoesNotWrite — locks in the "dry-run default never calls IMRemoveSubscriber" contract; this is the highest-risk regression and it is gated.
  • TestReconcile_ScansLeakedMembersAndThreads — confirms only is_deleted=1 OR status=blacklist are picked up, normal members stay, and status=3 (deleted) threads are excluded.
  • TestReconcile_Idempotent — re-run safety asserted via stable PairsRemoved/PairsPlanned.
  • TestReconcile_DedupesMemberAcrossDeletedAndBlacklist — guards the OR-clause dedup edge.
  • TestReconcile_FailureRecordedNotAborted — single bad channel isolated, other channels still removed.
  • TestReconcile_BatchingSplitsUIDsBatchSize=2 over 3 uids → 2 calls, PairsRemoved=3.
  • TestReconcile_NoLeakedMembers — empty-input report is all zeros, no IM call.

CI: Build, Test, Vet, Lint, code-review (workflow run 27321751871 et al) all SUCCESS; referencing per skill §4 (no local re-run needed for this verdict).

Gaps (nits, not blockers)

  1. No multi-group scan test. Every test uses a single groupNo. byGroup aggregation across groups, report.GroupsAffected incrementing >1, and buildPlan looping multiple plans are not exercised. A 2-group fixture would cost ~10 LOC and harden the only non-trivial aggregation path.
  2. No Interval rate-limit test. time.Sleep side-effect is uncovered; an injectable clock or a >0 interval with assert on observable side-effect (e.g. wall time floor) would close this. Low priority — the code path is 3 lines.
  3. queryThreadShortIDsForCleanup failure path is unreached. buildPlan records a failure and continues on query error, but no test injects this. Hard to test without a DB hook; acceptable risk.
  4. ReconcileReport.String() format is uncovered. Operators read this output directly per the runbook; a snapshot/golden test would catch accidental format drift that breaks log scrapers.
  5. Exit-code paths in runReconcileThreadSubs (main.go) are not unit-tested. Acceptable for a thin wiring shim; the runbook documents exits 0/1/2 so a manual smoke against the binary would suffice — recommend the operator runs a dry-run in staging before --apply.

Verify completeness

  • Runbook (docs/thread-subscription-reconcile-runbook.md) clearly states dry-run-first → --apply workflow, parameters, exit codes, and rollback note (none needed — only removes IM subscriptions).
  • Recommended execution order in runbook §"推荐执行顺序" is sound: dry-run → wait for upstream blacklist-filter fix → --apply with --interval.
  • No evidence of staging dry-run report attached to PR. Suggest operator runs reconcile-thread-subs (dry-run) in staging and pastes the report before --apply in prod.

Verdict

PASS-WITH-NITS. Safety-critical paths (dry-run default, idempotency, failure isolation, deleted-thread exclusion) are covered. Gaps above are quality-of-life, not blockers — recommend addressing 1 (multi-group test) in a follow-up if this code ever grows beyond a one-off.

@lml2468 lml2468 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA review complete — see verdict comment above (PASS-WITH-NITS).

Summary: Safety-critical invariants (dry-run default, idempotency, batching, failure isolation, deleted-thread exclusion, OR-clause dedup) all covered. CI green (Build/Test/Vet/Lint/code-review).

Non-blocking nits: missing multi-group scan test, Interval not exercised, buildPlan query-error path uncovered, ReconcileReport.String() has no golden test, exit-code paths in runReconcileThreadSubs not unit-tested.

Operator ask: Please attach a staging dry-run report to this PR (or in the linked issue) before --apply in prod.

@lml2468 lml2468 added review:done:qa:comment qa-engineer PASS-WITH-RISK and removed review:running:qa qa-engineer review in progress labels Jun 27, 2026
@lml2468

lml2468 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Security Verdict: CLEARED

Scope reviewed: 4 files, +588/-1 (one-off ops CLI subcommand; only IM-side mutation via IMRemoveSubscriber).

STRIDE / OWASP pass

  • Spoofing / AuthN: out of scope — this is an operator CLI, not an HTTP endpoint. Trust boundary is "anyone with shell into the container can run this." Documented in runbook (docker exec ... reconcile-thread-subs). Acceptable for a one-off ops tool; no new auth surface added.
  • Tampering: only writes are IMRemoveSubscriber calls; no DB row mutation. Idempotent + non-destructive (subscription can be re-attached by the normal join path). No SQL writes added.
  • Repudiation: failures are logged with zap (groupNo, channelID, uid count, err) and surfaced in ReconcileReport.Failures. Exit code 2 on any failure. Audit trail is adequate for a one-off run.
  • Information disclosure: ReconcileFailure.UIDs stores raw uids in memory, but ReconcileReport.String() only renders uids=<count> — uids are not printed to operator stdout. Good. UIDs do reach zap.Int("uids", len(chunk)) in error logs (length only, not values). No PII leak.
  • DoS: rate-limited via --interval (operator-controlled); --batch-size bounds request size. No unbounded fan-out per group — each chunk is <= BatchSize. Safe for production IM.
  • Elevation of privilege: none. Tool only removes subscriptions; it cannot add. The action is strictly de-privileging (revoking subscriptions for already-kicked/blacklisted members).

SQL injection

Where("is_deleted=1 OR status=?", int(common.GroupMemberStatusBlacklist))status is parameterized; is_deleted=1 is a literal constant. No user input ever reaches the query. Safe.

Crypto / secrets

No crypto, no secret handling, no new auth tokens. No changes to pkg/auth or any credential path.

SBOM / dependency

gh pr view: no dependencies-changed label, no go.mod/go.sum diff. No new dependencies introduced. osv-scanner CI run = SUCCESS.

Behavioural safety

  • dry-run default: enforced both in code (opts.Apply default false) and in test (TestReconcile_DryRunDoesNotWrite asserts zero IMRemoveSubscriber calls). This is the dominant security control and it is gated.
  • Failure isolation: a poisoned channel/batch does not propagate; the rest of the run continues. Prevents partial-outage amplification.
  • No DB row deletion: explicitly narrower than removeUserFromGroupThreadsCleanup (which deletes thread_member / thread_setting). Documented in the file header comment; reduces blast radius if this tool is ever misconfigured.

Risks (informational, not blocking)

  • R1 — log volume on apply with many failures: each failure logs at ERROR with zap.Error(rmErr). A large-scale IM outage during --apply would flood error logs. Operator-mitigable via --interval and small --batch-size. Note in runbook would be nice but not required.
  • R2 — non-deterministic UID ordering: byGroup aggregation goes through a Go map, so the order uids are sent to IMRemoveSubscriber per channel is non-deterministic. Functionally fine (the IM API is set-based), but it means re-runs of the same input may interleave failures differently — slightly harder forensics. Acceptable for a one-off.

Verdict

CLEARED — no security blockers. Adversarial path coverage is good: SQL is parameterized, no new auth/secret surface, the only side-effect is an explicitly-narrow IM unsubscribe, dry-run-default is test-enforced, rate limiting is operator-controlled.

@lml2468 lml2468 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security review complete — see verdict comment above (CLEARED).

Summary: No new auth/secret surface, no SQL injection (status is parameterized), no new dependencies (osv-scanner SUCCESS), dry-run default is test-enforced, only side-effect is IM unsubscribe (non-destructive, idempotent).

Informational risks: (R1) error-log volume on large --apply failures, mitigable via --interval. (R2) non-deterministic UID ordering in batches — functionally fine, just slightly harder forensics.

No blockers. Approve for security gate.

@lml2468 lml2468 added review:done:security:approve security-engineer CLEARED and removed review:running:security security-engineer review in progress labels Jun 27, 2026
@lml2468

lml2468 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Code Verdict: APPROVE-WITH-NITS

Scope reviewed: 4 files, +588/-1.

Correctness

  • scanLeakedMembers predicate is_deleted=1 OR status=? matches the runbook's stated leak criteria (踢/退群 OR 拉黑). Aggregation + per-group uid dedup is correct.
  • buildPlan short-circuits cleanly on per-group thread-query failure (records ReconcileFailure, continues, no partial double-counting in ThreadsScanned / PairsPlanned). Verified by reading the order: ++GroupsAffected and LeakedMembers happen before the queryThreadShortIDsForCleanup call — so a query failure leaves GroupsAffected incremented but ThreadsScanned/PairsPlanned not. Minor semantic question (see Nit 3), but consistent.
  • Run honors Apply=false with an early return after buildPlan — dry-run still produces full plan counts. Matches ReconcileReport.DryRun contract.
  • Channel ID construction p.groupNo + "____" + shortID (4 underscores) matches the format declared in the file header comment ("与 thread.BuildChannelID 及 removeUserFromGroupThreadsCleanup 一致").
  • Batching arithmetic: for start := 0; start < len(p.uids); start += batchSize { end := start + batchSize; if end > len ... } — standard, correct, no off-by-one.
  • Interval sleep is applied per-call (per channel × per batch), not per-batch globally. For a group with many threads this multiplies the wait — likely intentional (cap IM QPS), but worth a runbook line.

Design / fit

  • Splitting into ThreadSubscriptionReconciler + injected removeFn is a clean test seam — avoids needing a fake WuKongIM in unit tests. Mirrors patterns used elsewhere in modules/group.
  • Exporting only RunThreadSubscriptionReconcile from main.go's perspective keeps queryThreadShortIDsForCleanup (unexported) inside the package. Good encapsulation.
  • ReconcileOptions / ReconcileReport / ReconcileFailure are well-typed; nothing leaks interface{}.
  • File-level comment block (40+ lines) explains the why (event-driven fixes don't backfill; this closes the gap) and the what-not (no DB row deletion, no pin/conversation touch) — excellent context for future readers.

Readability

  • Naming is clear: LeakedMembers, PairsPlanned, PairsRemoved, IMCalls, groupPlan. Counter semantics are obvious from the names alone.
  • Comments in Chinese match the runbook and the existing modules/group style.
  • ReconcileReport.String() output is operator-readable and matches the runbook's sample report verbatim — operators can grep for these labels.

Nits (non-blocking)

  1. Dead panic(err) in main.go:runReconcileThreadSubs:

    fs := flag.NewFlagSet("reconcile-thread-subs", flag.ExitOnError)
    ...
    if err := fs.Parse(args); err != nil { panic(err) }

    flag.ExitOnError already calls os.Exit(2) on parse failure; the panic(err) branch is unreachable. Either change to flag.ContinueOnError (and keep the panic), or drop the if err != nil block. Cosmetic.

  2. Hardcoded "____" separator (4 underscores) duplicated: comment says it must stay in sync with thread.BuildChannelID and removeUserFromGroupThreadsCleanup. If BuildChannelID is callable from this package, calling it (instead of string-concatenating) would prevent silent drift. If not exported, a package-level const threadChannelSep = "____" near the existing helpers and reused both here and in removeUserFromGroupThreadsCleanup would centralize it. Not a regression risk today — but the comment itself is a flag that the author already feels the duplication.

  3. GroupsAffected counts groups with zero successful plans: a group whose queryThreadShortIDsForCleanup fails still increments GroupsAffected and LeakedMembers, even though zero pairs end up planned/removed. The report ends up reading "12 groups affected, 0 pairs planned" in the worst case, which is slightly misleading. Consider either (a) only increment after successful plan build, or (b) rename to GroupsScanned to match the actual semantic. Cosmetic.

  4. OrderAsc("group_no") in scanLeakedMembers: the SQL ORDER BY group_no is preserved in the row slice, but then dropped on entry into the byGroup map (Go maps are unordered). The ORDER BY is dead — either remove it (slight DB win on large group_member) or replace byGroup with an ordered structure if deterministic output matters for operator debugging. Cosmetic.

  5. Runbook nit: --interval is described as "每次 IM 调用之间的休眠"; clarify that this applies per (channel, batch) pair, not per group — for a group with 38 threads + batched uids, the cumulative wait can be much larger than operators expect.

Verdict

APPROVE-WITH-NITS. Code is correct, well-tested, well-documented, and appropriately narrow for a one-off ops tool. Nothing blocks merge; nits are cleanup opportunities for a follow-up if any.

@lml2468 lml2468 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review complete — see verdict comment above (APPROVE-WITH-NITS).

Summary: Correct (predicate, batching, dry-run early-return, channel ID format all check out), well-designed (clean removeFn test seam, good encapsulation), readable (counter names self-explain, file-header explains the why), and CI green.

Non-blocking nits: (1) unreachable panic(err) after flag.ExitOnError; (2) duplicated "____" channel separator — author's own comment flags the coupling to thread.BuildChannelID; (3) GroupsAffected increments on query-failed groups (cosmetic); (4) dead OrderAsc("group_no") (overwritten by map); (5) runbook should note --interval applies per (channel, batch).

Recommend addressing nits 1 & 2 in a follow-up; not blocking this merge.

@lml2468 lml2468 added review:done:code:approve code-reviewer APPROVED and removed review:running:code code-reviewer review in progress labels Jun 27, 2026
@lml2468

lml2468 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Aggregate Verdict: RISKED — needs-human-review

Three reviewer verdicts are in. QA carries a non-blocking risk flag (PASS-WITH-NITS/with-risk), security cleared, code approved-with-nits. Per loop policy, any single with-risk verdict routes to a human gate rather than auto-APPROVED.

role label verdict one-line summary
qa review:done:qa:comment PASS-WITH-NITS safety-critical paths (dry-run default, idempotency, failure isolation, deleted-thread exclusion) all covered; nits = multi-group test, interval-sleep test, report-format snapshot — none blocking
security review:done:security:approve CLEARED SQL parameterized, no new auth/secret surface, only side-effect is narrow IM unsubscribe, dry-run-default is test-enforced, rate limiting operator-controlled
code review:done:code:approve APPROVE-WITH-NITS correct + well-typed + well-documented; nits = unreachable panic after flag.ExitOnError, duplicated "____" separator, GroupsAffected semantics, dead OrderAsc, runbook --interval wording

Next step

  • Human merger to decide whether the QA nits (multi-group test fixture, format snapshot, runbook --interval clarification) need follow-up before merge or after.
  • Code nits are cosmetic and can ride a follow-up PR if any.
  • No changes are required to ship this PR as-is.

Posted by pr-tick-octo-server autopilot. Loop will not merge — merge is human-only.

@lml2468 lml2468 added review:complete 3 verdicts aggregated, awaiting human merge needs-human-review labels Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-human-review review:complete 3 verdicts aggregated, awaiting human merge review:done:code:approve code-reviewer APPROVED review:done:qa:comment qa-engineer PASS-WITH-RISK review:done:security:approve security-engineer CLEARED size/XL PR size: XL stage:review Review phase - QA/security/code review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants