Skip to content

fix(group): atomic/reconciled group create β€” no permanent orphan rows (#394)#410

Open
lml2468 wants to merge 1 commit into
mainfrom
fix/oct-16-group-channel-reconcile
Open

fix(group): atomic/reconciled group create β€” no permanent orphan rows (#394)#410
lml2468 wants to merge 1 commit into
mainfrom
fix/oct-16-group-channel-reconcile

Conversation

@lml2468

@lml2468 lml2468 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #394. group.CreateGroup created the IM channel after tx.Commit(); on IM-create failure it ran a best-effort, non-atomic compensating DELETE that only logged. A DB blip on that delete β€” or a crash between commit and IM-create β€” left an orphan group with no backing IM channel, permanently queryable.

  • channel_synced flag (migration, DEFAULT 1 for backward compat). CreateGroup writes new rows as 0 in-tx and flips to 1 only after a confirmed IM channel create. Existing rows / other insert paths keep the DB default 1, so they're never mis-detected as orphans.
  • Compensating delete kept (narrows the window) but orphans are now detectable via channel_synced=0 instead of silently permanent.
  • Idempotent periodic reconcile worker: finds channel_synced=0 rows older than a grace window, re-creates the IM channel (IMCreateOrUpdateChannel is create-or-update β†’ idempotent) and flips the flag. Optional Redis tick-lock dedups across instances; lock-free fallback stays correct via idempotency. Disabled under cfg.Test; interval/grace tunable via DM_GROUP_CHANNEL_RECONCILE_INTERVAL_SEC / _GRACE_SEC.
  • Observability: Prometheus metrics for channel-create result, reconcile ticks, orphans detected, and per-orphan outcome.
  • IM-create call injected on Service so failure modes are unit-testable.

Acceptance (#394)

  • βœ… A group can never remain queryable without a backing IM channel after a create failure β€” orphans are now reconciled (eventual consistency), not permanent.
  • βœ… Reconcile is observable (metrics + structured logs) and idempotent.

API contract

No REST/WS contract change. channel_synced is internal; the group-create response shape is unchanged.

Test plan

  • TestCreateGroup_Success_MarksChannelSynced β€” happy path flips flag to 1
  • TestCreateGroup_IMFail_CompensatingDeleteRemovesGroup β€” IM-fail + delete-OK leaves no orphan
  • TestChannelReconcile_RecreatesChannelAndFlipsFlag β€” crash-between-commit-and-IM / delete-fail residue recovered
  • TestChannelReconcile_Idempotent β€” second run is a no-op
  • TestChannelReconcile_RespectsGraceWindow β€” in-flight creates not raced
  • TestChannelReconcile_SkipsDisbandedGroup β€” disbanded channels not recreated
  • TestChannelReconcile_IMFailLeavesOrphanForRetry β€” flag stays 0 for retry on IM failure
  • full go test ./modules/group/ green; go build ./... green

πŸ€– OCT-16 (sub-task of OCT-10)

…#394)

group.CreateGroup created the IM channel after tx.Commit(); on IM-create
failure it ran a best-effort, non-atomic compensating DELETE that only
logged. A DB blip on that delete β€” or a crash between commit and IM-create β€”
left an orphan group with no backing IM channel, permanently queryable.

- Add channel_synced flag (migration, DEFAULT 1 for backward compat) on the
  group table. CreateGroup writes new rows as 0 in-tx and flips to 1 only
  after a confirmed IM channel create. Existing rows / other insert paths
  keep the DB default 1, so they are never mis-detected as orphans.
- Keep the compensating delete (narrows the window) but orphans are now
  detectable via channel_synced=0 instead of silently permanent.
- Add an idempotent periodic reconcile worker: finds channel_synced=0 rows
  older than a grace window, re-creates the IM channel (create-or-update is
  idempotent) and flips the flag. Optional Redis tick-lock dedups across
  instances; lock-free fallback stays correct via idempotency. Disabled
  under cfg.Test; interval/grace tunable via env.
- Prometheus metrics for channel-create result, reconcile ticks, orphans
  detected, and per-orphan outcome.
- Inject the IM-create call on Service so failure modes are unit-testable.
  Tests cover IM-fail compensation, crash-between-commit-and-IM recovery,
  idempotent re-run, grace window, and disbanded-group skip.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@lml2468 lml2468 requested a review from a team as a code owner June 17, 2026 16:14
@github-actions github-actions Bot added the size/XL PR size: XL label Jun 17, 2026

@mochashanyao mochashanyao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Octo-Q Β· automated review]

Verdict: Approve β€” no blocking findings; notes below (data-flow traced).


octo-server PR#410 Review Report β€” automated review

Reviewer: Octo-Q (automated review)
PR: #410
Head SHA: de9f843c83256ae41be23b8fe6474e4452fd391e
Title: fix(group): atomic/reconciled group create β€” no permanent orphan rows (#394)
Changed files: 7 (+736 / -8)


1. Verification Summary

Item Status Evidence
channel_synced column migration (idempotent, DEFAULT 1) βœ… sql/20260617000001_group_channel_synced.sql β€” INFORMATION_SCHEMA guard + stored proc, matches project pattern (#239/#253)
Model struct NOT carrying channel_synced βœ… db.go:622-649 β€” no ChannelSynced field; AttrToUnderscore won't leak it into other Insert paths
Other Insert paths keep DB default 1 βœ… event.go:173 (system group), event.go:315 (registered group), service.go:194 (AddGroup) β€” all use InsertTx/Insert with Model struct, no markChannelPendingTx call
markChannelPendingTx inside tx before commit βœ… service.go:1148-1152 β€” called after InsertTx, before tx.Commit() at line 1261
MarkChannelSynced after IM create success βœ… service.go:1316-1319 β€” after imCreateChannel returns nil
Compensating delete preserved on IM fail βœ… service.go:1299-1312 β€” delete group_member then group, same as before
Reconciler queries only GroupStatusNormal βœ… db.go:78-85 β€” WHERE channel_synced=0 AND status=? with GroupStatusNormal; disbanded/disabled excluded
Grace window prevents racing in-flight creates βœ… db.go:82 β€” created_at < DATE_SUB(NOW(), INTERVAL ? SECOND)
IM create is idempotent (IMCreateOrUpdateChannel) βœ… config.Context interface β€” create-or-update semantics, safe for reconcile replay
GetSubscribableMemberUIDs excludes blacklist/deleted βœ… db.go:530-537 β€” is_deleted=0 AND status=GroupMemberStatusNormal
Redis lock: SET NX EX + Lua CAS-DEL βœ… channel_reconcile.go:82-86 (Acquire), channel_reconcile.go:72-77 (Lua release script)
Lock failure degrades to lock-free (idempotent) βœ… channel_reconcile.go:196-199 β€” logs warning, continues
Test coverage βœ… 7 tests covering happy path, IM fail + compensating delete, reconcile recovery, idempotency, grace window, disbanded skip, IM retry
Test mode disables worker βœ… api.go:89 β€” if g.ctx.GetConfig().Test { return }

2. Findings

P2-1: QueryReconcilableGroupNos uses fmt.Sprintf for SQL interpolation

File: db.go:82
Diff-scope: new (introduced by this PR)

Where(fmt.Sprintf("created_at < DATE_SUB(NOW(), INTERVAL %d SECOND)", graceSeconds))

graceSeconds is int so %d is safe from SQL injection today. However, this pattern is fragile β€” a future refactor changing the parameter type to string would silently introduce a SQL injection. The project's other queries consistently use ? parameterization.

Recommendation: Use dbr.Expr("created_at < DATE_SUB(NOW(), INTERVAL ? SECOND)", graceSeconds) or a raw Where("created_at < DATE_SUB(NOW(), INTERVAL ? SECOND)", graceSeconds) to keep it parameterized.

P2-2: Redis lock client never closed

File: channel_reconcile.go:93-99, api.go:119
Diff-scope: new (introduced by this PR)

newRedisReconcileLock creates a dedicated rd.Client via rd.NewClient(...), but:

  • ChannelReconciler stores only the reconcileTickLock interface, not the concrete *redisReconcileLock.
  • redisReconcileLock.Close() exists but is never called.
  • Group struct stores reconciler *ChannelReconciler but has no shutdown hook to close the lock's Redis connection pool.

On process exit the OS reclaims sockets, so this is not a production correctness issue. But it leaks FDs in long-running integration tests or if the module is ever hot-reloaded.

Recommendation: Either (a) have ChannelReconciler store the concrete lock and expose a Close() that calls lock.Close(), invoked from a module shutdown hook, or (b) reuse the existing shared Redis client from octoredis instead of creating a dedicated one.

Nit-1: No retry cap for permanently-failing orphans

File: channel_reconcile.go:228-267
Diff-scope: new

If the IM service is permanently unavailable for a specific group, the reconciler retries every 2 minutes indefinitely. This generates continuous im_fail metrics and log noise. Consider adding a retry_count column or a last_attempt_at timestamp to implement exponential backoff or a max-retry cutoff with alerting.

Nit-2: Partial compensating delete edge case

File: service.go:1303-1312
Diff-scope: pre-existing (amplified visibility)

If IM creation fails β†’ compensating delete of group_member succeeds β†’ compensating delete of group fails, the group row remains with channel_synced=0 and zero members. The reconciler will then:

  1. Call GetSubscribableMemberUIDs β†’ returns empty
  2. Mark channel_synced=1 (skip)

Result: a queryable group with zero members and an unnecessary IM channel. This is a pre-existing edge case (old code left the same orphan permanently), and the PR's behavior is arguably better (at least the channel exists), but worth noting.

3. Recommendations

  1. Parameterize the grace-seconds SQL (P2-1) β€” low-effort, high-value defensive coding.
  2. Wire up Redis lock cleanup (P2-2) β€” either close on shutdown or reuse the shared client.
  3. Consider a retry cap / backoff for the reconciler (Nit-1) as a follow-up.

4. Additional Observations

  • The event.go system-group and registered-group creation paths create the IM channel before tx.Commit() and rollback on failure β€” they have no orphan window and correctly don't need channel_synced treatment. The PR's scope (only CreateGroup in service.go) is correct.
  • The integration/api_groups.go path calls groupService.CreateGroup() and thus inherits the fix automatically.
  • The migration's DEFAULT 1 design is well-thought-out: existing rows and non-CreateGroup insert paths all get channel_synced=1 without any data migration.

5. Data-Flow Trace

Consumed Data Upstream Source Flows Correctly?
channel_synced=0 in reconcile query markChannelPendingTx β†’ tx.Update SET channel_synced=0 inside CreateGroup tx, before commit βœ… β€” committed atomically with group row
channel_synced=1 flip MarkChannelSynced β†’ session.Update SET channel_synced=1 after IM create success βœ… β€” failure is non-blocking, reconcile retries
Subscribers in reconcile IM create GetSubscribableMemberUIDs β†’ querySubscribableMemberUIDsWithGroupNo (is_deleted=0 AND status=Normal) βœ… β€” excludes blacklist/deleted, same source as 1module.go Subscribers callback
groupNos in reconcile scan QueryReconcilableGroupNos β†’ SELECT group_no FROM group WHERE channel_synced=0 AND status=1 AND created_at < grace βœ… β€” correctly filters by status and grace window
imCreateChannel injection Service.imCreateChannel field, default = ctx.IMCreateOrUpdateChannel in NewService βœ… β€” tests inject mock, production uses real IM call
DB default channel_synced=1 for non-CreateGroup paths Migration DEFAULT 1 + Model struct has no ChannelSynced field β†’ AttrToUnderscore omits it βœ… β€” verified event.go:173, event.go:315, service.go:194 all use Model-based Insert

6. R5 Blind-Spot Checklist (security_sensitive)

  • C1 β€” Dual-path parity: N/A. This PR adds a create-side safety flag; there is no symmetric "un-create" path. The compensating delete is best-effort and unchanged in structure. The reconcile worker only creates (never deletes) channels β€” asymmetric by design.
  • C2 β€” Control-flow ordering / nesting reuse: Clear. markChannelPendingTx runs inside the tx before commit (correct ordering). MarkChannelSynced runs after IM success (correct). The Lua CAS-DEL lock release is the standard safe pattern. The fmt.Sprintf SQL interpolation uses %d with an int β€” safe for current code but fragile (see P2-1).
  • C3 β€” Authorization boundary β‰  capability boundary: N/A. The reconciler is a system-internal background worker with no user-facing endpoint. It operates on DB rows, not user requests. No new API surface is exposed.
  • C4 β€” Authorization lifecycle / container-member cascade: N/A. This PR does not touch authorization or access control. The channel_synced flag is an internal operational flag, not an authorization gate.

7. Cross-Round Blocker Recheck

N/A β€” first review of this PR.


[Octo-Q] verdict: APPROVE

No P0/P1 findings. The PR correctly addresses the orphan-group problem with a well-designed two-layer defense (in-tx pending flag + idempotent reconcile worker). Data-flow tracing confirms all consumed data reaches its consumption point correctly. The two P2 findings (SQL interpolation style, Redis client leak) are maintainability improvements that don't block landing.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review β€” PR #410 (octo-server)

Scope reviewed: head de9f843c83256ae41be23b8fe6474e4452fd391e against base main (merge-base fe8da596). 7 files, +736/-8. go build ./... green; the PR's new tests (TestChannelReconcile_*, TestCreateGroup_Success_MarksChannelSynced, TestCreateGroup_IMFail_CompensatingDeleteRemovesGroup) pass in isolation. (The full modules/group package hits a pre-existing Error 1040: Too many connections at test-server setup in an unrelated test β€” an environment connection-cap issue, not introduced by this PR.)

This was reviewed with extra care as a security-sensitive change. The new Redis tick-lock is the only security-relevant surface and it is implemented correctly (SET NX EX acquire + token-aware Lua CAS-DEL release), matching the established pattern in modules/oidc.

Verdict: APPROVED

The change is sound and strictly improves on the status quo: a create-time IM failure used to leave a permanently orphaned, queryable group; it now leaves a detectable, self-healing one (channel_synced=0 β†’ reconcile). The migration is backward-compatible (DEFAULT 1, guarded/idempotent procedure pattern), the insert/flag are correctly transactional, and the design (forward-recovery via an idempotent create-or-update) is internally consistent and well-tested.

No P0/P1 blocker was confirmed. The findings below are P2 hardening items and one design decision that merits explicit human product sign-off (this PR carries needs-human-review).


Findings

β‘  Reconcile vs. teardown race β€” reconciler can recreate a channel for a group being torn down (P2)

modules/group/channel_reconcile.go:236-267

QueryReconcilableGroupNos filters status = GroupStatusNormal, so disbanded groups are normally excluded. But within a single tick there is a TOCTOU window: a candidate selected at scan time can be disbanded (or its rollback delete can land) between selection and the imCreateChannel call, after which the worker recreates the IM channel from a stale member snapshot and treats the subsequent zero-row MarkChannelSynced as success.

  • Impact: a stray IM channel for a group that should no longer have one. No data loss, bounded to one tick's per-item window, and the precondition is an already-rare orphan.
  • Suggested hardening: re-read status/channel_synced (ideally a conditional UPDATE ... WHERE channel_synced=0 AND status=Normal) immediately before recreating, and act on RowsAffected from MarkChannelSynced instead of discarding it.

β‘‘ Compensating-delete vs. concurrent recovery β€” unconditional delete can orphan a just-recovered channel (P2)

modules/group/service.go:1291-1315

If the CreateGroup IM-create call blocks past the grace window (β‰₯120s) and then returns an error, another instance's reconciler may have already recreated the channel and flipped channel_synced=1. The failure branch then deletes the rows unconditionally, leaving the recovered IM channel orphaned and returning a false "create failed."

  • Impact: reverse-orphan (channel without group) plus a misleading error, only under extreme IM latency (>grace). Narrow and degradation-only.
  • Suggested hardening: make the compensating delete conditional (DELETE ... WHERE channel_synced=0), so it no-ops once a row has been recovered.

β‘’ Worker lifecycle is never shut down (P2)

modules/group/api.go:90-122, modules/group/channel_reconcile.go:188-193

startChannelReconciler() launches a ticker goroutine with context.Background() and constructs a dedicated Redis client, but the group module registers no Stop in its register.Module (modules/group/1module.go), so neither ChannelReconciler.Stop() nor redisReconcileLock.Close() is ever called on graceful shutdown / module reload.

Note: the in-code rationale ("main.go only calls module.Setup, not module.Start") is inaccurate β€” server.Start() does call module.Start(ctx), and the cited modules/oidc precedent it follows registers Start: o.Init, Stop: o.Close and cleans up both its worker goroutine and its Redis client in Close(). Tests are unaffected (the cfg.Test guard early-returns, so no goroutine/connection leak in the suite); the leak is production-shutdown-only (process is exiting anyway), hence P2 rather than a blocker.

  • Suggested fix: mirror the oidc precedent β€” register Stop on the group module, have it call reconciler.Stop() and close the lock's Redis client.

β‘£ Start() mutates cancel/wg without a mutex (P2, defensive)

modules/group/channel_reconcile.go:220-246

Start() documents itself as safe to call repeatedly but mutates r.cancel/r.wg unsynchronized. In the current wiring it is only ever called once, so this is latent β€” worth a mutex if it ever becomes re-callable from multiple goroutines.


Design decision to confirm (human sign-off)

The reconciler implements forward-recovery: a group whose creation returned an error to the caller (commit succeeded, IM-create + compensating-delete both failed) is later completed into a fully working group by the worker, rather than cleaned up. This is deliberate and documented, and it satisfies the stated acceptance criteria of the linked issue. The alternative β€” cleanup-forward (delete the orphan, honoring the failure the user already saw) β€” is also defensible. Because this is the central product trade-off of the change and the PR is flagged for human review, the owner should explicitly confirm "resurrect" over "clean up" is the intended semantics.


Notes on the review process

Three independent review passes were run and reconciled. One cross-pass finding ("the group insert commits outside the transaction with the default channel_synced=1") was rejected on verification: service.go:1129 uses s.db.InsertTx(&Model{...}, tx), and markChannelPendingTx runs in the same tx, so insert + flag commit atomically β€” there is no non-transactional pre-commit window. A second pass framed findings β‘ /β‘‘ as P0 data-corruption; on evidence review they are narrow, bounded, non-data-loss edge cases that improve on the baseline (which had permanent orphans), so they are recorded here as P2.

Coverage / not assessed from the diff

  • Runtime idempotency of ctx.IMCreateOrUpdateChannel is taken on the PR's word (create-or-update) β€” not verified against the WuKongIM-side implementation here.
  • Production behavior of the Redis tick-lock under real multi-instance contention and Redis failover was reasoned about, not exercised.
  • Metric cardinality is fine (fixed label sets, no per-group labels); /metrics exposure wiring is out of this diff.

@yujiawei

Copy link
Copy Markdown
Contributor

Heads-up on the red Test check (non-blocking, not a logic failure): the failure is Error 1040: Too many connections in the modules/group package (TestGroupInviteDetail_IncludesSpaceName_Anonymous this run; the exact victim test varies run-to-run). Root cause is the shared test harness β€” testutil.NewTestServer() opens a fresh MySQL pool per call and never closes it, so the package's now-233 test functions (this PR adds 7) can exceed the CI MySQL max_connections. The new tests here pass cleanly in isolation, and main is green. Likely needs either a small bump to CI max_connections headroom or a harness fix to close pools in teardown β€” flagging so the 1040 isn't mistaken for a defect in this change.

@yujiawei

Copy link
Copy Markdown
Contributor

One additional non-blocking note (does not change the approval): QueryReconcilableGroupNos (modules/group/db.go:75) builds the grace-window predicate with fmt.Sprintf("created_at < DATE_SUB(NOW(), INTERVAL %d SECOND)", graceSeconds) rather than ? parameterization. It's safe today because graceSeconds is typed int, but it diverges from the parameterized style used elsewhere in this file and would silently become injectable if the parameter type ever changed. Suggest Where("created_at < DATE_SUB(NOW(), INTERVAL ? SECOND)", graceSeconds) to keep it consistent and future-proof.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CreateGroup compensating delete is non-atomic β†’ orphan group rows (needs async reconcile)

3 participants