fix(group): atomic/reconciled group create β no permanent orphan rows (#394)#410
fix(group): atomic/reconciled group create β no permanent orphan rows (#394)#410lml2468 wants to merge 1 commit into
Conversation
β¦#394) group.CreateGroup created the IM channel after tx.Commit(); on IM-create failure it ran a best-effort, non-atomic compensating DELETE that only logged. A DB blip on that delete β or a crash between commit and IM-create β left an orphan group with no backing IM channel, permanently queryable. - Add channel_synced flag (migration, DEFAULT 1 for backward compat) on the group table. CreateGroup writes new rows as 0 in-tx and flips to 1 only after a confirmed IM channel create. Existing rows / other insert paths keep the DB default 1, so they are never mis-detected as orphans. - Keep the compensating delete (narrows the window) but orphans are now detectable via channel_synced=0 instead of silently permanent. - Add an idempotent periodic reconcile worker: finds channel_synced=0 rows older than a grace window, re-creates the IM channel (create-or-update is idempotent) and flips the flag. Optional Redis tick-lock dedups across instances; lock-free fallback stays correct via idempotency. Disabled under cfg.Test; interval/grace tunable via env. - Prometheus metrics for channel-create result, reconcile ticks, orphans detected, and per-orphan outcome. - Inject the IM-create call on Service so failure modes are unit-testable. Tests cover IM-fail compensation, crash-between-commit-and-IM recovery, idempotent re-run, grace window, and disbanded-group skip. Co-Authored-By: Paperclip <noreply@paperclip.ing>
mochashanyao
left a comment
There was a problem hiding this comment.
[Octo-Q Β· automated review]
Verdict: Approve β no blocking findings; notes below (data-flow traced).
octo-server PR#410 Review Report β automated review
Reviewer: Octo-Q (automated review)
PR: #410
Head SHA: de9f843c83256ae41be23b8fe6474e4452fd391e
Title: fix(group): atomic/reconciled group create β no permanent orphan rows (#394)
Changed files: 7 (+736 / -8)
1. Verification Summary
| Item | Status | Evidence |
|---|---|---|
| channel_synced column migration (idempotent, DEFAULT 1) | β | sql/20260617000001_group_channel_synced.sql β INFORMATION_SCHEMA guard + stored proc, matches project pattern (#239/#253) |
| Model struct NOT carrying channel_synced | β | db.go:622-649 β no ChannelSynced field; AttrToUnderscore won't leak it into other Insert paths |
| Other Insert paths keep DB default 1 | β | event.go:173 (system group), event.go:315 (registered group), service.go:194 (AddGroup) β all use InsertTx/Insert with Model struct, no markChannelPendingTx call |
| markChannelPendingTx inside tx before commit | β | service.go:1148-1152 β called after InsertTx, before tx.Commit() at line 1261 |
| MarkChannelSynced after IM create success | β | service.go:1316-1319 β after imCreateChannel returns nil |
| Compensating delete preserved on IM fail | β | service.go:1299-1312 β delete group_member then group, same as before |
| Reconciler queries only GroupStatusNormal | β | db.go:78-85 β WHERE channel_synced=0 AND status=? with GroupStatusNormal; disbanded/disabled excluded |
| Grace window prevents racing in-flight creates | β | db.go:82 β created_at < DATE_SUB(NOW(), INTERVAL ? SECOND) |
| IM create is idempotent (IMCreateOrUpdateChannel) | β | config.Context interface β create-or-update semantics, safe for reconcile replay |
| GetSubscribableMemberUIDs excludes blacklist/deleted | β | db.go:530-537 β is_deleted=0 AND status=GroupMemberStatusNormal |
| Redis lock: SET NX EX + Lua CAS-DEL | β | channel_reconcile.go:82-86 (Acquire), channel_reconcile.go:72-77 (Lua release script) |
| Lock failure degrades to lock-free (idempotent) | β | channel_reconcile.go:196-199 β logs warning, continues |
| Test coverage | β | 7 tests covering happy path, IM fail + compensating delete, reconcile recovery, idempotency, grace window, disbanded skip, IM retry |
| Test mode disables worker | β | api.go:89 β if g.ctx.GetConfig().Test { return } |
2. Findings
P2-1: QueryReconcilableGroupNos uses fmt.Sprintf for SQL interpolation
File: db.go:82
Diff-scope: new (introduced by this PR)
Where(fmt.Sprintf("created_at < DATE_SUB(NOW(), INTERVAL %d SECOND)", graceSeconds))graceSeconds is int so %d is safe from SQL injection today. However, this pattern is fragile β a future refactor changing the parameter type to string would silently introduce a SQL injection. The project's other queries consistently use ? parameterization.
Recommendation: Use dbr.Expr("created_at < DATE_SUB(NOW(), INTERVAL ? SECOND)", graceSeconds) or a raw Where("created_at < DATE_SUB(NOW(), INTERVAL ? SECOND)", graceSeconds) to keep it parameterized.
P2-2: Redis lock client never closed
File: channel_reconcile.go:93-99, api.go:119
Diff-scope: new (introduced by this PR)
newRedisReconcileLock creates a dedicated rd.Client via rd.NewClient(...), but:
ChannelReconcilerstores only thereconcileTickLockinterface, not the concrete*redisReconcileLock.redisReconcileLock.Close()exists but is never called.Groupstruct storesreconciler *ChannelReconcilerbut has no shutdown hook to close the lock's Redis connection pool.
On process exit the OS reclaims sockets, so this is not a production correctness issue. But it leaks FDs in long-running integration tests or if the module is ever hot-reloaded.
Recommendation: Either (a) have ChannelReconciler store the concrete lock and expose a Close() that calls lock.Close(), invoked from a module shutdown hook, or (b) reuse the existing shared Redis client from octoredis instead of creating a dedicated one.
Nit-1: No retry cap for permanently-failing orphans
File: channel_reconcile.go:228-267
Diff-scope: new
If the IM service is permanently unavailable for a specific group, the reconciler retries every 2 minutes indefinitely. This generates continuous im_fail metrics and log noise. Consider adding a retry_count column or a last_attempt_at timestamp to implement exponential backoff or a max-retry cutoff with alerting.
Nit-2: Partial compensating delete edge case
File: service.go:1303-1312
Diff-scope: pre-existing (amplified visibility)
If IM creation fails β compensating delete of group_member succeeds β compensating delete of group fails, the group row remains with channel_synced=0 and zero members. The reconciler will then:
- Call
GetSubscribableMemberUIDsβ returns empty - Mark
channel_synced=1(skip)
Result: a queryable group with zero members and an unnecessary IM channel. This is a pre-existing edge case (old code left the same orphan permanently), and the PR's behavior is arguably better (at least the channel exists), but worth noting.
3. Recommendations
- Parameterize the grace-seconds SQL (P2-1) β low-effort, high-value defensive coding.
- Wire up Redis lock cleanup (P2-2) β either close on shutdown or reuse the shared client.
- Consider a retry cap / backoff for the reconciler (Nit-1) as a follow-up.
4. Additional Observations
- The
event.gosystem-group and registered-group creation paths create the IM channel beforetx.Commit()and rollback on failure β they have no orphan window and correctly don't needchannel_syncedtreatment. The PR's scope (onlyCreateGroupinservice.go) is correct. - The
integration/api_groups.gopath callsgroupService.CreateGroup()and thus inherits the fix automatically. - The migration's
DEFAULT 1design is well-thought-out: existing rows and non-CreateGroup insert paths all getchannel_synced=1without any data migration.
5. Data-Flow Trace
| Consumed Data | Upstream Source | Flows Correctly? |
|---|---|---|
channel_synced=0 in reconcile query |
markChannelPendingTx β tx.Update SET channel_synced=0 inside CreateGroup tx, before commit |
β β committed atomically with group row |
channel_synced=1 flip |
MarkChannelSynced β session.Update SET channel_synced=1 after IM create success |
β β failure is non-blocking, reconcile retries |
Subscribers in reconcile IM create |
GetSubscribableMemberUIDs β querySubscribableMemberUIDsWithGroupNo (is_deleted=0 AND status=Normal) |
β β excludes blacklist/deleted, same source as 1module.go Subscribers callback |
groupNos in reconcile scan |
QueryReconcilableGroupNos β SELECT group_no FROM group WHERE channel_synced=0 AND status=1 AND created_at < grace |
β β correctly filters by status and grace window |
imCreateChannel injection |
Service.imCreateChannel field, default = ctx.IMCreateOrUpdateChannel in NewService |
β β tests inject mock, production uses real IM call |
DB default channel_synced=1 for non-CreateGroup paths |
Migration DEFAULT 1 + Model struct has no ChannelSynced field β AttrToUnderscore omits it |
β β verified event.go:173, event.go:315, service.go:194 all use Model-based Insert |
6. R5 Blind-Spot Checklist (security_sensitive)
- C1 β Dual-path parity: N/A. This PR adds a create-side safety flag; there is no symmetric "un-create" path. The compensating delete is best-effort and unchanged in structure. The reconcile worker only creates (never deletes) channels β asymmetric by design.
- C2 β Control-flow ordering / nesting reuse: Clear.
markChannelPendingTxruns inside the tx before commit (correct ordering).MarkChannelSyncedruns after IM success (correct). The Lua CAS-DEL lock release is the standard safe pattern. Thefmt.SprintfSQL interpolation uses%dwith anintβ safe for current code but fragile (see P2-1). - C3 β Authorization boundary β capability boundary: N/A. The reconciler is a system-internal background worker with no user-facing endpoint. It operates on DB rows, not user requests. No new API surface is exposed.
- C4 β Authorization lifecycle / container-member cascade: N/A. This PR does not touch authorization or access control. The
channel_syncedflag is an internal operational flag, not an authorization gate.
7. Cross-Round Blocker Recheck
N/A β first review of this PR.
[Octo-Q] verdict: APPROVE
No P0/P1 findings. The PR correctly addresses the orphan-group problem with a well-designed two-layer defense (in-tx pending flag + idempotent reconcile worker). Data-flow tracing confirms all consumed data reaches its consumption point correctly. The two P2 findings (SQL interpolation style, Redis client leak) are maintainability improvements that don't block landing.
yujiawei
left a comment
There was a problem hiding this comment.
Code Review β PR #410 (octo-server)
Scope reviewed: head de9f843c83256ae41be23b8fe6474e4452fd391e against base main (merge-base fe8da596). 7 files, +736/-8. go build ./... green; the PR's new tests (TestChannelReconcile_*, TestCreateGroup_Success_MarksChannelSynced, TestCreateGroup_IMFail_CompensatingDeleteRemovesGroup) pass in isolation. (The full modules/group package hits a pre-existing Error 1040: Too many connections at test-server setup in an unrelated test β an environment connection-cap issue, not introduced by this PR.)
This was reviewed with extra care as a security-sensitive change. The new Redis tick-lock is the only security-relevant surface and it is implemented correctly (SET NX EX acquire + token-aware Lua CAS-DEL release), matching the established pattern in modules/oidc.
Verdict: APPROVED
The change is sound and strictly improves on the status quo: a create-time IM failure used to leave a permanently orphaned, queryable group; it now leaves a detectable, self-healing one (channel_synced=0 β reconcile). The migration is backward-compatible (DEFAULT 1, guarded/idempotent procedure pattern), the insert/flag are correctly transactional, and the design (forward-recovery via an idempotent create-or-update) is internally consistent and well-tested.
No P0/P1 blocker was confirmed. The findings below are P2 hardening items and one design decision that merits explicit human product sign-off (this PR carries needs-human-review).
Findings
β Reconcile vs. teardown race β reconciler can recreate a channel for a group being torn down (P2)
modules/group/channel_reconcile.go:236-267
QueryReconcilableGroupNos filters status = GroupStatusNormal, so disbanded groups are normally excluded. But within a single tick there is a TOCTOU window: a candidate selected at scan time can be disbanded (or its rollback delete can land) between selection and the imCreateChannel call, after which the worker recreates the IM channel from a stale member snapshot and treats the subsequent zero-row MarkChannelSynced as success.
- Impact: a stray IM channel for a group that should no longer have one. No data loss, bounded to one tick's per-item window, and the precondition is an already-rare orphan.
- Suggested hardening: re-read
status/channel_synced(ideally a conditionalUPDATE ... WHERE channel_synced=0 AND status=Normal) immediately before recreating, and act onRowsAffectedfromMarkChannelSyncedinstead of discarding it.
β‘ Compensating-delete vs. concurrent recovery β unconditional delete can orphan a just-recovered channel (P2)
modules/group/service.go:1291-1315
If the CreateGroup IM-create call blocks past the grace window (β₯120s) and then returns an error, another instance's reconciler may have already recreated the channel and flipped channel_synced=1. The failure branch then deletes the rows unconditionally, leaving the recovered IM channel orphaned and returning a false "create failed."
- Impact: reverse-orphan (channel without group) plus a misleading error, only under extreme IM latency (>grace). Narrow and degradation-only.
- Suggested hardening: make the compensating delete conditional (
DELETE ... WHERE channel_synced=0), so it no-ops once a row has been recovered.
β’ Worker lifecycle is never shut down (P2)
modules/group/api.go:90-122, modules/group/channel_reconcile.go:188-193
startChannelReconciler() launches a ticker goroutine with context.Background() and constructs a dedicated Redis client, but the group module registers no Stop in its register.Module (modules/group/1module.go), so neither ChannelReconciler.Stop() nor redisReconcileLock.Close() is ever called on graceful shutdown / module reload.
Note: the in-code rationale ("main.go only calls module.Setup, not module.Start") is inaccurate β server.Start() does call module.Start(ctx), and the cited modules/oidc precedent it follows registers Start: o.Init, Stop: o.Close and cleans up both its worker goroutine and its Redis client in Close(). Tests are unaffected (the cfg.Test guard early-returns, so no goroutine/connection leak in the suite); the leak is production-shutdown-only (process is exiting anyway), hence P2 rather than a blocker.
- Suggested fix: mirror the oidc precedent β register
Stopon the group module, have it callreconciler.Stop()and close the lock's Redis client.
β£ Start() mutates cancel/wg without a mutex (P2, defensive)
modules/group/channel_reconcile.go:220-246
Start() documents itself as safe to call repeatedly but mutates r.cancel/r.wg unsynchronized. In the current wiring it is only ever called once, so this is latent β worth a mutex if it ever becomes re-callable from multiple goroutines.
Design decision to confirm (human sign-off)
The reconciler implements forward-recovery: a group whose creation returned an error to the caller (commit succeeded, IM-create + compensating-delete both failed) is later completed into a fully working group by the worker, rather than cleaned up. This is deliberate and documented, and it satisfies the stated acceptance criteria of the linked issue. The alternative β cleanup-forward (delete the orphan, honoring the failure the user already saw) β is also defensible. Because this is the central product trade-off of the change and the PR is flagged for human review, the owner should explicitly confirm "resurrect" over "clean up" is the intended semantics.
Notes on the review process
Three independent review passes were run and reconciled. One cross-pass finding ("the group insert commits outside the transaction with the default channel_synced=1") was rejected on verification: service.go:1129 uses s.db.InsertTx(&Model{...}, tx), and markChannelPendingTx runs in the same tx, so insert + flag commit atomically β there is no non-transactional pre-commit window. A second pass framed findings β /β‘ as P0 data-corruption; on evidence review they are narrow, bounded, non-data-loss edge cases that improve on the baseline (which had permanent orphans), so they are recorded here as P2.
Coverage / not assessed from the diff
- Runtime idempotency of
ctx.IMCreateOrUpdateChannelis taken on the PR's word (create-or-update) β not verified against the WuKongIM-side implementation here. - Production behavior of the Redis tick-lock under real multi-instance contention and Redis failover was reasoned about, not exercised.
- Metric cardinality is fine (fixed label sets, no per-group labels);
/metricsexposure wiring is out of this diff.
|
Heads-up on the red |
|
One additional non-blocking note (does not change the approval): |
Summary
Fixes #394.
group.CreateGroupcreated the IM channel aftertx.Commit(); on IM-create failure it ran a best-effort, non-atomic compensatingDELETEthat only logged. A DB blip on that delete β or a crash between commit and IM-create β left an orphan group with no backing IM channel, permanently queryable.channel_syncedflag (migration,DEFAULT 1for backward compat).CreateGroupwrites new rows as0in-tx and flips to1only after a confirmed IM channel create. Existing rows / other insert paths keep the DB default1, so they're never mis-detected as orphans.channel_synced=0instead of silently permanent.channel_synced=0rows older than a grace window, re-creates the IM channel (IMCreateOrUpdateChannelis create-or-update β idempotent) and flips the flag. Optional Redis tick-lock dedups across instances; lock-free fallback stays correct via idempotency. Disabled undercfg.Test; interval/grace tunable viaDM_GROUP_CHANNEL_RECONCILE_INTERVAL_SEC/_GRACE_SEC.Serviceso failure modes are unit-testable.Acceptance (#394)
API contract
No REST/WS contract change.
channel_syncedis internal; the group-create response shape is unchanged.Test plan
TestCreateGroup_Success_MarksChannelSyncedβ happy path flips flag to 1TestCreateGroup_IMFail_CompensatingDeleteRemovesGroupβ IM-fail + delete-OK leaves no orphanTestChannelReconcile_RecreatesChannelAndFlipsFlagβ crash-between-commit-and-IM / delete-fail residue recoveredTestChannelReconcile_Idempotentβ second run is a no-opTestChannelReconcile_RespectsGraceWindowβ in-flight creates not racedTestChannelReconcile_SkipsDisbandedGroupβ disbanded channels not recreatedTestChannelReconcile_IMFailLeavesOrphanForRetryβ flag stays 0 for retry on IM failurego test ./modules/group/green;go build ./...greenπ€ OCT-16 (sub-task of OCT-10)