Skip to content

fix: collation repair for space_member/user tables (#482)#489

Draft
lml2468 wants to merge 1 commit into
mainfrom
fix/issue-482-collation-mismatch
Draft

fix: collation repair for space_member/user tables (#482)#489
lml2468 wants to merge 1 commit into
mainfrom
fix/issue-482-collation-mismatch

Conversation

@lml2468

@lml2468 lml2468 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

根因 / 实现说明

  • queryMembers (modules/space/db.go:158) 的 LEFT JOIN user_verification uv ON uv.user_id=sm.uidutf8mb4_0900_ai_ci 默认 collation 的部署上触发 Error 1267(space_member 继承 DB 默认 collation,user_verification 被 compat-repair 迁移强制为 general_ci
  • 原迁移 (20260512) 的修复清单遗漏了 space_memberuser

改动摘要

  • 新增迁移 20260627000001_base_collation_repair_space_user.sql:将 space_memberuser 转换为 utf8mb4_general_ci(存储过程模式,幂等,遵循 20260512 的已有模式)
  • 防御性 COLLATEqueryMembersuv JOIN 条件加 COLLATE utf8mb4_general_ci,在迁移运行前也能正常工作

测试

  • 构建验证: go build ./...
  • 复现测试: 需要 0900_ai_ci 环境的 MySQL 实例(CI 的 general_ci 环境无法覆盖)
  • 验证 SQL: SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME IN ('space_member','user_verification','user');

验证 (baseline diff, charter §6)

  • go build ./...: before = pass, after = pass, delta = 0 new failures
  • 实际端点验证需在非规范 collation 部署上进行

Fixes #482

The original compat-repair migration (20260512) converted user_verification
to utf8mb4_general_ci but missed space_member and user. On deployments where
the DB default is utf8mb4_0900_ai_ci, the queryMembers LEFT JOIN on
user_verification triggers Error 1267 (Illegal mix of collations), causing
the member list endpoint to return 500.

Two-part fix:
1. New migration adds space_member and user to the collation repair
   (root cause — eliminates the collation asymmetry for all JOINs)
2. Explicit COLLATE utf8mb4_general_ci on the uv JOIN condition as
   defense-in-depth (survives until the migration runs)

Fixes #482
Jerry-Xin
Jerry-Xin previously approved these changes Jun 27, 2026

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE — collation repair correctly makes the joined columns match on both sides; idempotent, ordering-safe, no FK/uniqueness/data-loss risk.

This PR addresses the MySQL 8 collation drift behind #482 / #420. On deployments where the DB default is utf8mb4_0900_ai_ci, space_member (created without an explicit COLLATE) inherited that default, while user_verification is explicitly utf8mb4_general_ci, so queryMembers' LEFT JOIN user_verification raised Error 1267 (Illegal mix of collations).

Core correctness (verified at head 051815722a160a3ef722ba274026f8f1b0e280b9):

  • user_verification DDL (modules/user/sql/20260505000003_user_legacy01.sql) is explicitly COLLATE=utf8mb4_general_ci.
  • The migration converts space_member and userutf8mb4_general_ci, i.e. it aligns the previously-drifted side to the join counterpart. Both sides of the sm.uid = uv.user_id comparison now resolve to general_ci. ✅ The repair makes the joined columns match, not just change one side.
  • space_member source DDL has no explicit COLLATE (confirms the inheritance bug); user source DDL is already general_ci, so the user ALTER is a no-op on standard installs and only acts on drifted ones.

🔴 Blocking — None.

💬 Non-blocking

🟡 modules/space/db.go:161 — The defensive COLLATE utf8mb4_general_ci pin only covers the uv join. The adjacent LEFT JOIN user u ON u.uid=sm.uid is not pinned, so in the narrow window of "new binary, migration not yet applied, drifted schema" the user join could still hit Error 1267 before the migration repairs space_member. Non-blocking: the new migration fixes both tables and runs at normal startup; this only affects the migrations-disabled edge case.

🔵 modules/base/sql/20260627000001_base_collation_repair_space_user.sql — Consider a lightweight migration-content guard test (as the repo does elsewhere) pinning that both space_member and user stay in the repair list.

✅ Highlights

  • Idempotent: each ALTER is gated by an INFORMATION_SCHEMA.TABLES collation check inside a stored procedure, mirroring the established 20260512000001_base_oss_compat_repair.sql pattern; safe to re-run (sql-migrate also tracks it once in gorp_migrations).
  • Version 20260627000001 is unique and lexicographically ordered after 20260614, so it runs after the legacy user / space_member / user_verification creation migrations (global timestamp ordering).
  • No FK constraints exist on space_member/user, so no FK-collation ALTER failure. CONVERT TO CHARACTER SET utf8mb4→utf8mb4 is metadata-only at row level; general_ci folds no more aggressively than 0900_ai_ci on the ASCII UID keys, so no duplicate-key collision on the UNIQUE(space_id,uid) index.

⚠️ Operational note (matches the prior repair's own caveat): ALTER TABLE ... CONVERT TO CHARACTER SET still triggers a COPY rebuild + table-level metadata lock; on a very large space_member/user table, apply off-hours. Not a code blocker.

@OctoBoooot OctoBoooot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix: collation repair for space_member/user tables (#489)

Verdict: Needs author input — one load-bearing question (symmetric collation risk on the robot JOIN) that, if unaddressed, turns this fix into the inverse of the bug it fixes. PR is a draft, so this is feedback-while-in-progress, not a merge gate. CI green.

Risk tier: schema migration touching the user hot table. Fix shape (idempotent stored-proc + defensive COLLATE) is sound; the open question is whether the space_member collation flip creates a new mismatch on a sibling JOIN.

Headline question — does flipping space_member break the robot JOIN? ([bidirectional-edge-construction])

The bug: space_member (inherits DB default, possibly utf8mb4_0900_ai_ci) JOINed to user_verification (forced general_ci by 20260512) → Error 1267. The fix flips space_membergeneral_ci.

But queryMembers (modules/space/db.go:158-164) JOINs space_member to three tables:

LEFT JOIN user u ON u.uid=sm.uid                                          -- user: fixed by this migration ✓
LEFT JOIN user_verification uv ON uv.user_id=sm.uid COLLATE general_ci    -- defensive COLLATE added ✓
LEFT JOIN robot r ON r.robot_id=sm.uid                                    -- robot: NOT pinned, NOT in repair list ⚠

The robot table is NOT in 20260512's repaired list (that list has robot_apply, not robot) and is NOT converted by this migration. If robot currently inherits 0900_ai_ci (same default that caused the original bug), then:

  • Before this PR: space_member(0900) JOIN robot(0900) = same collation = OK.
  • After this PR: space_member(general_ci) JOIN robot(0900) = NEW Error 1267 on the robot JOIN.

So the fix could move the 1267 from the uv JOIN to the robot JOIN — the symmetric bug. The defensive COLLATE was added only to uv, not robot. I couldn't byte-confirm robot's collation from its legacy DDL (modules/robot/sql/*_robot_legacy01.sql — no explicit COLLATE found, which means it inherits the DB default = the danger case).

Please confirm: on a 0900_ai_ci deployment, after this migration, does space_member ⨝ robot still match? If robot isn't general_ci, either (a) add robot to this migration's convert list, or (b) add the same defensive COLLATE utf8mb4_general_ci to the robot JOIN. The whole-table convert of space_member changes its collation relative to every table it joins, not just uv — so the fix needs to be verified against all three JOIN partners, not just the one that reported the error.

Verified — the parts that are right

  • Idempotent stored-proc migration follows the established 20260512 pattern (skip-if-already-target via TABLE_COLLATION check, DROP PROCEDURE before+after). Won't double-convert on re-run.
  • Defensive COLLATE on the uv JOIN (db.go:162) makes the fix work before the migration runs — good "works during the rollout window" thinking. Comment names the issue (#482) and the rationale.
  • Description-vs-diff ([description-vs-diff-crosscheck]): the body claims 20260512 "missed space_member and user" — verified, the 20260512 list (lines 50-66) does NOT contain space_member or user (it has user_verification, user_api_key, user_pinned_channel, etc. but not bare user/space_member). Claim accurate.
  • Fixes #482 links the P0 regression; the diff plausibly addresses it (collation asymmetry on the queryMembers JOIN).

Major — user is a hot table; CONVERT TO locks it

ALTER TABLE user CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci rewrites every row and holds a metadata lock for the duration — on a large user table this can be a multi-second-to-minutes write-blocking operation during migration. Per skill §3 (migration safety), this warrants either:

  • A note in the PR body on expected user table size + lock duration in the target deployment, OR
  • Confirmation it runs in a maintenance window.
    CONVERT TO CHARACTER SET is NOT an INSTANT operation (unlike an end-of-table ADD COLUMN) — it's a full table rebuild. For space_member (likely smaller) the risk is lower; for user it's the real operational cost. Not a code defect, but the migration's blast radius should be stated.

Praise

  • Defensive COLLATE + migration belt-and-suspenders — the author didn't just add the migration (which fixes it after it runs) OR just the query COLLATE (which fixes it without a schema change); they did both, so the fix works during the rollout window (query-level) AND eliminates the asymmetry permanently (schema-level). That's the right shape for a collation bug that bites mid-deploy.
  • Idempotent stored-proc pattern reused, not reinvented — follows 20260512's exact structure (TABLE_COLLATION guard, MAX() lookup form). Consistency with the existing repair migration means the same operational expectations apply.
  • Root-cause writeup names the exact mechanism — DB-default 0900_ai_ci vs forced general_ci on user_verification, the specific JOIN, the Error 1267, the #420 precedent. A reviewer can verify the diagnosis without re-deriving it.

Recommended before un-drafting

  1. Resolve the robot JOIN question — confirm robot's collation post-migration, and either convert it or pin its JOIN. This is the symmetric-bug risk and the load-bearing item.
  2. Add a one-line note on user table CONVERT lock duration / maintenance-window expectation.
  3. Optional: sweep the other space_member ⨝ user/robot JOINs (bot_api/groups.go:335,340, search/api.go:316, space/db.go:355) — they share the same tables and could hit the same 1267 on a 0900 deployment; worth confirming they're covered by the user/space_member conversion.

@mochashanyao mochashanyao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Octo-Q · automated review]

Verdict: Request changes — blocking findings below (data-flow traced).


octo-server PR#489 Review Report

Reviewer: Octo-Q (automated review)
Head SHA: 051815722a160a3ef722ba274026f8f1b0e280b9
PR Title: fix: collation repair for space_member/user tables (#482)
Changed files: 2 (+58 −1)


1. Verification Summary

Item Status Evidence
Migration SQL syntax Stored procedure with idempotent IF guard, MAX(TABLE_COLLATION) avoids shared-handler pitfall
Migration pattern fidelity Follows 20260512000001_base_oss_compat_repair.sql pattern (proc + DROP before/after + no-op Down)
COLLATE placement in queryMembers sm.uid COLLATE utf8mb4_general_ci correctly pins comparison collation; coercibility 1 > 2, so general_ci wins
Defensive COLLATE before migration runs Both sides resolve to general_ci — no Error 1267
Migration after it runs (uv JOIN) Both space_member and user_verification at general_ci — no Error 1267
Migration after it runs (user JOIN) Both space_member and user at general_ci — no Error 1267
Migration after it runs (robot JOIN) NEW mismatch introduced — see P1 below

2. Findings

🔴 P1 — Migration converts space_member to general_ci but excludes robot table → breaks space_member JOIN robot on MySQL 8 default deployments

Severity: P1 (blocking)
Diff-scope: NEW — introduced by this PR's migration. Before the migration both tables are at DB default (same collation → JOIN works). After, they diverge.

Root cause analysis (data-flow trace)

  1. robot table DDL (modules/robot/sql/20210926000001_robot_legacy01.sql:6):

    create table `robot` (
      ...
      robot_id VARCHAR(40) not null default '',
      ...
    );

    No explicit CHARSET or COLLATE — inherits DB default.

  2. 20260512 repair (modules/base/sql/20260512000001_base_oss_compat_repair.sql:57-63): repair list covers 17 tables including robot_apply, but robot is NOT in the list. No subsequent migration adds COLLATE to robot.

  3. After this PR's migration runs (on MySQL 8 default utf8mb4_0900_ai_ci):

    • space_memberutf8mb4_general_ci (converted by this PR)
    • robotutf8mb4_0900_ai_ci (inherits DB default, NOT repaired)
  4. MySQL coercibility rule: both sides have coercibility 2 (string columns with implicit collations). Different collations at equal coercibility → Error 1267: Illegal mix of collations.

Affected queries (ALL currently working → ALL broken after migration)

Query File:Line JOIN Impact
queryMembers modules/space/db.go:163 LEFT JOIN robot r ON r.robot_id=sm.uid Member list endpoint
querySpaceBots modules/space/db.go:354 INNER JOIN robot r ON r.robot_id=sm.uid Bot list in Space
bot listing modules/robot/api.go:1290 INNER JOIN robot r ON r.robot_id = sm.uid Robot API
integration lookup modules/integration/db.go:159 INNER JOIN space_member sm ON sm.uid=r.robot_id Integration
integration lookup modules/integration/db.go:269,298 same pattern Integration
botfather check modules/botfather/db.go:78 INNER JOIN space_member sm ON sm.uid=r.robot_id BotFather

Note: queryMembers at line 162 has defensive COLLATE on the uv JOIN but NOT on the robot JOIN at line 163.

R1 test

This makes currently working production paths unavailable (member listing, bot listing, botfather, integrations all return 500). On any MySQL 8 deployment using utf8mb4_0900_ai_ci as default, the migration creates exactly the same class of bug (#482) it's trying to fix — just shifted to a different JOIN pair. Worse than doing nothing for those paths.

Fix options (pick one)

Option A — Include robot in the repair migration (recommended, simplest):
Add a third block to the stored procedure:

  -- Repair robot
  SELECT MAX(TABLE_COLLATION) INTO v_collation
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'robot';
  IF v_collation IS NOT NULL AND v_collation <> 'utf8mb4_general_ci' THEN
    ALTER TABLE `robot` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
  END IF;

Option B — Defensive COLLATE on ALL space_member JOIN robot (belt-and-suspenders):
Same pattern as the uv JOIN fix, but applied to every r.robot_id=sm.uid / sm.uid=r.robot_id across the codebase.

Option C — Verify actual production robot collation before merging:

SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('space_member', 'user', 'robot', 'user_verification');

If robot is already utf8mb4_general_ci (e.g. from an undocumented manual ALTER), no code change needed — but document this in the PR description to close the question.


ℹ️ P2 — Defensive COLLATE only covers queryMembers, not other space_member JOINs (pre-migration window)

Severity: P2 (non-blocking)
Diff-scope: Pre-existing (these JOINs existed before this PR; PR adds protection only for one)

On deployments with MySQL 8 default before the migration runs:

  • space_member JOIN user in searchMembers (db.go:183), countMembers (db.go:204), search/api.go:316, bot_api/groups.go:335,340 — currently safe because BOTH tables inherit DB default.
  • space_member JOIN robot in querySpaceBots (db.go:354) — currently safe, same reason.

These are not broken today, but the asymmetric defensive coverage means the PR's "defense-in-depth before migration" guarantee is incomplete. If a future change converts user or robot to general_ci (without converting space_member first), the unprotected JOINs break.

Recommendation: Consider a broader sweep of defensive COLLATEs on all cross-table uid JOINs involving space_member, or merge the migration first and verify before removing COLLATEs.


3. Recommendations

  1. [Must-fix before merge] Add robot to the repair migration (Option A above), or verify its production collation and document (Option C). Without this, the migration is a net negative for bot-related queries.
  2. [Should-fix] Consider adding defensive COLLATE to the robot JOIN in queryMembers (db.go:163) for the same belt-and-suspenders rationale as the uv JOIN fix.
  3. [Nit] Migration comment at line 7 says "The original repair converted 17 tables … but missed space_member and user" — consider also mentioning robot here (either "also missed robot, verified safe" or "repairing robot too").

4. Additional Findings

  • bot_mention_pref (20260603000001_bot_mention_pref.sql:14) IS created with COLLATE=utf8mb4_general_ci — evidence that newer tables are being properly pinned. The gap is specifically in older tables that predate explicit collation practices.
  • The COLLATE utf8mb4_general_ci in queryMembers is a runtime-only hint — it does NOT alter the column definition. Safe to keep permanently or remove after migration is confirmed on all deployments.

5. Data-Flow Trace

Consumed data Upstream source Flows correctly?
sm.uid in uv JOIN space_member.uid column (DB default → general_ci after migration) ✅ COLLATE pins to general_ci
uv.user_id in uv JOIN user_verification.user_id (explicit general_ci from creation DDL)
u.uid in user JOIN user.uid (DB default → general_ci after migration) ✅ after migration; safe before (both DB default)
r.robot_id in robot JOIN robot.robot_id (DB default, NOT repaired) ❌ after migration: sm.uid=general_ci, r.robot_id=0900_ai_ci → Error 1267
sm.space_id in WHERE space_member.space_id ✅ not affected by collation change
Migration v_collation information_schema.TABLES.TABLE_COLLATION via MAX() ✅ handles missing tables (NULL check)

6. Blind-Spot Checklist (R5)

  • C1 — Dual-path parity: N/A. Migration is unidirectional (no Down logic, consistent with 20260512 pattern). No subscribe/unsubscribe or create/delete symmetry involved.
  • C2 — Control-flow ordering / nesting reuse: N/A. COLLATE is a single-expression modifier; no multi-call-site or nesting risk.
  • C3 — Authorization boundary ≠ capability boundary: N/A. Schema DDL change + read-query collation fix. No auth/authz implications.
  • C4 — Authorization lifecycle / container-member cascade: N/A. No authorization changes.
  • C5 — Build/note ≠ runtime path: HIT — The migration passes go build and static SQL review, but the runtime path for space_member JOIN robot breaks after migration on MySQL 8 default deployments. Build-time validation does not catch collation mismatches. The PR description notes "实际端点验证需在非规范 collation 部署上进行" — correct, but the gap is that the migration itself creates a new mismatch, not just fails to fix an old one.
  • C6 — Governance / policy doc consistency: N/A. Not a governance/policy PR.

7. Cross-Round Blocker Re-Check (R6)

N/A — first review of this PR.


[Octo-Q] verdict: REQUEST_CHANGES

Reasoning: The migration converts space_member to utf8mb4_general_ci but excludes robot (which also inherits DB default, confirmed from source DDL). After migration, all space_member JOIN robot queries (queryMembers:163, querySpaceBots:354, robot/api.go:1290, integration/db.go:159/269/298, botfather/db.go:78) will hit Error 1267 on MySQL 8 default deployments — these are currently working paths that the migration makes unavailable (R1). Fix: add robot to the repair procedure, or verify its production collation before merge.

@Jerry-Xin Jerry-Xin dismissed their stale review June 27, 2026 12:05

Retracting my APPROVE: the migration converts space_member+user to general_ci but not robot, which queryMembers/searchMembers also join on sm.uid; on MySQL-8 deployments this moves Error 1267 to the robot join. Re-reviewing as REQUEST_CHANGES.

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: REQUEST_CHANGES — retracting my earlier APPROVE. I missed the symmetric-collation regression that OctoBoooot and mochashanyao correctly flagged. After reading the live source I can confirm it is real and PR-introduced.

🔴 Blocking — this migration moves Error 1267 from the uv join to the robot join (new regression)

modules/space/db.go queryMembers joins space_member to THREE tables on sm.uid:

  • LEFT JOIN user u ON u.uid=sm.uid (not COLLATE-pinned)
  • LEFT JOIN user_verification uv ON uv.user_id=sm.uid COLLATE utf8mb4_general_ci (pinned — this PR's fix)
  • LEFT JOIN robot r ON r.robot_id=sm.uid (not COLLATE-pinned)

searchMembers likewise joins space_member → user and space_member → robot, neither pinned.

The migration converts space_member and user to utf8mb4_general_ci, but robot is never converted — it is not in this migration, and it was not in the prior 20260512000001 repair (that list converts robot_apply, not robot; the only robot references there are an unrelated idx_robot_bot_token index repair). I checked the robot table DDL in modules/robot/sql/20210926000001_robot_legacy01.sql: create table \robot`is declared with **no explicitCOLLATE/CHARACTER SET**, so it inherits the DB default — exactly like space_member/user` did before this PR.

Consequence on a MySQL-8 deployment (DB default utf8mb4_0900_ai_ci):

  • Before this PR: space_member, user, robot all inherit 0900_ai_cispace_member ⋈ robot MATCHES (the uv join was the only broken one, because user_verification was explicitly converted to general_ci by 20260512 — that is the #482 bug).
  • After this PR: space_membergeneral_ci while robot stays 0900_ai_cispace_member ⋈ robot becomes a NEW Error 1267: Illegal mix of collations, throwing 500 on queryMembers and searchMembers.

So as written the migration does not eliminate the collation fault — it relocates it from the verification join to the robot join.

Fix (either is sufficient; (a) preferred for completeness):

  • (a) Add robot to the converted-table list in this migration (same idempotent INFORMATION_SCHEMA-gated CONVERT TO ... utf8mb4_general_ci proc), so all of space_member's join partners (user, user_verification, robot) are uniformly general_ci. Sweep queryMembers/searchMembers for any other sm.uid join partner and include them too.
  • (b) Or pin COLLATE utf8mb4_general_ci on the robot (and user) join conditions in both queryMembers and searchMembers, mirroring the uv pin — defense-in-depth for the pre-migration window.

🟡 Non-blocking (carried from prior pass)

  • The user join in queryMembers is also unpinned; the migration converts user so it is fixed at startup, but during the rollout window (new binary, migration not yet applied) space_member ⋈ user is unprotected. The (b) pin would close this too.
  • ALTER ... CONVERT TO CHARACTER SET rebuilds the table under a metadata lock (no ALGORITHM=INPLACE/INSTANT); fine at current sizes and consistent with the prior migration, but worth an off-hours/maintenance-window note for large user tables.
  • A lightweight migration content-preservation test (assert the repair list keeps the intended tables) would guard re-run stability and catch a future dropped table.

✅ What is correct

  • The idempotent stored-proc pattern (INFORMATION_SCHEMA collation check, skip-if-already-target, DROP PROCEDURE IF EXISTS before/after) is sound and reuses the established 20260512 recipe.
  • The uv-join COLLATE pin and its rationale comment are correct for that join.
  • utf8mb4 → utf8mb4 conversion is metadata-only at the row level (no truncation); general_ci vs 0900_ai_ci are both case/accent-insensitive on the ASCII UID keys, so no new uniqueness collision on UNIQUE(space_id,uid); no FK constraints on these tables, so no FK-collation ALTER failure.

The fix direction is right and the migration scaffolding is solid — it just needs to cover all of space_member's join partners (the robot table) so the conversion completes the repair instead of shifting it. Credit to OctoBoooot and mochashanyao for catching the robot-join regression.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #489 (octo-server)

Reviewed at head SHA 051815722a160a3ef722ba274026f8f1b0e280b9 against main.

Summary

The PR correctly diagnoses #482: on MySQL 8 deployments whose DB default is utf8mb4_0900_ai_ci, space_member and user were created without an explicit COLLATE and so inherited 0900_ai_ci, while the 20260512 compat-repair migration forced user_verification to utf8mb4_general_ci. That asymmetry is what throws Error 1267 on the space_member ⋈ user_verification join in queryMembers.

The fix is the right idea but incomplete in a way that moves the same error to a new join instead of removing it. space_member is joined to three tables in this code path, not one.

Blocking issue (P1) — converting space_member/user introduces a new Error 1267 on the robot join

queryMembers joins space_member to user, user_verification, and robot:

modules/space/db.go:162  LEFT JOIN user u ON u.uid=sm.uid
modules/space/db.go:163  LEFT JOIN user_verification uv ON uv.user_id=sm.uid COLLATE utf8mb4_general_ci
modules/space/db.go:164  LEFT JOIN robot r ON r.robot_id=sm.uid

This migration converts space_member and user to utf8mb4_general_ci, but robot is not converted:

  • New migration converts only space_member and usermodules/base/sql/20260627000001_base_collation_repair_space_user.sql:26-40
  • The 20260512 repair cursor list does not include robot (it only repairs a robot index, not the table collation) — modules/base/sql/20260512000001_base_oss_compat_repair.sql:56-67 and :98-115
  • robot source DDL has no explicit charset/collation, so it also inherits 0900_ai_ci on MySQL 8 — modules/robot/sql/20210926000001_robot_legacy01.sql:6-17

Net effect on a 0900_ai_ci deployment:

  • Before this PR: space_member, user, robot are all 0900_ai_ci → the robot join is fine; only the user_verification join is broken (= #482).
  • After this PR: space_member/user become general_ci while robot stays 0900_ai_ciLEFT JOIN robot r ON r.robot_id=sm.uid now throws Error 1267.

So the migration trades one collation mismatch for another, and the defensive COLLATE was added only to the uv join (db.go:163), not the robot join (db.go:164) — so it does not protect this path either.

The same space_member/userrobot pairing exists in several other queries that will break the same way after the migration:

  • querySpaceBotsmodules/space/db.go:354 (INNER JOIN robot r ON r.robot_id=sm.uid)
  • searchMembers builds the same space_member+robot join — modules/space/db.go:184
  • Space bot listing — modules/robot/api.go:1289-1290
  • Integration bot → space lookup — modules/integration/db.go:159
  • BotFather robots-by-creator-in-space — modules/botfather/db.go:78
  • userrobot lookups: modules/user/db_pinned.go:72, modules/group/bot_ownership.go:68, modules/user/api.go:4043, modules/bot_api/obo_db.go:1626

Recommended fix

Two coherent options; either resolves the P1:

  1. Add robot to the table-conversion set (and audit any other *_id/uid join partner of space_member/user). This is the durable fix — it makes the join graph collation-consistent. Note the next-domino risk below.
  2. Pin COLLATE utf8mb4_general_ci on every column-column equality that crosses these tables (the robot join in queryMembers/querySpaceBots/searchMembers, etc.), matching what was done for uv. This survives the pre-migration window but is easy to miss a site.

Preferably both: convert the tables for correctness, and keep the defensive COLLATE on the hot path joins for the deploy-ordering window.

Next-domino risk if you only add robot (P2)

A table-only fix that adds just robot can shift the mismatch to other legacy UID tables that also lack an explicit collation and are joined to robot/user, e.g. group_member and friend:

  • group_member.uid = robot.robot_idmodules/robot/db.go:224
  • friend.to_uid = user.uid / robot.robot_idmodules/robot/api.go:1401-1402

The safest path is a single coherent collation pass over all UID-join tables rather than table-by-table patching.

Other findings (non-blocking)

P2 — migration ordering creates a transient split-brain window. The procedure alters space_member before user. During that window space_member is general_ci while user is still 0900_ai_ci, so any concurrent space_member ⋈ user query (e.g. queryMembers) can throw Error 1267 mid-migration. modules/base/sql/20260627000001_base_collation_repair_space_user.sql:27-40. This is transient (clears once both ALTERs finish) but worth noting for prod rollout.

P2 — locking / online-DDL. ALTER TABLE ... CONVERT TO CHARACTER SET rebuilds the table and holds a metadata lock; on large user/space_member tables this can block writes and induce replication lag. The 20260512 migration already documents this and tells operators to run off-hours — carry the same operator note here, and consider ALGORITHM=INPLACE, LOCK=NONE where supported. modules/base/sql/20260627000001_base_collation_repair_space_user.sql:28,36.

P2 — idempotency granularity. The guard reads TABLE_COLLATION from information_schema.TABLES. If a prior partial/manual DDL changed the table default but left individual VARCHAR columns (e.g. uid) at the old collation, the script skips the table and leaves the join still broken. Checking information_schema.COLUMNS for the join columns would be more robust. modules/base/sql/20260627000001_base_collation_repair_space_user.sql:23-26,31-34.

What I could not verify

  • No live MySQL 0900_ai_ci instance to reproduce Error 1267 end-to-end; analysis is static (matches the PR's own note that CI runs general_ci and can't cover this).
  • Production table sizes for user/space_member (lock-time impact) are unknown.
  • I traced the main space_member/user join partners but did not claim exhaustive coverage of every *_id equality in the repo.

Verdict

The migration is well-structured and idempotent, and the diagnosis is correct, but as written it relocates Error 1267 to the space_member/userrobot joins on the exact MySQL 8 deployments it targets, affecting queryMembers, querySpaceBots, and several bot/integration paths. Requesting changes to bring robot (and any other UID-join partner) into the same collation and/or pin COLLATE on the remaining cross-collation joins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🔴 [P0] queryMembers user_verification JOIN 在非规范 collation 部署上 Error 1267 → 成员列表 500 (#420 回归)

5 participants