Summary
After PR #4444 lands the L1 fast-path in bootstrap_db.main(), the post-migration bootstrap (_run_post_migration_bootstrap) runs without holding the migration advisory lock on every replica that takes the fast path. Two of the seed steps it runs are race-prone read-then-insert patterns with no DB-tier uniqueness:
bootstrap_default_roles() → RoleService.create_role() for each of the 5 system roles
RoleService.assign_role_to_user() for the platform_admin assignment
Role.id and UserRole.id are client-generated UUIDs, so PK collision never saves us. Concurrent fast-path workers can each pass the existence check and each insert. Once duplicates exist, the next get_role_by_name / get_user_role_assignment raises MultipleResultsFound and downstream RBAC checks return 500.
Trigger conditions
- Multi-replica gateway deployment OR multi-worker single-pod gunicorn
- Schema already at Alembic head (so the fast-path probe returns true)
- One of:
migration.enabled=false in the chart, or non-Helm deployments where each gateway pod runs its own bootstrap
- First startup of a fresh DB (when system role rows do not yet exist)
The chart default (migration.enabled=true + MCPGATEWAY_SKIP_MIGRATIONS=true on gateway pods) is not affected — the migration Job is the only writer in that configuration.
Impact
Original reviewer text (verbatim from PR #4444)
2. High: the new fast-path makes system-role bootstrap non-serialised, but role creation is still a read-then-insert with no database uniqueness. mcpgateway/bootstrap_db.py:139 and mcpgateway/bootstrap_db.py:782 now run bootstrap_default_roles() outside the advisory lock; that function still does get_role_by_name() then create_role() at mcpgateway/bootstrap_db.py:512 and mcpgateway/services/role_service.py:196, while the roles table has no (name, scope) uniqueness in mcpgateway/db.py:1126 or the original migration. Two workers recovering a half-bootstrapped DB can both create platform_admin/team_admin/etc., and once duplicates exist, mcpgateway/services/role_service.py:251 starts throwing MultipleResultsFound.
3. Medium: the same unlocked path can also duplicate the admin's platform_admin assignment. mcpgateway/bootstrap_db.py:547 does another read-then-insert through mcpgateway/services/role_service.py:677 and mcpgateway/services/role_service.py:604, but user_roles also has no uniqueness guard in mcpgateway/db.py:1168 or its migration. If the role exists but the assignment is missing, concurrent fast-path workers can insert multiple active rows, and later lookups will fail the same way with MultipleResultsFound.
Proposed fix (two-layer)
- DB tier — partial unique indexes
WHERE is_active = true:
roles(name, scope)
user_roles(user_email, role_id, scope) for scope_id IS NULL
user_roles(user_email, role_id, scope, scope_id) for scope_id IS NOT NULL
- (split for nullable
scope_id because Postgres and SQLite both treat NULL as distinct in plain unique indexes)
- App tier — savepoint + refetch: wrap inserts in
db.begin_nested(); on IntegrityError, rollback the savepoint and return the winner's row instead of surfacing the error.
- Alembic migration: idempotent dedupe-then-constrain — soft-delete duplicate active rows (oldest by
created_at / granted_at wins) before adding the indexes so the migration is safe on databases that have already raced. Audit history preserved.
Implemented in PR #4480 (draft, base = PR #4444's branch — auto-retargets to main after #4444 merges).
Acceptance criteria
Summary
After PR #4444 lands the L1 fast-path in
bootstrap_db.main(), the post-migration bootstrap (_run_post_migration_bootstrap) runs without holding the migration advisory lock on every replica that takes the fast path. Two of the seed steps it runs are race-prone read-then-insert patterns with no DB-tier uniqueness:bootstrap_default_roles()→RoleService.create_role()for each of the 5 system rolesRoleService.assign_role_to_user()for the platform_admin assignmentRole.idandUserRole.idare client-generated UUIDs, so PK collision never saves us. Concurrent fast-path workers can each pass the existence check and each insert. Once duplicates exist, the nextget_role_by_name/get_user_role_assignmentraisesMultipleResultsFoundand downstream RBAC checks return 500.Trigger conditions
migration.enabled=falsein the chart, or non-Helm deployments where each gateway pod runs its own bootstrapThe chart default (
migration.enabled=true+MCPGATEWAY_SKIP_MIGRATIONS=trueon gateway pods) is not affected — the migration Job is the only writer in that configuration.Impact
roles) / MEDIUM (user_roles) per reviewer triage on PR fix(bootstrap): improve startup reliability for multi-replica deploys #4444created_atOriginal reviewer text (verbatim from PR #4444)
Proposed fix (two-layer)
WHERE is_active = true:roles(name, scope)user_roles(user_email, role_id, scope)forscope_id IS NULLuser_roles(user_email, role_id, scope, scope_id)forscope_id IS NOT NULLscope_idbecause Postgres and SQLite both treat NULL as distinct in plain unique indexes)db.begin_nested(); onIntegrityError, rollback the savepoint and return the winner's row instead of surfacing the error.created_at/granted_atwins) before adding the indexes so the migration is safe on databases that have already raced. Audit history preserved.Implemented in PR #4480 (draft, base = PR #4444's branch — auto-retargets to
mainafter #4444 merges).Acceptance criteria
rolesoruser_rolesrows