Skip to content

[BUG] RBAC role/user_role seeder race when fast-path skips advisory lock #4482

@gandhipratik203

Description

@gandhipratik203

Summary

After PR #4444 lands the L1 fast-path in bootstrap_db.main(), the post-migration bootstrap (_run_post_migration_bootstrap) runs without holding the migration advisory lock on every replica that takes the fast path. Two of the seed steps it runs are race-prone read-then-insert patterns with no DB-tier uniqueness:

  1. bootstrap_default_roles()RoleService.create_role() for each of the 5 system roles
  2. RoleService.assign_role_to_user() for the platform_admin assignment

Role.id and UserRole.id are client-generated UUIDs, so PK collision never saves us. Concurrent fast-path workers can each pass the existence check and each insert. Once duplicates exist, the next get_role_by_name / get_user_role_assignment raises MultipleResultsFound and downstream RBAC checks return 500.

Trigger conditions

  • Multi-replica gateway deployment OR multi-worker single-pod gunicorn
  • Schema already at Alembic head (so the fast-path probe returns true)
  • One of: migration.enabled=false in the chart, or non-Helm deployments where each gateway pod runs its own bootstrap
  • First startup of a fresh DB (when system role rows do not yet exist)

The chart default (migration.enabled=true + MCPGATEWAY_SKIP_MIGRATIONS=true on gateway pods) is not affected — the migration Job is the only writer in that configuration.

Impact

Original reviewer text (verbatim from PR #4444)

2. High: the new fast-path makes system-role bootstrap non-serialised, but role creation is still a read-then-insert with no database uniqueness. mcpgateway/bootstrap_db.py:139 and mcpgateway/bootstrap_db.py:782 now run bootstrap_default_roles() outside the advisory lock; that function still does get_role_by_name() then create_role() at mcpgateway/bootstrap_db.py:512 and mcpgateway/services/role_service.py:196, while the roles table has no (name, scope) uniqueness in mcpgateway/db.py:1126 or the original migration. Two workers recovering a half-bootstrapped DB can both create platform_admin/team_admin/etc., and once duplicates exist, mcpgateway/services/role_service.py:251 starts throwing MultipleResultsFound.

3. Medium: the same unlocked path can also duplicate the admin's platform_admin assignment. mcpgateway/bootstrap_db.py:547 does another read-then-insert through mcpgateway/services/role_service.py:677 and mcpgateway/services/role_service.py:604, but user_roles also has no uniqueness guard in mcpgateway/db.py:1168 or its migration. If the role exists but the assignment is missing, concurrent fast-path workers can insert multiple active rows, and later lookups will fail the same way with MultipleResultsFound.

Proposed fix (two-layer)

  • DB tier — partial unique indexes WHERE is_active = true:
    • roles(name, scope)
    • user_roles(user_email, role_id, scope) for scope_id IS NULL
    • user_roles(user_email, role_id, scope, scope_id) for scope_id IS NOT NULL
    • (split for nullable scope_id because Postgres and SQLite both treat NULL as distinct in plain unique indexes)
  • App tier — savepoint + refetch: wrap inserts in db.begin_nested(); on IntegrityError, rollback the savepoint and return the winner's row instead of surfacing the error.
  • Alembic migration: idempotent dedupe-then-constrain — soft-delete duplicate active rows (oldest by created_at / granted_at wins) before adding the indexes so the migration is safe on databases that have already raced. Audit history preserved.

Implemented in PR #4480 (draft, base = PR #4444's branch — auto-retargets to main after #4444 merges).

Acceptance criteria

  • Multi-replica gateway startup on a fresh DB cannot produce duplicate active roles or user_roles rows
  • The fix is enforced at the DB tier (partial unique index), not just in application code
  • Existing deployments with pre-existing duplicate active rows can still apply the migration without manual cleanup
  • Integration test pins the invariant under real concurrency

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingsecurityImproves securitytriageIssues / Features awaiting triage

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions