PR: Fix bootstrap_resource_assignments race condition on concurrent pod restart #5003

Open: rakdutta wants to merge 7 commits into main from bugfix/bootstrap-resource-assignments-uniqueviolation

@rakdutta (Collaborator) commented Jun 2, 2026

Summary

Closes #4993. Prevents race conditions when multiple gateway pods concurrently assign orphaned resources during startup, while preserving the PgBouncer compatibility fix from #4051.

Problem

When multiple replicas restart simultaneously (e.g., during a Kubernetes rolling deployment), they race to assign orphaned resources to the admin team, and the losing replicas crash with IntegrityError from unique constraint violations.

Why not use advisory locks?
Issue #4051 identified that advisory locks on the fast path cause indefinite hangs with PgBouncer in transaction pooling mode. Session-scoped locks get orphaned across backend handoffs, causing pods to spin for ~10 minutes until timeout. This was fixed in commit e4b245f by skipping advisory locks on the fast path.

Solution

Use per-row commits with IntegrityError exception handling instead of locks:

for resource in unassigned:
    resource.team_id = personal_team.id
    # ... set other fields ...
    
    try:
        db.commit()  # Per-row commit
        assigned_count += 1
    except IntegrityError:
        db.rollback()  # Another worker assigned it - skip gracefully
        continue

Key insight: Let the database enforce uniqueness constraints. Handle exceptions gracefully instead of preventing them with locks.
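
The pattern above can be tried in isolation. The following is a minimal sketch, not the code in mcpgateway/bootstrap_db.py: it uses the standard-library sqlite3 module instead of the project's database session, and the `resources` table, its `UNIQUE (team_id, name)` constraint, and the `assign_orphans` helper are all hypothetical names chosen for illustration.

```python
import sqlite3


def assign_orphans(conn, team_id):
    """Assign rows without a team to team_id, committing one row at a time.

    Hypothetical helper: a row that violates the unique constraint is
    rolled back and skipped instead of aborting the whole run.
    """
    orphans = conn.execute(
        "SELECT id FROM resources WHERE team_id IS NULL ORDER BY id"
    ).fetchall()
    assigned, skipped = 0, 0
    for (resource_id,) in orphans:
        try:
            conn.execute(
                "UPDATE resources SET team_id = ? WHERE id = ?",
                (team_id, resource_id),
            )
            conn.commit()  # Per-row commit
            assigned += 1
        except sqlite3.IntegrityError:
            conn.rollback()  # Conflicting row already exists: skip it
            skipped += 1
    return assigned, skipped


conn = sqlite3.connect(":memory:")
# Assumed schema: a name may appear only once per team.
conn.execute(
    "CREATE TABLE resources ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, team_id TEXT, "
    "UNIQUE (team_id, name))"
)
conn.executemany(
    "INSERT INTO resources (name, team_id) VALUES (?, ?)",
    [
        ("alpha", None),
        ("beta", None),
        ("beta", "team-admin"),  # stands in for a row another worker assigned
        ("gamma", None),
    ],
)
conn.commit()

result = assign_orphans(conn, "team-admin")
remaining = conn.execute(
    "SELECT COUNT(*) FROM resources WHERE team_id IS NULL"
).fetchone()[0]
```

In this sketch the conflicting row is left unassigned and the loop continues with the next one, so a single violation costs one rollback rather than the whole batch.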

Changes

  • mcpgateway/bootstrap_db.py: Added IntegrityError import, changed batch commit to per-row commits with exception handling
  • tests/unit/mcpgateway/test_bootstrap_db.py: Updated test to verify fast path remains lock-free
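
One way a lock-free check could look is sketched below. This is an assumption about the shape of such a test, not the updated test in tests/unit/mcpgateway/test_bootstrap_db.py: `assign_orphans_fast_path` is a hypothetical stand-in for the fast path, and the session is a `MagicMock` rather than a real database session.

```python
from unittest.mock import MagicMock


def assign_orphans_fast_path(db, orphans, team_id):
    """Hypothetical stand-in for the fast path: per-row commits, no lock SQL."""
    assigned = 0
    for resource in orphans:
        resource.team_id = team_id
        db.commit()
        assigned += 1
    return assigned


def executed_sql(db):
    """Return every statement passed to db.execute, lowercased."""
    return [str(call.args[0]).lower() for call in db.execute.call_args_list]


db = MagicMock()
orphans = [MagicMock(team_id=None) for _ in range(3)]

assigned = assign_orphans_fast_path(db, orphans, "team-admin")

# The fast path should never issue an advisory lock statement.
lock_statements = [sql for sql in executed_sql(db) if "pg_advisory" in sql]
```

A check like this fails as soon as any `pg_advisory...` statement is routed through `db.execute`, which is the regression that #4051 was about.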

Testing

Concurrency test (scripts/test_issue_4993_fix.sh):

  • 5 test runs × 30 concurrent workers × 30 orphaned resources
  • Result: ✅ 0 races detected (0/5 failures)

Unit tests:

Performance Impact

  • Old: ~3 database round-trips (batch commit)
  • New: ~N+2 round-trips (N per-row commits)
  • Impact: +5-10 seconds for 30 resources during startup
  • Acceptable because: Bootstrap-only code (not request path), N=0 after first boot

| Approach | Performance | PgBouncer Safe | Race Safe |
| --- | --- | --- | --- |
| Batch commit | Fast (3 ops) | | |
| Advisory lock | Fast (3 ops) | ❌ (#4051) | |
| Per-row commits | Slower (N+2) | | |
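
The round-trip figures above can be worked through for the tested case of 30 resources. The sketch below only restates the arithmetic from this description; it assumes the quoted +5-10 seconds comes entirely from the extra round-trips and is spread evenly across them, which the description does not state.

```python
def round_trips(n_resources, per_row):
    """Approximate round-trips as given above: ~3 for batch, ~N+2 for per-row."""
    return n_resources + 2 if per_row else 3


n = 30
batch = round_trips(n, per_row=False)
per_row = round_trips(n, per_row=True)
extra = per_row - batch  # additional round-trips introduced by per-row commits

# Assumption: the +5 to +10 s startup cost is spread evenly over the extra trips.
low_ms = 5_000 / extra
high_ms = 10_000 / extra
```

Under those assumptions, 30 resources mean 32 round-trips instead of 3, and each of the 29 additional round-trips would account for roughly 170 to 350 ms.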

Deployment Impact

Related Issues

rakdutta added 2 commits June 2, 2026 12:32
…ock to prevent race conditions (fixes #4993)

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
@rakdutta rakdutta marked this pull request as ready for review June 2, 2026 07:15
@rakdutta rakdutta added the bug (Something isn't working) and ica (ICA related issues) labels Jun 2, 2026
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
@rakdutta rakdutta marked this pull request as draft June 2, 2026 08:48
… for resource assignments

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
@rakdutta rakdutta marked this pull request as ready for review June 2, 2026 09:15
rakdutta added 3 commits June 2, 2026 14:51
…se conflicts in per-row commits

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
…leanup, and conflict tracking

Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>
Signed-off-by: Rakhi Dutta <rakhibiswas@yahoo.com>

Labels

bug (Something isn't working), ica (ICA related issues)

Development

Successfully merging this pull request may close these issues.

[BUG]: bootstrap_resource_assignments UniqueViolation on concurrent pod restart

1 participant