Skip to content

[AAP-70854] Optimize JWT claims reconciliation performance#979

Open
john-westcott-iv wants to merge 2 commits intoansible:develfrom
john-westcott-iv:AAP-70854-jwt-claims
Open

[AAP-70854] Optimize JWT claims reconciliation performance#979
john-westcott-iv wants to merge 2 commits intoansible:develfrom
john-westcott-iv:AAP-70854-jwt-claims

Conversation

@john-westcott-iv
Copy link
Copy Markdown
Member

@john-westcott-iv john-westcott-iv commented Apr 13, 2026

Summary

  • Wrap save_user_claims() in no_reverse_sync() to suppress redundant HTTP calls back to Gateway during stub Org/Team creation triggered by get_or_create_resource() post_save signals
  • Wrap in transaction.atomic() so partial failures roll back cleanly instead of leaving users in an inconsistent permission state (especially important for services without ATOMIC_REQUESTS)
  • Add defer_role_evaluation() context manager in triggers.py that collects all affected ObjectRole instances and processes compute_object_role_permissions() in a single batch on context exit, reducing N individual compute+write cycles to one
  • Add timing and grant/removal counters to save_user_claims() and process_rbac_permissions() for production observability of claims reconciliation duration

Problem

When a user with many role assignments (e.g., superusers with hundreds of explicit assignments) authenticates via JWT to a service, save_user_claims() executes N individual give_permission()/remove_permission() calls. Each call triggers:

  1. An individual RoleEvaluation cache recompute (compute_object_role_permissions())
  2. Potential post_save signals from stub Org/Team creation that fire HTTP reverse sync calls back to Gateway

This caused timeouts during authentication and occasional permission loss when reconciliation didn't complete.

Changes

ansible_base/rbac/triggers.py

  • New _DeferredEvaluationState class and defer_role_evaluation() context manager
  • Modified update_after_assignment() to defer when evaluation is active
  • Supports nesting by saving/restoring previous state

ansible_base/rbac/claims.py

  • save_user_claims() now wrapped in no_reverse_sync(), transaction.atomic(), and defer_role_evaluation()
  • Added grant/removal counters and timing log

ansible_base/jwt_consumer/common/auth.py

  • Added timing breakdown in process_rbac_permissions(): fetch time, save time, total time

New Tests

  • test_app/tests/rbac/test_defer_role_evaluation.py — 14 tests covering context manager state, nesting, exception handling, deferred batching, and integration with give/remove permission
  • test_app/tests/rbac/test_save_user_claims_optimizations.py — 6 tests verifying atomicity, reverse sync suppression, deferred evaluation, and logging
  • test_app/tests/jwt_consumer/common/test_auth_timing.py — 4 tests verifying timing logs in process_rbac_permissions

Test plan

  • All 184 targeted tests pass (ran with py312-sqlite)
  • Existing test_claims.py, test_triggers.py, and test_auth.py tests continue to pass
  • CI pipeline passes
  • Verify timing logs appear in service logs during JWT authentication

Note: This PR was developed with assistance from Claude AI assistant.

Made with Cursor

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved robustness of role-based access control (RBAC) operations by detecting and handling stale data conditions in concurrent scenarios.
    • Enhanced transaction atomicity for permission management operations.
  • Performance

    • Added timing instrumentation to JWT claims reconciliation and permission synchronization for diagnostic insights.
  • Tests

    • Comprehensive test coverage added for RBAC reliability, caching behavior, and optimization features.

john-westcott-iv and others added 2 commits March 30, 2026 13:46
…er concurrency

compute_team_member_roles() collects ObjectRole IDs in a read phase then
writes them to the M2M provides_teams table.  A concurrent committed
transaction (e.g. org deletion) can delete those ObjectRoles between the
two phases, causing a foreign key violation on
dab_rbac_objectrole_provides_teams.

Wrap the M2M add() and RoleEvaluation bulk_create() calls in savepoints
so IntegrityError from stale FK references is caught without aborting the
outer transaction.  On failure, re-filter to only IDs that still exist
and retry.

Co-authored-by: Claude <[email protected]>
Made-with: Cursor
save_user_claims() previously executed N individual DB writes and
RoleEvaluation cache recomputes when reconciling permissions from
Gateway. For users with many role assignments this caused timeouts
and occasional permission loss.

Changes:
- Wrap save_user_claims() in no_reverse_sync() to suppress redundant
  HTTP calls back to Gateway during stub Org/Team creation
- Wrap in transaction.atomic() so partial failures roll back cleanly
- Add defer_role_evaluation() context manager that batches all
  RoleEvaluation cache updates into a single pass on context exit
- Add timing and grant/removal counters to save_user_claims() and
  process_rbac_permissions() for production observability

Co-authored-by: Claude <[email protected]>
Made-with: Cursor
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

The changes introduce timing instrumentation for JWT claim reconciliation, add stale object-role foreign-key error recovery with retry logic in RBAC caching, wrap claims synchronization in transactions with grant/removal tracking, and implement a deferred role-evaluation context manager to batch cache updates.

Changes

Cohort / File(s) Summary
Timing Instrumentation
ansible_base/jwt_consumer/common/auth.py
Added time module import and monotonic timing tracking (fetch, save, total elapsed) for JWT claims-hash mismatch reconciliation, logged in a single info message after persistence.
RBAC Stale FK Recovery
ansible_base/rbac/caching.py
Added _is_stale_objectrole_fk() detector for PostgreSQL FK constraint violations, plus _safe_m2m_add() and _safe_bulk_create_evaluations() helpers that wrap operations in transactions, catch stale-FK errors, filter invalid references, and retry; updated compute_team_member_roles() and compute_object_role_permissions() to use these wrappers.
Transaction-Wrapped Claims Sync
ansible_base/rbac/claims.py
Wrapped entire save_user_claims() RBAC flow in transaction.atomic() with context managers no_reverse_sync() and defer_role_evaluation(); added grant/removal counters and summary logging with elapsed time.
Deferred Role Evaluation
ansible_base/rbac/triggers.py
Added defer_role_evaluation() context manager with internal _DeferredEvaluationState singleton; extended update_after_assignment() to accumulate object-role updates and team-recompute flags when deferral is active, then batch-execute on context exit.
Auth Timing Tests
test_app/tests/jwt_consumer/common/test_auth_timing.py
New pytest module validating timing log output from process_rbac_permissions() during successful reconciliation; asserts "Claims reconciliation" message includes user ID and timing components; verifies log is absent on cache hits, early returns, or gateway fetch exceptions.
Caching Error Handling Tests
test_app/tests/rbac/test_caching.py
New module testing stale-FK detection and recovery: validates _is_stale_objectrole_fk() for PostgreSQL FK errors, exercises _safe_m2m_add() retry-after-filtering and persistent-failure logging, and tests _safe_bulk_create_evaluations() selective retry of only valid evaluations.
Deferred Evaluation Tests
test_app/tests/rbac/test_defer_role_evaluation.py
New module verifying defer_role_evaluation() context manager enables/disables deferral, supports nesting, batches update_after_assignment() calls, and correctly batch-invokes compute_team_member_roles() and compute_object_role_permissions() on exit; includes integration tests for give_permission() and remove_permission() under deferred batching.
Save Claims Optimization Tests
test_app/tests/rbac/test_save_user_claims_optimizations.py
New module asserting save_user_claims() transactional atomicity (no partial writes on exception), verifying reverse_sync_enabled and _deferred_evaluation.enabled flags are correctly set during execution, and logging exactly one summary message per call with grant/removal counts and elapsed time.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.34% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main change: JWT claims reconciliation performance optimization through batching and transaction improvements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

DVCS PR Check Results:

PR appears valid (JIRA key(s) found)

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ansible_base/rbac/triggers.py`:
- Around line 29-78: The current _deferred_evaluation is process-global and
flushes work on every context exit, causing cross-request races and incorrect
nested behavior; change defer_role_evaluation() to use request-scoped
(thread-local or contextvar) state with a nesting counter (e.g., add a depth
attribute on the state), have inner contexts merge their object_roles and
needs_team_recompute into the parent state instead of running computes, and only
when the outermost context exits successfully (no exception) run
compute_team_member_roles() and
compute_object_role_permissions(object_roles=...) and then clear the state;
update references to _deferred_evaluation in defer_role_evaluation() (and any
callers like update_after_assignment) to use the new thread-local/context-local
state and ensure computes are skipped if an exception occurred in the context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ff1120e3-8eeb-4c24-b29b-4a6d2d371467

📥 Commits

Reviewing files that changed from the base of the PR and between 20369e1 and 3c26729.

📒 Files selected for processing (8)
  • ansible_base/jwt_consumer/common/auth.py
  • ansible_base/rbac/caching.py
  • ansible_base/rbac/claims.py
  • ansible_base/rbac/triggers.py
  • test_app/tests/jwt_consumer/common/test_auth_timing.py
  • test_app/tests/rbac/test_caching.py
  • test_app/tests/rbac/test_defer_role_evaluation.py
  • test_app/tests/rbac/test_save_user_claims_optimizations.py

Comment on lines +29 to +78
class _DeferredEvaluationState:
"""Tracks whether RoleEvaluation cache updates should be deferred.

Used by defer_role_evaluation() to collect object_roles that need
recomputation and process them in a single batch on context exit.
"""

def __init__(self):
self.enabled = False
self.object_roles = set()
self.needs_team_recompute = False


_deferred_evaluation = _DeferredEvaluationState()


@contextmanager
def defer_role_evaluation():
"""Defer RoleEvaluation cache updates until the context exits.

During bulk operations like save_user_claims(), each give_permission() /
remove_permission() call triggers compute_object_role_permissions()
individually. This context manager collects all affected object_roles
and processes them in a single batch when the context exits, reducing
N individual compute+write cycles to one.

Supports nesting by saving and restoring previous state.
"""
previous_enabled = _deferred_evaluation.enabled
previous_roles = _deferred_evaluation.object_roles
previous_teams = _deferred_evaluation.needs_team_recompute

_deferred_evaluation.enabled = True
_deferred_evaluation.object_roles = set()
_deferred_evaluation.needs_team_recompute = False
try:
yield
finally:
deferred_roles = _deferred_evaluation.object_roles
deferred_teams = _deferred_evaluation.needs_team_recompute

_deferred_evaluation.enabled = previous_enabled
_deferred_evaluation.object_roles = previous_roles
_deferred_evaluation.needs_team_recompute = previous_teams

if deferred_teams:
compute_team_member_roles()
if deferred_roles:
compute_object_role_permissions(object_roles=deferred_roles)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Make deferred evaluation state request-scoped and flush only on the outermost successful exit.

_deferred_evaluation is process-global, so one thread can flip enabled while another thread is in update_after_assignment(), causing cross-request batching and skipped recomputes. The current nesting logic also computes on inner exit instead of merging the inner batch back into the outer context, so the advertised nested behavior is broken and exception paths still do work that the surrounding transaction may roll back. Use thread-local/context-local state here and only run compute_* when the outermost context exits without an error.

Suggested fix
+import threading
 from contextlib import contextmanager
 from typing import Union
 from uuid import UUID
@@
-class _DeferredEvaluationState:
+class _DeferredEvaluationState(threading.local):
     """Tracks whether RoleEvaluation cache updates should be deferred.
@@
 `@contextmanager`
 def defer_role_evaluation():
@@
     previous_enabled = _deferred_evaluation.enabled
     previous_roles = _deferred_evaluation.object_roles
     previous_teams = _deferred_evaluation.needs_team_recompute
@@
+    completed = False
     try:
         yield
+        completed = True
     finally:
         deferred_roles = _deferred_evaluation.object_roles
         deferred_teams = _deferred_evaluation.needs_team_recompute
 
         _deferred_evaluation.enabled = previous_enabled
         _deferred_evaluation.object_roles = previous_roles
         _deferred_evaluation.needs_team_recompute = previous_teams
 
-        if deferred_teams:
-            compute_team_member_roles()
-        if deferred_roles:
-            compute_object_role_permissions(object_roles=deferred_roles)
+        if previous_enabled:
+            previous_roles.update(deferred_roles)
+            _deferred_evaluation.needs_team_recompute = previous_teams or deferred_teams
+        elif completed:
+            if deferred_teams:
+                compute_team_member_roles()
+            if deferred_roles:
+                compute_object_role_permissions(object_roles=deferred_roles)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ansible_base/rbac/triggers.py` around lines 29 - 78, The current
_deferred_evaluation is process-global and flushes work on every context exit,
causing cross-request races and incorrect nested behavior; change
defer_role_evaluation() to use request-scoped (thread-local or contextvar) state
with a nesting counter (e.g., add a depth attribute on the state), have inner
contexts merge their object_roles and needs_team_recompute into the parent state
instead of running computes, and only when the outermost context exits
successfully (no exception) run compute_team_member_roles() and
compute_object_role_permissions(object_roles=...) and then clear the state;
update references to _deferred_evaluation in defer_role_evaluation() (and any
callers like update_after_assignment) to use the new thread-local/context-local
state and ensure computes are skipped if an exception occurred in the context.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@huffmanca huffmanca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this — the batching approach and TOCTOU guards are solid. One thing I noticed:

defer_role_evaluation() runs compute_* in two cases where it doesn't need to

The finally block in defer_role_evaluation() unconditionally calls compute_team_member_roles() / compute_object_role_permissions():

  1. On inner context exit when nested — if previous_enabled is True, the inner context flushes its roles immediately instead of merging them into the parent's set. This defeats the batching optimization (though nesting isn't used in any current call site, so it's theoretical today).

  2. On exception — since save_user_claims() wraps everything in transaction.atomic(), an exception rolls back the DB changes, but compute_* still runs against that rolled-back state. The results get corrected on next recompute, but it's throwaway work.

Neither is a correctness issue — compute_* is idempotent and the system self-corrects. But both are straightforward to fix by checking whether to merge vs. flush:

completed = False
try:
    yield
    completed = True
finally:
    deferred_roles = _deferred_evaluation.object_roles
    deferred_teams = _deferred_evaluation.needs_team_recompute
    _deferred_evaluation.enabled = previous_enabled
    _deferred_evaluation.object_roles = previous_roles
    _deferred_evaluation.needs_team_recompute = previous_teams

    if previous_enabled:
        previous_roles.update(deferred_roles)
        _deferred_evaluation.needs_team_recompute |= deferred_teams
    elif completed:
        if deferred_teams:
            compute_team_member_roles()
        if deferred_roles:
            compute_object_role_permissions(object_roles=deferred_roles)

@AlanCoding
Copy link
Copy Markdown
Member

Wrap save_user_claims() in no_reverse_sync() to suppress redundant HTTP calls back to Gateway during stub Org/Team creation triggered by get_or_create_resource() post_save signals

This is bananas. I can hardly believe this was actually happening. This is gargantuan red flag territory.

Maybe it would be better to cut out a smaller portion of this patch doing this stuff. Because the rest of it goes really deep, and slapping on no_reverse_sync is super easy to get moving, and would be good to measure improvement for independent of the rest. In terms of priority of the patch, there's an extremely clear cliff here, so splitting it up is often a good idea in those cases.

@AlanCoding
Copy link
Copy Markdown
Member

This also seems to directly overlap with my attempt at bulk re-computation here.

#840

You did defer_role_evaluation(), compare to bulk_rbac_caching() there. These are mostly trying to do the same thing, but there are 1 or 2 extremely critical logical differences. The other one throws up its hands and does the global compute_object_role_permissions() if any team roles change, whereas yours doesn't. That, and other things, need more thought about what gets parity with what already exists as the incremental update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants