Skip to content

Merging to release-5.8: [TT-15954]: Make org session fetch non-blocking (#7531)#7582

Merged
maciejwojciechowski merged 1 commit intorelease-5.8from
merge/release-5.8/30461c446125a55f6b44798aed33183c46954964/TT-15954
Dec 2, 2025
Merged

Merging to release-5.8: [TT-15954]: Make org session fetch non-blocking (#7531)#7582
maciejwojciechowski merged 1 commit intorelease-5.8from
merge/release-5.8/30461c446125a55f6b44798aed33183c46954964/TT-15954

Conversation

@probelabs
Copy link
Copy Markdown
Contributor

@probelabs probelabs Bot commented Dec 2, 2025

User description

TT-15954: Make org session fetch non-blocking (#7531)

Description

Fixes request pipeline blocking when MDCB is unavailable by making org
session fetches non-blocking in RPC mode.

Related Issue

TT-15954

Motivation and Context

When MDCB is unavailable, synchronous RPC calls to fetch org sessionsin
OrganizationMonitor were blocking the request pipeline for 90-120
seconds

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing
    functionality to change)
  • Refactoring or add test (improvements in base code or adds test
    coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning
    why it's required
  • I would like a code coverage CI quality gate exception and have
    explained why

Ticket Details

TT-15954
Status In Code Review
Summary Request pipeline blocked by synchronous RPC calls every 10
minutes when MDCB unavailable

Generated at: 2025-11-20 13:06:06


Co-authored-by: andrei-tyk 97896463+andrei-tyk@users.noreply.github.com


PR Type

Bug fix


Description

  • Make org session fetch non-blocking in RPC

  • Add singleflight to dedupe session fetches

  • Async refresh for org session expiry cache

  • Add tests for async behavior and RPC mode


Diagram Walkthrough

flowchart LR
  ProcReq["OrganizationMonitor.ProcessRequest"] -- "RPC mode, cache miss" --> RefreshBG["refreshOrgSession (async)"]
  ProcReq -- "non-RPC mode, miss" --> SyncFetch["OrgSession (sync)"]
  RefreshBG -- "populate cache or set OrgHasNoSession" --> Cache["SessionCache"]
  OrgExpiry["BaseMiddleware.OrgSessionExpiry"] -- "cache hit" --> ReturnExp["return cached expiry"]
  OrgExpiry -- "cache miss" --> ExpiryBG["refreshOrgSessionExpiry (async)"]
  ExpiryBG -- "found session" --> SetExp["SetOrgExpiry(DataExpires)"]
  ExpiryBG -- "not found/error" --> SetDefault["SetOrgExpiry(DEFAULT)"]
  RPCMock["Mock gorpc server"] -- "slow GetKey" --> Tests["Async RPC tests"]
Loading

File Walkthrough

Relevant files
Tests
mw_organisation_activity_test.go
Tests for async org session fetch and RPC mode                     

gateway/mw_organisation_activity_test.go

  • Add tests for async org session refresh.
  • Implement mock gorpc server for slow RPC.
  • Verify requests don't block in RPC mode.
  • Validate OrgHasNoSession handling.
+227/-0 
middleware_test.go
Tests for async org session expiry refresh                             

gateway/middleware_test.go

  • Extend OrgSessionExpiry tests for async path.
  • Add background fetch assertions with delays.
  • Cover non-existent org default behavior.
+35/-12 
Bug fix
mw_organisation_activity.go
Non-blocking org session fetch with singleflight                 

gateway/mw_organisation_activity.go

  • Introduce singleflight group for org session fetch.
  • Add async refreshOrgSession with cache populate.
  • Make RPC mode fetch non-blocking on cache miss.
  • Minor comment fix for off-thread path.
+29/-2   
middleware.go
Async org expiry refresh and emergency defaults                   

gateway/middleware.go

  • Return default on expiry cache miss immediately.
  • Add async refreshOrgSessionExpiry using singleflight.
  • Short-circuit to default in emergency mode.
  • Cache defaults when session not found.
+21/-15 
Miscellaneous
coverage.out
Add coverage report artifact                                                         

gateway/coverage.out

  • Add coverage output file.
+9299/-0

<!-- Provide a general summary of your changes in the Title above -->

## Description

<!-- Describe your changes in detail -->
Fixes request pipeline blocking when MDCB is unavailable by making org
session fetches non-blocking in RPC mode.

## Related Issue

<!-- This project only accepts pull requests related to open issues. -->
<!-- If suggesting a new feature or change, please discuss it in an
issue first. -->
<!-- If fixing a bug, there should be an issue describing it with steps
to reproduce. -->
<!-- OSS: Please link to the issue here. Tyk: please create/link the
JIRA ticket. -->
[TT-15954](https://tyktech.atlassian.net/browse/TT-15954)

## Motivation and Context

<!-- Why is this change required? What problem does it solve? -->
When MDCB is unavailable, synchronous RPC calls to fetch org sessionsin
OrganizationMonitor were blocking the request pipeline for 90-120
seconds

## How This Has Been Tested

<!-- Please describe in detail how you tested your changes -->
<!-- Include details of your testing environment, and the tests -->
<!-- you ran to see how your change affects other areas of the code,
etc. -->
<!-- This information is helpful for reviewers and QA. -->

## Screenshots (if appropriate)

## Types of changes

<!-- What types of changes does your code introduce? Put an `x` in all
the boxes that apply: -->

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Refactoring or add test (improvements in base code or adds test
coverage to functionality)

## Checklist

<!-- Go over all the following points, and put an `x` in all the boxes
that apply -->
<!-- If there are no documentation updates required, mark the item as
checked. -->
<!-- Raise up any additional concerns not covered by the checklist. -->

- [ ] I ensured that the documentation is up to date
- [ ] I explained why this PR updates go.mod in detail with reasoning
why it's required
- [ ] I would like a code coverage CI quality gate exception and have
explained why

[TT-15954]:
https://tyktech.atlassian.net/browse/TT-15954?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

<!---TykTechnologies/jira-linter starts here-->

### Ticket Details

<details>
<summary>
<a href="https://tyktech.atlassian.net/browse/TT-15954" title="TT-15954"
target="_blank">TT-15954</a>
</summary>

|         |    |
|---------|----|
| Status  | In Code Review |
| Summary | Request pipeline blocked by synchronous RPC calls every 10
minutes when MDCB unavailable |

Generated at: 2025-11-20 13:06:06

</details>

<!---TykTechnologies/jira-linter ends here-->

---------

Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>
(cherry picked from commit 30461c4)
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 2, 2025

🚨 Jira Linter Failed

Commit: 858c6b1
Failed at: 2025-12-02 10:48:04 UTC

The Jira linter failed to validate your PR. Please check the error details below:

🔍 Click to view error details
failed to validate Jira issue: jira ticket TT-15954 has status 'In Test' but must be one of: In Dev, In Code Review, Ready For Dev, Dod Check

Next Steps

  • Ensure your branch name contains a valid Jira ticket ID (e.g., ABC-123)
  • Verify your PR title matches the branch's Jira ticket ID
  • Check that the Jira ticket exists and is accessible

This comment will be automatically deleted once the linter passes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 2, 2025

API Changes

no api changes detected

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 2, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

7531 - PR Code Verified

Compliant requirements:

  • Make org session fetching non-blocking when running in RPC mode so request pipeline is not blocked if MDCB is unavailable/slow.
  • Avoid long synchronous waits (90-120s) caused by org session retrieval in OrganizationMonitor.
  • Ensure behavior remains correct when session exists, does not exist, and under concurrent requests.
  • Provide tests that validate non-blocking behavior and correctness.

Requires further human verification:

  • Validate in an environment with MDCB instability that real-world latency does not cause hidden contention beyond unit tests.
  • Confirm operational logging and observability are sufficient to troubleshoot background fetch outcomes in production.
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Concurrency

The singleflight use for org session fetches lacks context cancellation and timeout; background goroutines may stack if OrgSession itself blocks. Validate that OrgSession respects timeouts in RPC mode and that singleflight covers all call sites without memory leaks.

func (k *OrganizationMonitor) refreshOrgSession(orgID string) {
	orgSessionFetchGroup.Do(orgID, func() (interface{}, error) {
		session, found := k.OrgSession(orgID)
		if found && !k.Spec.GlobalConfig.LocalSessionCache.DisableCacheSessionState {
			sessionLifeTime := session.Lifetime(k.Spec.GetSessionLifetimeRespectsKeyExpiration(), k.Spec.SessionLifetime, k.Gw.GetConfig().ForceGlobalSessionLifetime, k.Gw.GetConfig().GlobalSessionLifetime)

			k.Gw.SessionCache.Set(orgID, session.Clone(), sessionLifeTime)
			k.Logger().Debug("Background org session fetch completed for: ", orgID)
			return session, nil
		}
		if !found {
			k.setOrgHasNoSession(true)
		}
		return nil, nil
	})
}
Behavior Change

OrgSessionExpiry now always returns default immediately on cache miss and refreshes asynchronously. Ensure downstream callers tolerate default until cache warms, and verify no regressions where immediate accurate expiry was previously expected.

func (t *BaseMiddleware) OrgSessionExpiry(orgid string) int64 {
	t.Logger().Debug("Checking: ", orgid)

	if rpc.IsEmergencyMode() {
		return DEFAULT_ORG_SESSION_EXPIRATION
	}

	// Try to get from cache first
	cachedVal, found := t.Gw.ExpiryCache.Get(orgid)
	if found {
		return cachedVal.(int64)
	}

	// Start async refresh in background
	go t.refreshOrgSessionExpiry(orgid)

	return DEFAULT_ORG_SESSION_EXPIRATION
}

func (t *BaseMiddleware) refreshOrgSessionExpiry(orgid string) {
	orgSessionExpiryCache.Do(orgid, func() (interface{}, error) {
		s, found := t.OrgSession(orgid)
		if found && t.Spec.GlobalConfig.EnforceOrgDataAge {
			t.SetOrgExpiry(orgid, s.DataExpires)
			return s.DataExpires, nil
		}
		// On failure or if not found, cache the default value
		t.SetOrgExpiry(orgid, DEFAULT_ORG_SESSION_EXPIRATION)
		return DEFAULT_ORG_SESSION_EXPIRATION, nil
	})
}
Missing Backoff/Retry

Background refresh on RPC miss has no retry/backoff or jitter; repeated misses could cause periodic load. Consider adding minimal backoff or rate limiting if OrgHasNoSession remains true.

	// try to get from Redis
	if !found {
		// not found in in-app cache, let's read from Redis
		// RPC mode: start background fetch and allow request to proceed
		if k.Spec.GlobalConfig.SlaveOptions.UseRPC {
			go k.refreshOrgSession(k.Spec.OrgID)

			return nil, http.StatusOK
		}

		// Non-RPC mode: synchronous fetch
		orgSession, found = k.OrgSession(k.Spec.OrgID)
		if !found {
			// prevent reads from in-app cache and from Redis for next runs
			k.setOrgHasNoSession(true)
			return nil, http.StatusOK
		}
	}
	clone := orgSession.Clone()
	if k.Spec.GlobalConfig.ExperimentalProcessOrgOffThread {
		// Make a copy of request before sending to goroutine
		r2 := r.WithContext(r.Context())
		return k.ProcessRequestOffThread(r2, &clone)
	}
	return k.ProcessRequestLive(r, &clone)
}

func (k *OrganizationMonitor) refreshOrgSession(orgID string) {
	orgSessionFetchGroup.Do(orgID, func() (interface{}, error) {
		session, found := k.OrgSession(orgID)
		if found && !k.Spec.GlobalConfig.LocalSessionCache.DisableCacheSessionState {
			sessionLifeTime := session.Lifetime(k.Spec.GetSessionLifetimeRespectsKeyExpiration(), k.Spec.SessionLifetime, k.Gw.GetConfig().ForceGlobalSessionLifetime, k.Gw.GetConfig().GlobalSessionLifetime)

			k.Gw.SessionCache.Set(orgID, session.Clone(), sessionLifeTime)
			k.Logger().Debug("Background org session fetch completed for: ", orgID)
			return session, nil
		}
		if !found {
			k.setOrgHasNoSession(true)
		}
		return nil, nil
	})
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Dec 2, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix org session fetch consistency

Ensure orgSessionFetchGroup.Do uses a unique key per gateway instance to avoid
cross-org collision if orgID can overlap in different contexts, and always clear the
"no session" flag when a session is found. Also, log when a session is not found to
aid diagnostics.

gateway/mw_organisation_activity.go [105-120]

 func (k *OrganizationMonitor) refreshOrgSession(orgID string) {
-	orgSessionFetchGroup.Do(orgID, func() (interface{}, error) {
+	key := "org_session_fetch:" + orgID
+	orgSessionFetchGroup.Do(key, func() (interface{}, error) {
 		session, found := k.OrgSession(orgID)
-		if found && !k.Spec.GlobalConfig.LocalSessionCache.DisableCacheSessionState {
-			sessionLifeTime := session.Lifetime(k.Spec.GetSessionLifetimeRespectsKeyExpiration(), k.Spec.SessionLifetime, k.Gw.GetConfig().ForceGlobalSessionLifetime, k.Gw.GetConfig().GlobalSessionLifetime)
-
-			k.Gw.SessionCache.Set(orgID, session.Clone(), sessionLifeTime)
+		if found {
+			// Clear the no-session flag if previously set
+			k.setOrgHasNoSession(false)
+			if !k.Spec.GlobalConfig.LocalSessionCache.DisableCacheSessionState {
+				sessionLifeTime := session.Lifetime(
+					k.Spec.GetSessionLifetimeRespectsKeyExpiration(),
+					k.Spec.SessionLifetime,
+					k.Gw.GetConfig().ForceGlobalSessionLifetime,
+					k.Gw.GetConfig().GlobalSessionLifetime,
+				)
+				k.Gw.SessionCache.Set(orgID, session.Clone(), sessionLifeTime)
+			}
 			k.Logger().Debug("Background org session fetch completed for: ", orgID)
 			return session, nil
 		}
-		if !found {
-			k.setOrgHasNoSession(true)
-		}
+		k.Logger().Debug("Background org session fetch: no session found for org: ", orgID)
+		k.setOrgHasNoSession(true)
 		return nil, nil
 	})
 }
Suggestion importance[1-10]: 6

__

Why: Using a singleflight key prefix and clearing OrgHasNoSession when a session is found improves consistency and reduces risk of stale flags; added logging is helpful. However, changing the singleflight key format may be unnecessary since orgID is already unique within a process, and could hinder dedup with existing in-flight calls.

Low
General
Prevent duplicate expiry refreshes

Guard the background call to avoid launching multiple concurrent refreshes on cache
miss, which can cause thundering herds. Use the existing orgSessionExpiryCache.Do as
the singleflight guard and avoid returning default forever if EnforceOrgDataAge is
false by caching a sensible value.

gateway/middleware.go [371-389]

 func (t *BaseMiddleware) OrgSessionExpiry(orgid string) int64 {
 	t.Logger().Debug("Checking: ", orgid)
 
 	if rpc.IsEmergencyMode() {
 		return DEFAULT_ORG_SESSION_EXPIRATION
 	}
 
-	// Try to get from cache first
-	cachedVal, found := t.Gw.ExpiryCache.Get(orgid)
-	if found {
+	if cachedVal, found := t.Gw.ExpiryCache.Get(orgid); found {
 		return cachedVal.(int64)
 	}
 
-	// Start async refresh in background
-	go t.refreshOrgSessionExpiry(orgid)
+	// Use singleflight to avoid spawning multiple refresh goroutines
+	go func(id string) {
+		orgSessionExpiryCache.Do(id, func() (interface{}, error) {
+			s, found := t.OrgSession(id)
+			if found && t.Spec.GlobalConfig.EnforceOrgDataAge {
+				t.SetOrgExpiry(id, s.DataExpires)
+				return s.DataExpires, nil
+			}
+			// If not enforcing data age, cache default to prevent repeated refresh attempts
+			t.SetOrgExpiry(id, DEFAULT_ORG_SESSION_EXPIRATION)
+			return DEFAULT_ORG_SESSION_EXPIRATION, nil
+		})
+	}(orgid)
 
 	return DEFAULT_ORG_SESSION_EXPIRATION
 }
Suggestion importance[1-10]: 5

__

Why: The idea to guard against duplicate refreshes is reasonable, but wrapping refreshOrgSessionExpiry with another singleflight.Do inside a goroutine duplicates logic already in the new refreshOrgSessionExpiry method, adding complexity without clear benefit.

Low
Avoid redundant RPC fetch spawns

In RPC mode, avoid launching redundant background fetches on every request when the
session is already being fetched or known missing. Check getOrgHasNoSession() before
spawning and deduplicate with a singleflight key to reduce goroutine churn.

gateway/mw_organisation_activity.go [78-96]

 if !found {
 	// not found in in-app cache, let's read from Redis
-	// RPC mode: start background fetch and allow request to proceed
 	if k.Spec.GlobalConfig.SlaveOptions.UseRPC {
-		go k.refreshOrgSession(k.Spec.OrgID)
-
+		// Only trigger background fetch when we don't already know there's no session
+		if !k.getOrgHasNoSession() {
+			go func(orgID string) {
+				key := "org_session_fetch:" + orgID
+				orgSessionFetchGroup.Do(key, func() (interface{}, error) {
+					k.refreshOrgSession(orgID)
+					return nil, nil
+				})
+			}(k.Spec.OrgID)
+		}
 		return nil, http.StatusOK
 	}
 
 	// Non-RPC mode: synchronous fetch
 	orgSession, found = k.OrgSession(k.Spec.OrgID)
 	if !found {
-		// prevent reads from in-app cache and from Redis for next runs
 		k.setOrgHasNoSession(true)
 		return nil, http.StatusOK
 	}
 }
Suggestion importance[1-10]: 4

__

Why: Avoiding redundant RPC fetches is a good goal, but the proposed code nests singleflight.Do around a call that itself uses singleflight, and changes the dedup key, which is unnecessary and may prevent dedup. Checking getOrgHasNoSession() before spawning is fine but offers limited impact since current refreshOrgSession already sets the flag.

Low

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Dec 2, 2025

Quality Gate Failed Quality Gate failed

Failed conditions
73.9% Coverage on New Code (required ≥ 80%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

@maciejwojciechowski maciejwojciechowski merged commit 4289fb1 into release-5.8 Dec 2, 2025
38 of 41 checks passed
@maciejwojciechowski maciejwojciechowski deleted the merge/release-5.8/30461c446125a55f6b44798aed33183c46954964/TT-15954 branch December 2, 2025 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants