
[TT-15954]: Make org session fetch non-blocking #7531

Merged
lghiur merged 10 commits into master from TT-15954-mdcb-stop-blocking
Nov 20, 2025

Conversation

@imogenkraak
Contributor

@imogenkraak imogenkraak commented Nov 10, 2025

Description

Fixes request pipeline blocking when MDCB is unavailable by making org session fetches non-blocking in RPC mode.

Related Issue

TT-15954

Motivation and Context

When MDCB is unavailable, synchronous RPC calls to fetch org sessions in OrganizationMonitor were blocking the request pipeline for 90-120 seconds.

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

Ticket Details

TT-15954
Status In Code Review
Summary Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable

Generated at: 2025-11-20 13:06:06

@github-actions
Contributor

github-actions Bot commented Nov 10, 2025

🎯 Recommended Merge Targets

Based on JIRA ticket TT-15954: Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable

Fix Version: Tyk 5.11.0

⚠️ Warning: Expected release branches not found in repository

Required:

  • master - No matching release branches found. Fix will be included in future releases.

Fix Version: Tyk 5.8.9

Required:

  • release-5.8 - Minor version branch for 5.8.x patches - required for creating Tyk 5.8.9
  • master - Main development branch - ensures fix is in all future releases

📋 Workflow

  1. Merge this PR to master first

  2. Cherry-pick to release branches by commenting on the merged PR:

    • /release to release-5.8
  3. Automated backport - The bot will automatically create backport PRs to the specified release branches

@probelabs
Contributor

probelabs Bot commented Nov 10, 2025

🔍 Code Analysis Results

This PR addresses a critical performance issue where the request pipeline could be blocked for extended periods if the Multi-Data Center Bridge (MDCB) was unavailable. The fix makes the fetching of organization session data non-blocking when the gateway is operating in RPC mode, significantly improving resilience and availability.

Files Changed Analysis

  • gateway/mw_organisation_activity.go: The core logic is updated in OrganizationMonitor.ProcessRequest. In RPC mode, if an organization's session is not found in the local cache, the request is now allowed to proceed immediately. A background goroutine (refreshOrgSession) is spawned to fetch the session data asynchronously. A singleflight.Group is introduced to coalesce concurrent fetches for the same organization, preventing the "thundering herd" problem.
  • gateway/middleware.go: The OrgSessionExpiry method is similarly refactored. On a cache miss, it now returns a default expiry value and triggers an asynchronous background refresh, ensuring that session expiry checks do not block the request path.
  • gateway/mw_organisation_activity_test.go & gateway/middleware_test.go: The test suites have been substantially improved. New unit tests verify the asynchronous behavior by checking for immediate returns on cache misses and eventual cache population. A new integration test, TestOrganizationMonitor_AsyncRPCMode, has been added, which uses a mock RPC server to simulate MDCB latency and confirm that requests are not blocked, providing strong validation for the fix.
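
The coalescing behaviour described above can be illustrated with a small sketch. The PR itself is reported to use `singleflight.Group` from `golang.org/x/sync/singleflight`; the code below is not that implementation but a standard-library stand-in that only deduplicates in-flight fetches (it does not share the result with waiting callers as singleflight does). The names `flightGroup`, `tryStart`, and `countFetches` are hypothetical and chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flightGroup is a stdlib-only stand-in for singleflight.Group (hypothetical):
// it allows at most one in-flight fetch per key.
type flightGroup struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

// tryStart reports whether the caller won the right to fetch key.
func (g *flightGroup) tryStart(key string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.inFlight == nil {
		g.inFlight = make(map[string]bool)
	}
	if g.inFlight[key] {
		return false
	}
	g.inFlight[key] = true
	return true
}

// done marks the fetch for key as finished.
func (g *flightGroup) done(key string) {
	g.mu.Lock()
	delete(g.inFlight, key)
	g.mu.Unlock()
}

// countFetches fires n concurrent refresh attempts for one org and returns
// how many of them actually reached the simulated slow backend.
func countFetches(n int) int64 {
	var (
		g        flightGroup
		calls    int64
		attempts sync.WaitGroup
		fetches  sync.WaitGroup
	)
	release := make(chan struct{})

	refresh := func(orgID string) {
		if !g.tryStart(orgID) {
			return // another goroutine is already fetching this org
		}
		fetches.Add(1)
		go func() {
			defer fetches.Done()
			defer g.done(orgID)
			atomic.AddInt64(&calls, 1)
			<-release // simulated slow backend call
		}()
	}

	attempts.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer attempts.Done()
			refresh("org-1")
		}()
	}
	attempts.Wait() // every caller has tried while the first fetch is still pending
	close(release)
	fetches.Wait()
	return atomic.LoadInt64(&calls)
}

func main() {
	fmt.Println(countFetches(50)) // prints 1
}
```

Because the winning fetch stays pending until every caller has attempted a refresh, exactly one backend call is made regardless of `n`.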

Architecture & Impact Assessment

  • What this PR accomplishes: It decouples the request processing pipeline from the real-time availability of the MDCB. This prevents synchronous RPC calls from blocking gateway workers during an MDCB outage, thereby enhancing system availability and fault tolerance.

  • Key technical changes introduced:

    1. Asynchronous Fetching: Replaces blocking, synchronous RPC calls for organization sessions with a non-blocking, asynchronous pattern in RPC mode.
    2. Fail-Open on Cache Miss: The system now temporarily operates with default values when a session is not cached, prioritizing availability. The first request for an uncached organization will bypass its policies (e.g., quotas) until the session is fetched.
    3. Request Coalescing: Implements golang.org/x/sync/singleflight to ensure only one background fetch is initiated per organization, even under high concurrent load.
  • Affected system components: The change primarily impacts the Tyk Gateway's middleware layer, specifically the OrganizationMonitor and any middleware that depends on organization session data (e.g., for authentication, quotas, rate limits). The behavior change is confined to deployments where RPC mode (SlaveOptions.UseRPC) is enabled.
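
A minimal sketch of the fail-open pattern summarised in the list above, assuming a simplified in-memory cache and a quota counter. The types `session` and `monitor` and the function `demo` are hypothetical and do not reflect the gateway's actual structures; the real change also coalesces fetches, which this sketch omits.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for an organization session.
type session struct{ quotaRemaining int }

// monitor holds a local session cache and a (potentially slow) fetch function.
type monitor struct {
	mu    sync.RWMutex
	cache map[string]*session
	fetch func(orgID string) *session
	wg    sync.WaitGroup // only used so the demo can wait for background refreshes
}

// processRequest reports whether the request may proceed.
func (m *monitor) processRequest(orgID string) bool {
	m.mu.RLock()
	s, ok := m.cache[orgID]
	m.mu.RUnlock()

	if !ok {
		// Cache miss: let the request through and refresh in the background.
		// Org-level quota is NOT enforced for this request.
		m.wg.Add(1)
		go func() {
			defer m.wg.Done()
			fetched := m.fetch(orgID)
			m.mu.Lock()
			m.cache[orgID] = fetched
			m.mu.Unlock()
		}()
		return true
	}

	// Cache hit: enforce the quota.
	m.mu.Lock()
	defer m.mu.Unlock()
	if s.quotaRemaining <= 0 {
		return false
	}
	s.quotaRemaining--
	return true
}

// demo runs three requests for one org whose session allows a single request.
func demo() []bool {
	m := &monitor{
		cache: map[string]*session{},
		fetch: func(string) *session { return &session{quotaRemaining: 1} },
	}
	first := m.processRequest("org-1")  // miss: allowed without a quota check
	m.wg.Wait()                         // background refresh completes
	second := m.processRequest("org-1") // hit: consumes the only quota unit
	third := m.processRequest("org-1")  // hit: quota exhausted
	return []bool{first, second, third}
}

func main() {
	fmt.Println(demo()) // prints [true true false]
}
```

The first request is admitted even though the session would only have allowed one request in total, which is the window the review comments discuss.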

Org Session Fetch Flow

sequenceDiagram
    participant Client
    participant Gateway
    participant OrganizationMonitor
    participant MDCB

    Client->>Gateway: API Request

    box Old (Blocking Flow)
        Gateway->>OrganizationMonitor: ProcessRequest (Cache Miss)
        OrganizationMonitor->>MDCB: GetOrgSession (Blocking RPC call)
        Note over OrganizationMonitor, MDCB: Request pipeline is blocked here if MDCB is down
        MDCB-->>OrganizationMonitor: Session Data / Timeout
        OrganizationMonitor-->>Gateway: Continue/Block Request
        Gateway-->>Client: API Response / Error
    end

    box New (Non-Blocking Flow in RPC Mode)
        Gateway->>OrganizationMonitor: ProcessRequest (Cache Miss)
        OrganizationMonitor-->>Gateway: Return immediately (non-blocking)
        Gateway-->>Client: API Response (processed without org policy)
        par Background Fetch
            OrganizationMonitor->>MDCB: GetOrgSession
            MDCB-->>OrganizationMonitor: Session Data / Timeout
            OrganizationMonitor->>OrganizationMonitor: Update local cache for subsequent requests
        end
    end

Scope Discovery & Context Expansion

The change introduces a deliberate architectural trade-off: prioritizing availability over immediate consistency. By adopting a "fail-open" strategy for uncached sessions during an MDCB outage, the first request will bypass all organization-level policies, such as quotas and rate limits. Subsequent requests will use the correct session data once it has been fetched and cached. This behavior is isolated to gateways configured in RPC mode and is a strategic choice to enhance resilience against backend failures.

The impact of this change is broad within the gateway, affecting any feature that relies on OrgSession, including key authentication, OAuth, JWT, OIDC, and other mechanisms that enforce organization-level rules. The use of singleflight.Group is a critical safeguard to prevent overwhelming the backend with redundant requests when a popular organization's session expires from the cache.

Metadata
  • Review Effort: 3 / 5
  • Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2025-11-19T12:17:47.227Z | Triggered by: synchronize | Commit: e2cbe2f

💡 TIP: You can chat with Visor using /visor ask <your question>

@probelabs
Contributor

probelabs Bot commented Nov 10, 2025

🔍 Code Analysis Results

Security Issues (1)

Severity Location Issue
🟠 Error gateway/mw_organisation_activity.go:84
In RPC mode, if an organization's session is not found in the cache, the request is allowed to proceed without applying any organization-level rate limits or quotas. This creates a window of vulnerability where these controls are temporarily bypassed, potentially allowing for resource exhaustion or denial-of-service attacks against upstream services.
💡 Suggestion: Do not allow the request to proceed without any checks. Instead, consider applying a temporary, restrictive default quota, or use the last known session data if available (even if stale), to ensure that security policies are always enforced. A complete bypass, even for a short time, undermines the purpose of organization-level quotas.

Architecture Issues (1)

Severity Location Issue
🟡 Warning gateway/mw_organisation_activity.go:107-121
The PR introduces two separate, uncoordinated mechanisms for asynchronously fetching organization sessions: `refreshOrgSession` in `OrganizationMonitor` and `refreshOrgSessionExpiry` in `BaseMiddleware` (gateway/middleware.go). Each uses its own `singleflight.Group` (`orgSessionFetchGroup` and `orgSessionExpiryCache` respectively). This duplicates logic and creates a potential race condition where two concurrent fetches for the same organization ID can be initiated if an API reload coincides with a live request. This could cause unnecessary load on the MDCB.
💡 Suggestion: Refactor the logic into a single, centralized service or function responsible for asynchronously fetching and caching organization sessions. This function should use a single `singleflight.Group` to ensure only one fetch is in-flight per organization ID. Both `OrgSessionExpiry` and `OrganizationMonitor.ProcessRequest` should then use this centralized function to avoid redundant fetches and logic.

✅ Performance Check Passed

No performance issues found – changes LGTM.

Quality Issues (2)

Severity Location Issue
🟡 Warning gateway/middleware_test.go:69
The test relies on a fixed `time.Sleep` to wait for a background goroutine to complete. This can lead to flaky tests if the background operation takes longer than the sleep duration, for example on a heavily loaded CI runner. This pattern is used in multiple places in the new tests.
💡 Suggestion: To improve test reliability, replace fixed sleeps with a more deterministic synchronization mechanism. For unit tests, this might involve modifying the code to accept a `sync.WaitGroup` that can be used to signal completion. For integration tests, a better approach is to poll with a timeout for the expected state (e.g., the cache being populated) rather than waiting for a fixed duration.
🟡 Warning gateway/mw_organisation_activity.go:95-98
When an organization session is not found in the cache in RPC mode, the request proceeds while the session is fetched asynchronously. This means that for this request, and any others that arrive before the fetch completes, organization-level policies like quotas and rate limits are not enforced. This behavior is a deliberate trade-off for availability but is not obvious from the code alone.
💡 Suggestion: Add a code comment to explicitly state this trade-off. For example: `// Allow request to proceed while org session is fetched in the background. // Note: Organization-level rate limits and quotas will not be enforced until the fetch is complete.`

✅ Dependency Check Passed

No dependency issues found – changes LGTM.


Powered by Visor from Probelabs

Last updated: 2025-11-19T12:17:48.647Z | Triggered by: synchronize | Commit: e2cbe2f

💡 TIP: You can chat with Visor using /visor ask <your question>

@imogenkraak imogenkraak force-pushed the TT-15954-mdcb-stop-blocking branch from 1b71e10 to d935b4d Compare November 11, 2025 10:20
@imogenkraak imogenkraak marked this pull request as ready for review November 11, 2025 10:44
@github-actions
Contributor

API Changes

no api changes detected

@lghiur
Collaborator

lghiur commented Nov 12, 2025

/review is there a way to write integration tests for this?

@probelabs
Contributor

probelabs Bot commented Nov 12, 2025

🔍 Code Analysis Results

Yes, it's possible to write an integration test for this change. The existing test framework is flexible enough to support it, although it would require setting up a mock gorpc server to simulate the MDCB backend.

Here is the high-level approach you could take:

  1. Configure the Gateway for RPC Mode in the Test: The gateway.StartTest helper accepts a configuration function. You can use this to enable RPC mode and point it to a local mock server.

    // In your test setup (needs the github.com/TykTechnologies/tyk/config import)
    conf := func(globalConf *config.Config) {
        globalConf.SlaveOptions.UseRPC = true
        // Point to the mock MDCB server started by the test
        globalConf.SlaveOptions.ConnectionString = "127.0.0.1:9090"
        // Use a short timeout (in seconds) to make the test fast
        globalConf.SlaveOptions.CallTimeout = 1
    }
    ts := StartTest(conf)
    defer ts.Close()

  2. Create a Mock MDCB gorpc Server: The test would need to spin up a gorpc server that listens on 127.0.0.1:9090. This server would need to implement the methods that the RPCStorageHandler calls, such as GetKey. You can program this mock server to simulate an MDCB failure by having it hang or return errors.

  3. Write the Test Case: The test would make a request to an API that triggers the OrganizationMonitor. With the mock MDCB server unresponsive, you can assert that the request doesn't block and returns immediately with an http.StatusOK, as expected from the changes in this PR.

This approach would validate the non-blocking behavior through the entire stack, from the middleware to the RPC client, providing higher confidence than the current unit tests that rely on time.Sleep.


Powered by Visor from Probelabs

Last updated: 2025-11-12T12:02:11.384Z | Triggered by: comment by @lghiur

💡 TIP: You can chat with Visor using /visor ask <your question>

@imogenkraak imogenkraak force-pushed the TT-15954-mdcb-stop-blocking branch from c258e5d to a436b30 Compare November 14, 2025 11:33
@imogenkraak imogenkraak force-pushed the TT-15954-mdcb-stop-blocking branch from 3847a83 to 4a5ebb7 Compare November 14, 2025 12:16
@andrei-tyk andrei-tyk enabled auto-merge (squash) November 19, 2025 12:07
@andrei-tyk
Contributor

The probe comments are not necessarily a problem: rate limits are monitored from a data plane perspective, not a control plane one, so MDCB going down should not open any vulnerability to attack. From a code coverage perspective, this functionality is more related to an integrated flow, so unit test coverage of 70%+ should be more than enough for this to be merged.

@sonarqubecloud

Quality Gate failed

Failed conditions
73.9% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@lghiur lghiur disabled auto-merge November 20, 2025 15:47
@lghiur lghiur merged commit 30461c4 into master Nov 20, 2025
47 of 48 checks passed
@lghiur lghiur deleted the TT-15954-mdcb-stop-blocking branch November 20, 2025 15:47
@maciejwojciechowski
Contributor

/release to release-5.8

probelabs Bot pushed a commit that referenced this pull request Dec 2, 2025
## Description

Fixes request pipeline blocking when MDCB is unavailable by making org
session fetches non-blocking in RPC mode.

## Related Issue

[TT-15954](https://tyktech.atlassian.net/browse/TT-15954)

## Motivation and Context

When MDCB is unavailable, synchronous RPC calls to fetch org sessions in
OrganizationMonitor were blocking the request pipeline for 90-120
seconds.

## How This Has Been Tested

## Screenshots (if appropriate)

## Types of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Refactoring or add test (improvements in base code or adds test
coverage to functionality)

## Checklist

- [ ] I ensured that the documentation is up to date
- [ ] I explained why this PR updates go.mod in detail with reasoning
why it's required
- [ ] I would like a code coverage CI quality gate exception and have
explained why

[TT-15954]:
https://tyktech.atlassian.net/browse/TT-15954?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

### Ticket Details

<details>
<summary>
<a href="https://tyktech.atlassian.net/browse/TT-15954" title="TT-15954"
target="_blank">TT-15954</a>
</summary>

|         |    |
|---------|----|
| Status  | In Code Review |
| Summary | Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable |

Generated at: 2025-11-20 13:06:06

</details>

---------

Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>
(cherry picked from commit 30461c4)
@probelabs
Contributor

probelabs Bot commented Dec 2, 2025

✅ Cherry-pick successful. A PR was created: #7582

@maciejwojciechowski
Contributor

/release to release-5.8.9

probelabs Bot pushed a commit that referenced this pull request Dec 2, 2025
Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>
(cherry picked from commit 30461c4)
@probelabs
Contributor

probelabs Bot commented Dec 2, 2025

✅ Cherry-pick successful. A PR was created: #7583

maciejwojciechowski pushed a commit that referenced this pull request Dec 2, 2025
…ng (#7531) (#7582)

### **User description**
[TT-15954]: Make org session fetch non-blocking (#7531)


Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>

[TT-15954]:
https://tyktech.atlassian.net/browse/TT-15954?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ


___

### **PR Type**
Bug fix


___

### **Description**
- Make org session fetch non-blocking in RPC

- Add singleflight to dedupe session fetches

- Async refresh for org session expiry cache

- Add tests for async behavior and RPC mode


___

### Diagram Walkthrough


```mermaid
flowchart LR
  ProcReq["OrganizationMonitor.ProcessRequest"] -- "RPC mode, cache miss" --> RefreshBG["refreshOrgSession (async)"]
  ProcReq -- "non-RPC mode, miss" --> SyncFetch["OrgSession (sync)"]
  RefreshBG -- "populate cache or set OrgHasNoSession" --> Cache["SessionCache"]
  OrgExpiry["BaseMiddleware.OrgSessionExpiry"] -- "cache hit" --> ReturnExp["return cached expiry"]
  OrgExpiry -- "cache miss" --> ExpiryBG["refreshOrgSessionExpiry (async)"]
  ExpiryBG -- "found session" --> SetExp["SetOrgExpiry(DataExpires)"]
  ExpiryBG -- "not found/error" --> SetDefault["SetOrgExpiry(DEFAULT)"]
  RPCMock["Mock gorpc server"] -- "slow GetKey" --> Tests["Async RPC tests"]
```
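
The `OrgSessionExpiry` branch of the diagram (cached value on a hit; default value plus background refresh on a miss; default cached when the session is not found) can be sketched as follows. This is an illustration under stated assumptions, not the gateway's code: `expiryCache`, `lookup`, and the `defaultExpiry` value are hypothetical, and the singleflight deduplication used in the PR is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultExpiry is a hypothetical default, in seconds (7 days).
const defaultExpiry int64 = 604800

// expiryCache caches per-org expiry values and refreshes them off the request path.
type expiryCache struct {
	mu     sync.RWMutex
	values map[string]int64
	lookup func(orgID string) (dataExpires int64, found bool) // slow call in real life
	wg     sync.WaitGroup                                     // lets the demo wait for refreshes
}

// orgSessionExpiry returns the cached expiry, or the default on a cache miss
// while a background refresh populates the cache.
func (c *expiryCache) orgSessionExpiry(orgID string) int64 {
	c.mu.RLock()
	v, ok := c.values[orgID]
	c.mu.RUnlock()
	if ok {
		return v
	}
	c.wg.Add(1)
	go c.refresh(orgID)
	return defaultExpiry
}

// refresh stores the looked-up expiry, or the default if no session exists.
func (c *expiryCache) refresh(orgID string) {
	defer c.wg.Done()
	v, found := c.lookup(orgID)
	if !found {
		v = defaultExpiry
	}
	c.mu.Lock()
	c.values[orgID] = v
	c.mu.Unlock()
}

// demoExpiry exercises the miss, hit, and not-found paths.
func demoExpiry() (miss, hit, unknown int64) {
	c := &expiryCache{
		values: map[string]int64{},
		lookup: func(id string) (int64, bool) {
			if id == "org-1" {
				return 3600, true
			}
			return 0, false
		},
	}
	miss = c.orgSessionExpiry("org-1") // default returned immediately
	c.wg.Wait()
	hit = c.orgSessionExpiry("org-1") // value written by the background refresh
	c.orgSessionExpiry("unknown")
	c.wg.Wait()
	unknown = c.orgSessionExpiry("unknown") // default, now served from the cache
	return miss, hit, unknown
}

func main() {
	fmt.Println(demoExpiry()) // prints 604800 3600 604800
}
```

Caching the default for unknown orgs means repeated requests for an org without a session do not trigger a new backend lookup each time.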



<details> <summary><h3> File Walkthrough</h3></summary>

<table><thead><tr><th></th><th align="left">Relevant
files</th></tr></thead><tbody><tr><td><strong>Tests</strong></td><td><table>
<tr>
  <td>
    <details>

<summary><strong>mw_organisation_activity_test.go</strong><dd><code>Tests
for async org session fetch and RPC mode</code>&nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary>
<hr>

gateway/mw_organisation_activity_test.go

<ul><li>Add tests for async org session refresh.<br> <li> Implement mock
gorpc server for slow RPC.<br> <li> Verify requests don't block in RPC
mode.<br> <li> Validate OrgHasNoSession handling.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7582/files#diff-b3bbd18e384b7f03f44c0d6c9a5205e8acdd117029e1e73412089191ec8e833a">+227/-0</a>&nbsp;
</td>

</tr>

<tr>
  <td>
    <details>
<summary><strong>middleware_test.go</strong><dd><code>Tests for async org session expiry refresh</code></dd></summary>
<hr>

gateway/middleware_test.go

<ul><li>Extend OrgSessionExpiry tests for async path.<br> <li> Add
background fetch assertions with delays.<br> <li> Cover non-existent org
default behavior.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7582/files#diff-6a09a08e3f82cc5e9d8c6b5c8426d75ea1e5d85e15ab008fca1f512e7c49c1e6">+35/-12</a>&nbsp;
</td>

</tr>
</table></td></tr><tr><td><strong>Bug fix</strong></td><td><table>
<tr>
  <td>
    <details>

<summary><strong>mw_organisation_activity.go</strong><dd><code>Non-blocking org session fetch with singleflight</code></dd></summary>
<hr>

gateway/mw_organisation_activity.go

<ul><li>Introduce singleflight group for org session fetch.<br> <li> Add
async refreshOrgSession with cache populate.<br> <li> Make RPC mode
fetch non-blocking on cache miss.<br> <li> Minor comment fix for
off-thread path.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7582/files#diff-26dd955903317b085be06642ae3e76fe41c8c53844d8758a1a1c8bd05b0110a2">+29/-2</a>&nbsp;
&nbsp; </td>

</tr>

<tr>
  <td>
    <details>
<summary><strong>middleware.go</strong><dd><code>Async org expiry refresh and emergency defaults</code></dd></summary>
<hr>

gateway/middleware.go

<ul><li>Return default on expiry cache miss immediately.<br> <li> Add
async refreshOrgSessionExpiry using singleflight.<br> <li> Short-circuit
to default in emergency mode.<br> <li> Cache defaults when session not
found.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7582/files#diff-703054910891a4db633eca0f42ed779d6b4fa75cd9b3aa4c503e681364201c1b">+21/-15</a>&nbsp;
</td>

</tr>
</table></td></tr><tr><td><strong>Miscellaneous</strong></td><td><table>
<tr>
  <td>
    <details>
<summary><strong>coverage.out</strong><dd><code>Add coverage report artifact</code></dd></summary>
<hr>

gateway/coverage.out

- Add coverage output file.


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7582/files#diff-09774255d9a84e7fadb3d7b29c523e342197d9e6cb340482bce64a09425eca0f">+9299/-0</a></td>

</tr>
</table></td></tr></tbody></table>

</details>

___

Co-authored-by: imogenkraak <162994391+imogenkraak@users.noreply.github.com>
Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>
maciejwojciechowski pushed a commit that referenced this pull request Dec 2, 2025
…king (#7531) (#7583)

### **User description**
[TT-15954]: Make org session fetch non-blocking (#7531)


## Description

Fixes request pipeline blocking when MDCB is unavailable by making org
session fetches non-blocking in RPC mode.

## Related Issue

[TT-15954](https://tyktech.atlassian.net/browse/TT-15954)

## Motivation and Context

When MDCB is unavailable, synchronous RPC calls to fetch org sessions in
OrganizationMonitor were blocking the request pipeline for 90-120
seconds.

## How This Has Been Tested


## Screenshots (if appropriate)

## Types of changes


- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Refactoring or add test (improvements in base code or adds test
coverage to functionality)

## Checklist


- [ ] I ensured that the documentation is up to date
- [ ] I explained why this PR updates go.mod in detail with reasoning
why it's required
- [ ] I would like a code coverage CI quality gate exception and have
explained why


[TT-15954]: https://tyktech.atlassian.net/browse/TT-15954?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ






<!---TykTechnologies/jira-linter starts here-->

### Ticket Details

<details>
<summary>
<a href="https://tyktech.atlassian.net/browse/TT-15954" title="TT-15954"
target="_blank">TT-15954</a>
</summary>

|         |    |
|---------|----|
| Status  | In Code Review |
| Summary | Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable |

Generated at: 2025-11-20 13:06:06

</details>

<!---TykTechnologies/jira-linter ends here-->

---------

Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>

[TT-15954]:
https://tyktech.atlassian.net/browse/TT-15954?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ


___

### **PR Type**
Bug fix


___

### **Description**
- Make org session fetch non-blocking in RPC mode

- Add singleflight to dedupe session fetches

- Async refresh for org expiry cache misses

- Add tests for async, non-blocking behavior


___

### Diagram Walkthrough


```mermaid
flowchart LR
  A["ProcessRequest (OrganizationMonitor)"] -- "RPC mode & no cache" --> B["refreshOrgSession (async)"]
  A -- "Non-RPC & no cache" --> C["OrgSession (sync)"]
  D["OrgSessionExpiry (BaseMiddleware)"] -- "cache miss" --> E["refreshOrgSessionExpiry (async)"]
  F["singleflight.Group"] -- "dedupe fetches" --> B
```



<details> <summary><h3> File Walkthrough</h3></summary>

<table><thead><tr><th></th><th align="left">Relevant
files</th></tr></thead><tbody><tr><td><strong>Bug
fix</strong></td><td><table>
<tr>
  <td>
    <details>
<summary><strong>mw_organisation_activity.go</strong><dd><code>Async org session fetch with singleflight</code></dd></summary>
<hr>

gateway/mw_organisation_activity.go

<ul><li>Introduce singleflight for org session fetches.<br> <li> In RPC
mode, fetch org session asynchronously.<br> <li> Add background cache
population via refresh function.<br> <li> Minor comment fix and flow
adjustments.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7583/files#diff-26dd955903317b085be06642ae3e76fe41c8c53844d8758a1a1c8bd05b0110a2">+29/-2</a>&nbsp;
&nbsp; </td>

</tr>

<tr>
  <td>
    <details>
<summary><strong>middleware.go</strong><dd><code>Async org expiry refresh and non-blocking path</code></dd></summary>
<hr>

gateway/middleware.go

<ul><li>Make OrgSessionExpiry return default on miss.<br> <li> Trigger
async refresh on cache miss.<br> <li> Avoid blocking and handle
emergency mode.<br> <li> Cache default when session not found.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7583/files#diff-703054910891a4db633eca0f42ed779d6b4fa75cd9b3aa4c503e681364201c1b">+21/-15</a>&nbsp;
</td>

</tr>
</table></td></tr><tr><td><strong>Tests</strong></td><td><table>
<tr>
  <td>
    <details>

<summary><strong>mw_organisation_activity_test.go</strong><dd><code>Tests for async org session and RPC behavior</code></dd></summary>
<hr>

gateway/mw_organisation_activity_test.go

<ul><li>Add tests for refreshOrgSession behavior.<br> <li> Add RPC-mode
async non-blocking request tests.<br> <li> Implement mock RPC server
simulating delays.<br> <li> Verify cache population and no-session
flagging.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7583/files#diff-b3bbd18e384b7f03f44c0d6c9a5205e8acdd117029e1e73412089191ec8e833a">+227/-0</a>&nbsp;
</td>

</tr>

<tr>
  <td>
    <details>
<summary><strong>middleware_test.go</strong><dd><code>Tests for async org expiry refresh</code></dd></summary>
<hr>

gateway/middleware_test.go

<ul><li>Add tests for async expiry refresh flow.<br> <li> Validate
cached value, default on miss.<br> <li> Ensure default retained for
missing org.</ul>


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7583/files#diff-6a09a08e3f82cc5e9d8c6b5c8426d75ea1e5d85e15ab008fca1f512e7c49c1e6">+35/-12</a>&nbsp;
</td>

</tr>
</table></td></tr><tr><td><strong>Miscellaneous</strong></td><td><table>
<tr>
  <td>
    <details>
<summary><strong>coverage.out</strong><dd><code>Add coverage output artifact</code></dd></summary>
<hr>

gateway/coverage.out

- Add coverage report artifact to repo.


</details>


  </td>
<td><a
href="https://github.com/TykTechnologies/tyk/pull/7583/files#diff-09774255d9a84e7fadb3d7b29c523e342197d9e6cb340482bce64a09425eca0f">+9299/-0</a></td>

</tr>
</table></td></tr></tbody></table>

</details>

___

Co-authored-by: imogenkraak <162994391+imogenkraak@users.noreply.github.com>
Co-authored-by: andrei-tyk <97896463+andrei-tyk@users.noreply.github.com>