Skip to content

perf: skip prod Docker steps on PRs, use larger runner#8057

Closed
buger wants to merge 12 commits intorelease-5.12.1from
perf/goreleaser-optimization
Closed

perf: skip prod Docker steps on PRs, use larger runner#8057
buger wants to merge 12 commits intorelease-5.12.1from
perf/goreleaser-optimization

Conversation

@buger
Copy link
Copy Markdown
Member

@buger buger commented Apr 16, 2026

Summary

Built on top of #8029 to test goreleaser performance optimizations before rolling them out via gromit.

Commit 1: skip prod Docker steps, larger runner

  • Larger runner: vars.BUILD_RUNNER || vars.DEFAULT_RUNNER — set BUILD_RUNNER=warp-8x-x64 in repo variables to use 8 vCPUs for goreleaser (currently 4)
  • Skip prod Docker steps on PRs: tag metadata, prod build-push, and cache export for all 3 variants (ee, fips, std) are now completely skipped on non-tag builds
  • Remove redundant cache-to: prod steps no longer export GHA cache (CI steps already do)

Commit 2: skip fips/std CI Docker pushes on PRs, amd64-only for ee

  • Skip fips and std Docker metadata + CI push steps on PRs since no downstream PR gate job consumes those ECR images
  • Build ee Docker image as amd64-only on PRs since downstream tests only run on amd64
  • Full multi-arch builds still run on branch pushes and tag pushes

Expected savings

Change Estimated savings
Skip 3 prod Docker steps (build + SBOM + provenance + cache export) ~9 min
Remove 3 redundant cache exports ~4-8 min
Skip fips + std CI Docker pushes on PRs ~6-10 min
amd64-only ee CI image on PRs (skip arm64 + s390x) ~4-6 min
Larger runner (if BUILD_RUNNER set) ~10-15 min

CI/ECR ee image still builds on every PR (amd64-only) — fips and std CI images are skipped on PRs entirely. All images build fully on branch pushes and tag pushes.

Test plan

Gromit template PR: TykTechnologies/gromit#453

🤖 Generated with Claude Code

andrei-tyk and others added 8 commits April 15, 2026 15:47
<!-- Provide a general summary of your changes in the Title above -->

## Description

Fix a critical regression introduced in TT-16265 (unified path matching,
released in 5.12.0/5.8.12) where `validateRequest` and `mockResponse`
middleware on a parameterised OAS path (e.g. `/employees/{id}`)
incorrectly fires for requests to a static path (e.g.
`/employees/static`) that does not have that middleware configured.

The fix applies two changes at API load time (no request-time changes):

**Layer 1 — Path ordering:** Both `compileOASValidateRequestPathSpec`
and `compileOASMockResponsePathSpec` now sort their output using the
same algorithm as `oasutil.SortByPathLength` (used by Classic
middleware). This ensures deterministic ordering across gateway restarts
and places static paths before parameterised paths in the `rxPaths`
list. Previously, these functions iterated a Go map, producing
non-deterministic ordering.

**Option D — Synthetic shield entries:** For every static path in the
OAS spec that does NOT have `validateRequest` (or `mockResponse`)
configured, a disabled entry (`Enabled: false`) is injected into the
path list. Because static entries are sorted before parameterised ones,
`FindSpecMatchesStatus` matches the shield first. The middleware sees
`Enabled == false` and short-circuits — preventing the parameterised
regex from cross-matching.

A shared helper `addStaticPathShields` eliminates duplication between
the two compile functions. The sort comparator `PathLess` was extracted
from `oasutil.SortByPathLength` into an exported function, ensuring both
Classic and OAS middleware use the exact same ordering logic with no
risk of drift.

## Related Issue

TT-16890

Related tickets:
- TT-16265 (introduced the regression by unifying path matching)
- TT-16292 (OAS path parameter normalization — separate issue, not
addressed here)

## Motivation and Context

This regression is blocking a customer from upgrading to v5.12.x. In
5.11.x, `validateRequest` used the OAS router (kin-openapi) which
naturally prioritised static paths over parameterised paths. In 5.12.0,
it was moved to the standard regex-based path list to fix
inconsistencies with gateway path matching config. However, the regex
`([^/]+)` compiled from `{id}` matches any single-segment value
including `static`, causing cross-matching.

Additionally, the path list for OAS-specific middleware was built by
iterating a Go map without sorting, making the order non-deterministic
across gateway restarts — inconsistent with Classic middleware which
always sorts paths.

## How This Has Been Tested

**Unit tests:**
- `TestSortURLSpecsByPathPriority` — 5 table-driven cases verifying sort
correctness (static before parameterised, segment count, length,
alphabetical tiebreaker, mixed paths)

**Integration tests:**
- `TestStaticPathTakesPrecedenceOverParameterised` — Exact customer
scenario: `/employees/static` (no validateRequest) alongside
`/employees/{id}` (validateRequest with pattern `^[a-zA-Z]+$`).
Verifies:
  - `GET /employees/static` → 200 (validateRequest does NOT fire)
  - `GET /employees/john` → 200 (validateRequest fires, passes pattern)
  - `GET /employees/123` → 422 (validateRequest fires, fails pattern)
- `TestMockResponseStaticPathPriority` — Same pattern for mockResponse:
`/items/special` (no mock) alongside `/items/{id}` (mock enabled).
Verifies static path does not return mock body.
- `TestStaticPathPriorityWithPrefixMatching` — Both paths have
validateRequest with different validators (header vs path param). With
`EnablePathPrefixMatching` enabled, verifies each path matches its own
validator.

**Benchmarks:**
- `BenchmarkSortURLSpecsByPathPriority` — Sort overhead at 10/50/200
paths (< 0.5ms even at 200 paths, runs only at API load time)
- `BenchmarkOASValidateRequestStaticVsParameterized` — Request-time
performance with 52 OAS paths: ~2.5µs/op, 26 allocs/op for both static
(shield hit) and parameterised (validation hit) paths — no measurable
overhead from shield entries

**Existing test suites:** All existing tests pass, including
`TestValidateRequest`, `TestValidateRequest_PrefixMatching`,
`TestValidateRequest_SuffixMatching`,
`TestValidateRequest_BothMatchingEnabled`,
`TestValidateRequest_PathParameters`,
`TestOASMockResponseUnifiedPathMatching`,
`TestOASPathMatchingRespectGatewayConfig`, and `TestSortByPathLength`.

## Screenshots (if appropriate)

## Types of changes

<!-- What types of changes does your code introduce? Put an `x` in all
the boxes that apply: -->

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Refactoring or add test (improvements in base code or adds test
coverage to functionality)

## Checklist

<!-- Go over all the following points, and put an `x` in all the boxes
that apply -->
<!-- If there are no documentation updates required, mark the item as
checked. -->
<!-- Raise up any additional concerns not covered by the checklist. -->

- [ ] I ensured that the documentation is up to date
- [ ] I explained why this PR updates go.mod in detail with reasoning
why it's required
- [ ] I would like a code coverage CI quality gate exception and have
explained why

---------

Co-authored-by: Laurentiu <6229829+lghiur@users.noreply.github.com>
(cherry picked from commit a48ebe5)
<!-- Provide a general summary of your changes in the Title above -->

[TT-16693](https://tyktech.atlassian.net/browse/TT-16693)
[TT-16695](https://tyktech.atlassian.net/browse/TT-16695)
[TT-16696](https://tyktech.atlassian.net/browse/TT-16696)

## Description
When running tests that use the StartTest helper with
`config.SlaveOptions.UseRPC` enabled, the gateway enters an "emergency
mode" and the gateway falls back to loading API Definitions and Policies
from a Redis backup instead of the master nodes.

Because the entire test suite runs against a single shared Redis
instance, tests running in parallel would often write to the same Redis
keys. This created a race condition where a test could read data written
by another test.

The data, while often valid JSON, would have an incorrect structure for
the context in which it was being read. The unmarshaling process would
succeed but produce a slice containing nil values
([]model.MergedAPI{nil}). Subsequent code that did not anticipate these
nil values would then panic, causing the test to fail unpredictably.

Subsequently, the `prepareSpecs` function iterates over this slice and
calls methods on the APIDefinition. This leads to a nil pointer
dereference panic when it encounters one of the nil entries.

## List of change
1. Update dispatcher setup at tests to assign default func handlers and
make code cleaner
2. Update the expected response values to provide better documentation
3. Add unique tags for RPC based tests to avoid fetching invalid values
populated by other tests

## How This Has Been Tested
This change is targeting flaky tests not the production code in
particular

## Types of changes

<!-- What types of changes does your code introduce? Put an `x` in all
the boxes that apply: -->

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)
- [ ] Refactoring or add test (improvements in base code or adds test
coverage to functionality)

## Checklist

<!-- Go over all the following points, and put an `x` in all the boxes
that apply -->
<!-- If there are no documentation updates required, mark the item as
checked. -->
<!-- Raise up any additional concerns not covered by the checklist. -->

- [x] I ensured that the documentation is up to date
- [x] I explained why this PR updates go.mod in detail with reasoning
why it's required
- [ ] I would like a code coverage CI quality gate exception and have
explained why



























<!---TykTechnologies/jira-linter starts here-->

### Ticket Details

<details>
<summary>
<a href="https://tyktech.atlassian.net/browse/TT-16693" title="TT-16693"
target="_blank">TT-16693</a>
</summary>

|         |    |
|---------|----|
| Status  | In Code Review |
| Summary | [CI GW] Failing Unit Test:
TestProcessKeySpaceChanges_UserKeyReset |

Generated at: 2026-03-10 06:40:24

</details>

<!---TykTechnologies/jira-linter ends here-->




























[TT-16693]:
https://tyktech.atlassian.net/browse/TT-16693?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[TT-16695]:
https://tyktech.atlassian.net/browse/TT-16695?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[TT-16696]:
https://tyktech.atlassian.net/browse/TT-16696?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: Vlad Zabolotnyi <vlad.z@tyk.io>
- Use vars.BUILD_RUNNER (falls back to vars.DEFAULT_RUNNER) so the
  goreleaser job can use a larger runner (e.g. warp-8x-x64) without
  affecting other jobs
- Skip prod Docker metadata + build-push steps entirely on non-tag
  builds (saves ~9 min of unnecessary multi-platform builds, SBOM,
  provenance, and cache export that ran with push=false)
- Remove redundant cache-to from prod steps (CI steps already export)
- Simplify prod push: to true since step only runs on tags now

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@buger buger requested a review from a team as a code owner April 16, 2026 17:42
@github-actions
Copy link
Copy Markdown
Contributor

API Changes

no api changes detected

@probelabs
Copy link
Copy Markdown
Contributor

probelabs Bot commented Apr 16, 2026

This PR optimizes the release.yml GitHub Actions workflow to significantly reduce build times for pull requests.

Files Changed Analysis

  • .github/workflows/release.yml: The only file changed. The modifications focus on adding conditional logic to skip production-specific steps during PR builds, updating the runner configuration, and optimizing Docker build parameters. The number of additions and deletions is equal, reflecting the modification of existing workflow steps rather than the addition of new ones.

Architecture & Impact Assessment

  • What this PR accomplishes:

    • Accelerates the CI feedback loop for developers by skipping time-consuming Docker image publishing steps that are only necessary for official releases.
    • Introduces the ability to use a more powerful, configurable runner for build jobs, further reducing execution time.
    • Optimizes the Docker build process for PRs by building for a single platform (linux/amd64) and removing redundant cache export operations.
  • Key technical changes introduced:

    • The runs-on directive now uses vars.BUILD_RUNNER || vars.DEFAULT_RUNNER, allowing the build runner to be scaled via repository variables.
    • Production Docker metadata generation and image push steps for all variants (ee, fips, std) are now conditionally executed only for tag pushes (if: startsWith(github.ref, 'refs/tags')).
    • The platforms for the initial CI Docker build is conditionally limited to linux/amd64 on pull requests to speed up cross-compilation.
    • The cache-to directive is removed from production push steps to eliminate redundant cache writes.
    • Note: The conditions for pushing fips and std CI images have been modified to not run on pull requests (github.event_name != 'pull_request'). This appears to contradict the PR description's claim that all CI/ECR images are still built. This should be verified during review.
  • Affected system components:

    • The primary impact is on the CI/CD pipeline's performance and efficiency. There are no changes to the application's source code or runtime behavior.
  • Workflow Visualization:

graph TD
subgraph release_workflow ["Release Workflow"]
A[Start Job] --> B{Is event a tag push?}
B --|No (PR)|--> C[Build for amd64 only]
C --> D[Push EE CI Image]
D --> E[Skip Prod Image Steps]
E --> F[Finish]
B --|Yes (Tag)|--> G[Build for all platforms]
G --> H[Push EE/FIPS/STD CI Images]
H --> I[Generate Metadata & Push Prod Images]
I --> F
end


## Scope Discovery & Context Expansion
- The changes are localized to the `release.yml` workflow but improve a critical part of the development lifecycle by optimizing for the most common event (pull request updates).
- This PR establishes a clear separation between CI builds (for validation) and release builds (for publication), which is a CI/CD best practice.
- The use of `vars.BUILD_RUNNER` points to an operational strategy of managing CI infrastructure resources through GitHub variables, allowing for performance tuning without code changes.
- A key area for review is to confirm whether downstream jobs depend on the `fips` and `std` CI images being available on every PR, as the current changes will prevent them from being pushed.


<details>
  <summary>Metadata</summary>

  - Review Effort: 2 / 5
  - Primary Label: chore


</details>
<!-- visor:section-end id="overview" -->

<!-- visor:thread-end key="TykTechnologies/tyk#8057@3953c51" -->

---

*Powered by [Visor](https://probelabs.com/visor) from [Probelabs](https://probelabs.com)*

*Last updated: 2026-04-16T20:23:49.979Z | Triggered by: pr_updated | Commit: 3953c51*

💡 **TIP:** You can chat with Visor using `/visor ask <your question>`
<!-- /visor-comment-id:visor-thread-overview-TykTechnologies/tyk#8057 -->

@probelabs
Copy link
Copy Markdown
Contributor

probelabs Bot commented Apr 16, 2026

✅ Security Check Passed

No security issues found – changes LGTM.

✅ Performance Check Passed

No performance issues found – changes LGTM.

✅ Security Check Passed

No security issues found – changes LGTM.

\n\n \n\n

✅ Performance Check Passed

No performance issues found – changes LGTM.

\n\n

Quality Issues (2)

Severity Location Issue
🔴 Critical .github/workflows/release.yml:167
The expression used to conditionally set the `platforms` value is incorrect. The logical operators `&&` and `||` in GitHub Actions expressions return boolean values (`true` or `false`), not one of the operands. This expression will always evaluate to `true`, which is an invalid value for the `platforms` input and will cause the workflow step to fail.
💡 SuggestionUse the `if()` function to correctly select the platform string based on the event type.
🔧 Suggested Fix
          platforms: ${{ if(github.event_name == 'pull_request', 'linux/amd64', 'linux/amd64,linux/arm64,linux/s390x') }}
🟠 Error .github/workflows/release.yml:215-305
The condition `github.event_name != 'pull_request'` was added to the steps that push `fips` and `std` CI images (lines 215, 226, 290, 305). This prevents these images from being built and pushed on pull requests, which contradicts the PR description's statement: "CI/ECR images still build on every PR as before". This change could break downstream test jobs that rely on these images being available for PRs.
💡 SuggestionTo align with the stated intent in the PR description, remove the `&& github.event_name != 'pull_request'` condition from the `if` clauses for the `fips` and `std` CI image push steps.

Powered by Visor from Probelabs

Last updated: 2026-04-16T20:23:47.605Z | Triggered by: pr_updated | Commit: 3953c51

💡 TIP: You can chat with Visor using /visor ask <your question>

buger and others added 4 commits April 16, 2026 22:22
…ec227b/TT-16890' into perf/goreleaser-optimization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base automatically changed from merge/release-5.12.1/a48ebe56d51335fd2348abc2f377e69ec6ec227b/TT-16890 to release-5.12.1 April 16, 2026 20:58
@github-actions
Copy link
Copy Markdown
Contributor

🚨 Jira Linter Failed

Commit: daced74
Failed at: 2026-04-16 20:58:44 UTC

The Jira linter failed to validate your PR. Please check the error details below:

🔍 Click to view error details
failed to validate branch and PR title rules: neither branch name 'perf/goreleaser-optimization' nor PR title 'perf: skip prod Docker steps on PRs, use larger runner' contains a valid Jira ticket ID (e.g., ABC-123)

Next Steps

  • Ensure your branch name contains a valid Jira ticket ID (e.g., ABC-123)
  • Verify your PR title matches the branch's Jira ticket ID
  • Check that the Jira ticket exists and is accessible

This comment will be automatically deleted once the linter passes.

@buger
Copy link
Copy Markdown
Member Author

buger commented Apr 17, 2026

Closing: test PR, served its purpose. Optimizations captured in gromit#453.

@buger buger closed this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants