Speed up CI test runs (round 2): -T 1C in CI, forkCount tuning, OpenShift delete tests #7675

@manusa

Follow-up to #7648 (closed by #7673). After landing the first round, audited the build a second time to find the next set of measurable wins. A few items from the original umbrella turned out to be moot once verified against the post-#7673 tree — those are listed under "Explicitly out of scope" with the reason.

Status

  • 1. CI: add -T 1C to multi-module Maven invocations — local make already uses it; CI does not. Largest expected wall-time win on PRs. Risk: low (Sonar profile already opts out).
  • 2. forkCount=1C on kubernetes-client-api and kubernetes-client (measured locally below). Risk: low (state leakage already addressed by #7667 and #7673, the two "enable surefire fork reuse" chores).
  • 3. Fix the OpenShift mock delete test pattern — ~13 tests waste ~16 s each in retry backoff against a single-shot mock. Risk: low; localized to test-side withGracePeriod(0) / additional mock expectation.
  • 4. MockedStatic-leak guard (JUnit 5 extension): motivated by the incident fixed in #7673; only 4 test files in the whole repo use MockedStatic, so the guard surface is small. Risk: none.
  • 5. Ship-leak fix: chaosmesh-client packages org.mockito.plugins.MockMaker in its production jar — out of scope for "speed", but discovered during this audit; flagging because it leaks mock-maker-inline to consumers. Risk: none.
  • 6. Drop the explicit mockito-inline:5.2.0 test deps and the in-tree mockito-extensions resource — Mockito 5 (this project uses 5.11.0/5.23.0) bundles the inline maker as the default in mockito-core; the separate dep and the resource files are no-ops that confuse readers. Risk: none.
  • 7. (Lower priority) Investigate JUnit 5 in-fork parallel execution for kubernetes-tests — would require auditing 14 tests that mutate System.setProperty, plus per-class @EnableKubernetesMockClient shared state. Document and defer unless items 1–3 don't deliver enough headroom.
  • 8. (Lower priority) Move javadocs.yml and generate-model.yml off macos-latest — both are Java/Go workflows that have no macOS-specific work. Not a wall-time win (arm64 macos-latest is per-core faster than the standard 2-vCPU Linux runner); the rationale is hygiene — 10× billing multiplier, macOS runner capacity volatility, slower disk I/O. Risk: none for the JavaDocs job; trivial verification needed for make generate-model.

Experimental baseline vs. changes

All times on a local 16-core box, JDK 11. Wall time is end-to-end mvn test -pl <module> (includes a small recompile/install step that's identical between rows). On 2-core GitHub runners the absolute savings will be smaller, but the relative shape should hold.

| Module | Config | Wall | Test phase | Failures |
|---|---|---|---|---|
| kubernetes-client-api | current (forkCount=1, reuseForks=true) | 68 s | ~37 s | 1 (pre-existing KubeConfigUtilsMergeTest flake) |
| kubernetes-client-api | -DforkCount=1C | 48 s | ~17 s | 1 (same) |
| kubernetes-tests | current (forkCount=1C, reuseForks=true) | 199 s | ~165 s in surefire | 0 |

Top-10 slowest classes in kubernetes-tests (post-#7673):

| Wall (s) | Class | Note |
|---|---|---|
| 122.2 | io.fabric8.openshift.client.server.mock.OpenShiftResourcesTest | 100+ parameterized resources × ~1 s each, mostly delete |
| 40.5 | MachineConfigPoolTest | delete 18.3 s alone |
| 39.0 | SubscriptionTest | delete 8.6 s, create 11.3 s |
| 37.3 | MachineTest | delete 16.8 s |
| 36.9 | AlertmanagerConfigTest | delete 16.1 s |
| 36.8 | ConfigTest (openshift) | delete 16.1 s |
| 36.6 | PodTest | mixed; not a single hotspot |
| 36.1 | ProbeTest | delete 17.4 s |
| 35.1 | NetworkTest | delete 16.2 s |
| 35.0 | ResourceTest | mixed |

The delete-heavy classes are item 3.


1. Add -T 1C to multi-module Maven invocations in CI

Makefile:19 sets override MAVEN_ARGS += -T 1C for every local target except sonar, so contributors running make install / make quickly already get parallel module builds. CI does not — .github/workflows/build.yml:69 is mvn ${MAVEN_ARGS} install with MAVEN_ARGS=-B -C -V -ntp -Dhttp.keepAlive=false -e -Dit.test.bom=true (no -T). Same story for windows-build.yml:57 (./mvnw clean install). The result is that the three Linux PR cells and the Windows cell all build modules sequentially.

Proposed change: add -T 1C to MAVEN_ARGS (or to the inline mvn line) in .github/workflows/build.yml and .github/workflows/windows-build.yml. Sonar (sonar.yml → make sonar) already opts out, so it's unaffected.

Expected impact: large but hard to estimate without trying — modules like the extension/model trees compile independently, and on a 2-core runner -T 1C should overlap test runs across modules. Pragmatic guess: ~20–30 % off each PR cell's mvn install step.

Risks: one historical reason -T 1C was avoided in CI was test-parallel flakiness, but the only currently-documented blocker is Sonar (Makefile:65). Worth a single trial PR with the flag enabled and the matrix re-run. If a specific module fails under parallel execution, isolate it with <inheritAll> / <dependencies> rather than reverting the global flag.

2. forkCount=1C on kubernetes-client-api (and kubernetes-client)

kubernetes-client-api/pom.xml:233 and kubernetes-client/pom.xml:150 both pin forkCount=1. Local measurement on kubernetes-client-api:

  • forkCount=1: 68 s wall, ~37 s test
  • forkCount=1C: 48 s wall, ~17 s test (about -54 % test phase, -29 % wall)

Same flaky test as baseline; no new failures. The state-leakage classes that justified pinning forkCount=1 were addressed in #7667 (env-var save/restore in CertUtilsTest) and #7673 (MockedStatic cleanup in AbstractWatchManagerTest); nothing remaining in either module appears to need single-forked execution.

Proposed change:

  • kubernetes-client-api/pom.xml: <forkCount>1</forkCount> → <forkCount>1C</forkCount>.
  • kubernetes-client/pom.xml: same.

Expected impact: −20 s wall on kubernetes-client-api per matrix cell on a 16-core box; on 2-core CI runners the gain is smaller but still positive (forkCount=1C resolves to 2 forks). Not worth bumping kubernetes-tests further (it's already 1C).

Risks: the standard parallel-fork hazards (filesystem races, port conflicts). The kubernetes-tests module has been at 1C for two weeks without incident; both other modules use the same mock-server harness.
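
For readers unfamiliar with the C suffix: forkCount=1C (like Maven's -T 1C) means "one per available core". The sketch below illustrates that rule. CoreCount and resolve are hypothetical names, and the truncate-then-floor-at-one rounding is an assumption for illustration, not code copied from Surefire.

```java
// Sketch of how a "<number>C" setting maps to a concrete count.
final class CoreCount {

    /** Resolves values such as "1C", "0.5C" or "4" for a machine with the given core count. */
    static int resolve(String value, int cores) {
        String trimmed = value.trim();
        if (trimmed.endsWith("C") || trimmed.endsWith("c")) {
            // The part before the suffix is a multiplier applied to the core count.
            double multiplier = Double.parseDouble(trimmed.substring(0, trimmed.length() - 1));
            double scaled = multiplier * cores;
            // Assumption: truncate, but never drop below one for a positive multiplier.
            return scaled > 0 ? Math.max((int) scaled, 1) : 0;
        }
        // No suffix: the value is an absolute count.
        return Integer.parseInt(trimmed);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("1C on this machine resolves to " + resolve("1C", cores));
    }
}
```

Under these assumptions resolve("1C", 2) gives 2 and resolve("1C", 16) gives 16, which is why the gain on a 2-core runner is expected to be smaller than on the 16-core box.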

3. Audit OpenShift mock delete tests

The single biggest concentration of waste in kubernetes-tests is the OpenShift *Test#delete pattern. Example — kubernetes-tests/src/test/java/io/fabric8/openshift/client/server/mock/MachineConfigPoolTest.java:71:

server.expect().delete().withPath("/apis/.../machineconfigpools/cluster")
    .andReturn(HttpURLConnection.HTTP_OK, createNewMachineConfigPool("cluster"))
    .once();

boolean isDeleted = client.machineConfigurations().machineConfigPools()
    .withName("cluster").delete().size() == 1;

The mock is set to once(). The first DELETE succeeds. With the default gracePeriodSeconds=-1 (HasMetadataOperation.DEFAULT_GRACE_PERIOD_IN_SECONDS), the client subsequently issues additional requests to confirm the deletion; the mock has no expectation for those, so the request layer drops into the default retry loop (Config.DEFAULT_REQUEST_RETRY_BACKOFFLIMIT=10, exponential backoff from 100 ms) and eats ~16 s per test.
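
The ~16 s figure can be sanity-checked with a little arithmetic. The sketch below sums a doubling backoff that starts at the 100 ms mentioned above. BackoffEstimate and totalMillis are hypothetical names, and both the 3200 ms cap and the number of waits are assumptions chosen for illustration, not values read from the client source.

```java
// Sketch: total time spent sleeping in a doubling backoff schedule.
final class BackoffEstimate {

    /**
     * Sums `waits` delays that start at initialMillis and double each time,
     * never exceeding capMillis. Pass Long.MAX_VALUE for an uncapped schedule.
     */
    static long totalMillis(long initialMillis, int waits, long capMillis) {
        long total = 0;
        long delay = initialMillis;
        for (int i = 0; i < waits; i++) {
            total += Math.min(delay, capMillis);
            // Stop doubling once the cap is reached; this also avoids overflow.
            if (delay < capMillis) {
                delay *= 2;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 ms comes from the text; the cap and wait counts are assumptions.
        System.out.println(totalMillis(100, 10, Long.MAX_VALUE)); // 102300
        System.out.println(totalMillis(100, 9, 3200));            // 15900
        System.out.println(totalMillis(100, 10, 3200));           // 19100
    }
}
```

Nine capped waits come to 15.9 s, close to the measured 16 to 18 s per delete, whereas ten uncapped waits would come to 102.3 s. That suggests, but does not prove, that the retry interval is capped somewhere in the request layer.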

Affected tests, with measured per-method times from this run:

| Test | Method | Time |
|---|---|---|
| MachineConfigPoolTest | delete | 18.3 s |
| ProbeTest | delete | 17.4 s |
| MachineTest | delete | 16.8 s |
| SecurityContextConstraintsCrudTest | crudTest | 16.3 s |
| NetworkTest | delete | 16.2 s |
| AlertmanagerConfigTest | delete | 16.1 s |
| ConfigTest (openshift) | delete | 16.1 s |
| TemplateInstanceTest | delete | 14.2 s |
| NetworkAttachmentDefinitionTest | delete | 14.1 s |
| OverlappingRangeIPReservationTest | delete | 13.2 s |
| ImageSignatureTest | delete | 9.0 s |
| SubscriptionTest | delete | 8.6 s |
| RangeAllocationTest | delete | 5.9 s |
| OpenShiftResourcesTest | resource_whenDelete... (parameterized, ~100 invocations) | 122 s class total, dominated by the delete loop |

Sum: ~304 s, predominantly retry backoff. The whole kubernetes-tests test phase is only ~165 s because the module already runs with forkCount=1C, so the achievable saving is bounded by the longest fork rather than by the literal sum. Even so, these hot classes tie up one of the forks for a long time.

Proposed change: per affected test, either (a) add .withGracePeriod(0) to the call chain, which short-circuits the confirmation poll, or (b) declare a follow-up andReturn(HTTP_NOT_FOUND, ...) expectation so the polling exits on its first re-check. Option (a) is a one-line change per test and matches behavior other tests already use. The OpenShiftResourcesTest parameterized loop should also adopt option (a).

Expected impact: ~2–3 minutes off kubernetes-tests, depending on how forks happen to load-balance.

Risks: any test that intends to exercise the polling path would lose coverage. Spot-checking the listed tests, none of them assert anything beyond isDeleted == true, so the polling behavior is incidental — but a per-test review is needed before flipping.

4. MockedStatic-leak guard

This is an item from the original umbrella, narrowed by inventory: only 4 test files in the entire repo use MockedStatic:

  • kubernetes-client-api/src/test/java/io/fabric8/kubernetes/client/utils/HttpClientUtilsTest.java
  • kubernetes-client-api/src/test/java/io/fabric8/kubernetes/client/utils/OpenIDConnectionUtilsTest.java
  • kubernetes-client/src/test/java/io/fabric8/kubernetes/client/dsl/internal/AbstractWatchManagerTest.java
  • kubernetes-client/src/test/java/io/fabric8/kubernetes/client/dsl/internal/PortForwarderWebsocketListenerTest.java

#7673's incident was specifically a MockedStatic opened but never closed in AbstractWatchManagerTest. With this small a surface, a JUnit 5 extension that fails the test if Mockito.mockingDetails(...) shows lingering static-mock registrations after the test completes is cheap to add (one shared extension class, registered via META-INF/services/org.junit.jupiter.api.extension.Extension so it auto-runs across both modules).

Proposed change: add a small extension under a shared test-utility location, declare it in META-INF/services/org.junit.jupiter.api.extension.Extension, and turn on auto-detection in junit-platform.properties (junit.jupiter.extensions.autodetection.enabled=true is already on for kubernetes-itests but not for these modules, so it would need adding).

Expected impact: 0 wall time. Pure regression insurance.

Risks: false positives if any test legitimately needs a MockedStatic to outlive the test (none do today).

Open question: given there's only one historical incident, is the extension worth its maintenance cost? Lean yes because the bug class is invisible until you flip a build setting, but happy to defer if maintainers prefer.

5. chaosmesh-client ships a Mockito MockMaker in its production jar

Side discovery, not a speed item but came out of the audit. extensions/chaosmesh/client/src/main/resources/mockito-extensions/org.mockito.plugins.MockMaker is in src/main/resources, not src/test/resources. Since chaosmesh-client packages as bundle, this file is shipped to consumers — anyone depending on chaosmesh-client in a project that also uses Mockito will silently switch to mock-maker-inline because of us.

History: introduced in commit 09edb3d246 (PR #4447, Dec 2022) — almost certainly meant to be src/test/resources.

Proposed change: git mv extensions/chaosmesh/client/src/main/resources/mockito-extensions/ extensions/chaosmesh/client/src/test/resources/mockito-extensions/. (And see item 6 — once this moves to test, it's also redundant on Mockito 5.)

6. Drop redundant mockito-inline deps and resource files

Eight modules pull mockito-inline:5.2.0 as a test dep (pom.xml:122,892 and the per-module declarations in httpclient-jetty, httpclient-vertx-5, httpclient-vertx, httpclient-okhttp, httpclient-jdk, kubernetes-model-generator/kubernetes-model-common, kubernetes-client-api). Mockito 5 ships mock-maker-inline as the default mock maker bundled in mockito-core; the separate mockito-inline artifact is a deprecated leftover from the 4.x era when inline was opt-in. With mockito.version=5.23.0 (and 5.11.0 on Java 17+) already in use, the explicit dep:

  1. Pins a stale 5.2.0 version that, transitively, would pull in mockito-core:5.2.0 if it weren't overridden.
  2. Has no effect — the inline maker is already active by default.

kubernetes-client/src/test/resources/mockito-extensions/org.mockito.plugins.MockMaker (mock-maker-inline) falls in the same category: redundant with Mockito 5+ defaults.

Proposed change: delete the eight mockito-inline declarations, the version property, and the kubernetes-client resource file. Same for the chaosmesh file in item 5 once it's moved to test scope.

Expected impact: tiny — small reduction in dep resolution and JVM startup. Mainly hygiene + reduces a future-version footgun.

7. JUnit 5 in-fork parallel execution (lower priority)

Surefire's <parallel> is JUnit 4 syntax — already removed from the parent pom in #7657. The Jupiter equivalent is junit-platform.properties with junit.jupiter.execution.parallel.enabled=true and mode.default=concurrent. This would parallelize tests within a single fork, layered on top of forkCount=1C.

Why I'm flagging this as lower-priority rather than recommending it:

  • 14 test files in kubernetes-client(-api) mutate System.setProperty / System.setProperties (ConfigTest, ConfigConstructorTest, ConfigProxySourceTest, ConfigSourcePrecedenceTest, ConfigRefreshTest, ConfigAutoConfigureTest, ConfigDisableAutoConfigurationTest, TokenRefreshInterceptorTest, OpenIDConnectionUtilsBehaviorTest, HttpClientUtilsTest, CertUtilsTest, UtilsTest, plus the URL-from-env classes). Concurrent execution would leak between threads even within a single class.
  • @EnableKubernetesMockClient instantiates one KubernetesMockServer per test class; concurrent test methods within a class would race over its expectations.
  • Items 1–3 should produce most of the achievable savings without this complexity.

If items 1–3 still leave PR builds too slow, reopen this as its own ticket with a per-module audit.
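
To make the first bullet concrete, here is a minimal sketch of the save/restore pattern around a system property. SystemPropertyScope, withProperty, and the property key are hypothetical and not taken from the repository. The pattern is safe when tests run one after another in a fork, but system properties are JVM-global, so under concurrent execution another test thread can observe the temporary value between the set and the restore.

```java
import java.util.function.Supplier;

// Sketch: set a system property for the duration of a block, then restore it.
final class SystemPropertyScope {

    /** Runs body with key=value set and puts the previous state back afterwards. */
    static <T> T withProperty(String key, String value, Supplier<T> body) {
        String previous = System.getProperty(key);
        System.setProperty(key, value);
        try {
            // Any other thread reading this key right now sees `value` too.
            return body.get();
        } finally {
            if (previous == null) {
                System.clearProperty(key);
            } else {
                System.setProperty(key, previous);
            }
        }
    }

    public static void main(String[] args) {
        // "sketch.example.key" is a made-up key used only for this demo.
        String seen = withProperty("sketch.example.key", "temporary",
                () -> System.getProperty("sketch.example.key"));
        System.out.println("inside: " + seen);
        System.out.println("after: " + System.getProperty("sketch.example.key"));
    }
}
```

If in-fork parallelism is ever pursued, one option worth evaluating in the per-module audit is JUnit Jupiter's @ResourceLock(Resources.SYSTEM_PROPERTIES), which serializes tests that declare the same shared resource.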

8. Move javadocs.yml and generate-model.yml off macos-latest (lower priority)

.github/workflows/javadocs.yml:43 and .github/workflows/generate-model.yml:41 both use runs-on: macos-latest. The first runs make javadoc (pure Java); the second runs make generate-model (Go + Java). Neither needs Apple toolchains.

The original framing here ("macOS runners are slower per minute for CPU-bound work") was wrong and has been corrected. As of August 2025, macos-latest resolves to macos-15 on arm64 (Apple Silicon, M1-class, GA'd May 2024), and per-core single-thread is faster than the standard 2-vCPU Linux x64 runner (~3,000+ vs ~2,200 Passmark, per RunsOn benchmarks). The actual reasons to move are hygiene, not raw speed:

  • Billing multiplier: macOS minutes are billed at 10× and Linux at 1× (GitHub billing docs, actions runner pricing). Public-repo builds are free regardless, but the multiplier still applies the moment any minute pool is involved, and it's a poor signal to set for OSS resource use.
  • Disk I/O: macOS runners have long-documented slow random-access disk performance (actions/virtual-environments#2707). Maven dependency resolution and Javadoc generation are I/O-heavy, which erodes the per-core CPU edge.
  • Capacity / queueing: macOS runner capacity is more incident-prone — e.g. the October 1 2025 outage hit ~96 % error rate for ~10 hours, and macos-15 x86 had a sustained slowness period (actions/runner-images#12545). arm64 is healthier but queue times still trend longer.
  • No Apple toolchain need: make javadoc is pure Java, and make generate-model is Go + Java — both are run on Linux locally by every contributor.

Proposed change: switch both to runs-on: ubuntu-latest. For generate-model.yml, the only thing to verify is that actions/setup-go@v6 + actions/setup-java@v5 produce the same generator output on Linux (they should — make generate-model is run on Linux locally by every contributor).

Expected impact: not a wall-time win on the PR critical path. Mainly hygiene — stop charging macOS minutes for work that has nothing macOS-specific, and avoid macOS runner outages blocking otherwise-Linux jobs. Demoted to lower priority because the productivity gain is minor.

Risks: none for javadocs.yml. For generate-model.yml, run once on Linux and confirm kube-schema.json/generated-java diff is empty.

Explicitly out of scope

  • Propagate reuseForks=true to other modules. Premise was stale: in the post-#7673 tree, no module sets reuseForks=false anywhere (verified by grep -rn 'reuseForks' --include='pom.xml'). Surefire 3.5.5's default is forkCount=1, reuseForks=true; the three modules with overrides (kubernetes-tests, kubernetes-client-api, kubernetes-client) just re-state the default plus, for kubernetes-tests, the 1C count. Verified against mvn help:effective-pom on openshift-client. Nothing to propagate.
  • Remove parent <parallel>suitesAndClasses</parallel>. Already removed in PR #7657 (chore: speed up kubernetes-tests with surefire fork reuse), the same commit that flipped kubernetes-tests.
  • useIncrementalCompilation=false. Set in the root pom (pom.xml:84) and four extension client modules. This is intentional belt-and-braces against sundrio/lombok annotation-processing inconsistencies. maven-compiler-plugin 3.15.0 (in use) does support the newer incrementalCompilation parameter, but there's no concrete reproducer for the original problem and reverting trades a known-good cold build for an unknown incremental risk. Skip.
  • mvnd for local dev. Worth recommending in doc/ (Apache Maven Daemon's reusable JVM saves multi-second startup per mvn invocation), but it's a contributor preference, not a CI change. No code change here.
  • Cut a Java cell from the matrix. The Java 11/17/21 + Windows-17 matrix exists because the project ships against Java 11+ and switches Mockito version + Karaf itests by JDK (pom.xml:1586 java-17 profile). Each cell exercises a different code path; collapsing them risks shipping a regression that only shows on the dropped JDK. Skip.
  • mockito-inline:5.2.0 dep removal as a "speed" item. Filed as item 6 because it's worth doing for hygiene; the wall-time impact is in the noise.
  • Kubernetes-itests / e2e cell count. e2e-tests.yml runs kubernetes-itests against 8 K8s versions on PRs. Each version is required to catch version-specific behavior; this is the project's stated compatibility surface. Out of scope.

Rough combined expectation

Sum-of-parts is misleading because cells run in parallel, so report per-cell:

  • build.yml per-cell wall time, with items 1+2: ~1–2 min faster (depends on how -T 1C overlaps the model-generator chain on a 2-core runner).
  • kubernetes-tests (inside each cell), with item 3: ~2–3 min faster.
  • Items 4, 5, 6, 7, 8: no measurable wall-time change (hygiene / regression insurance).

Net expected PR-merge readiness improvement: another ~30–40 % on top of #7648's gains, mostly from items 1 and 3.
