Summary
The unit-and-integration-tests GitHub Actions job is failing on multiple recent PRs at the same point: the pkg/controllers/workapplier AfterSuite teardown times out after the 30s grace period. CI logs surface a mismatch between the Ginkgo CLI version and the Ginkgo package version, which is the most likely cause but is not yet confirmed. Filing this so the right person can verify before anything changes.
Observed in CI
From unit-and-integration-tests job logs (e.g. https://github.com/kubefleet-dev/kubefleet/actions/runs/25395689081/job/74481731598?pr=691):
Ginkgo detected a version mismatch between the Ginkgo CLI and the version of Ginkgo imported by your packages:
Ginkgo CLI Version:
2.19.1
Mismatched package versions found:
2.23.4 used by workapplier
Then later:
[FAILED] in [AfterSuite] - pkg/controllers/workapplier/suite_test.go:467
[FAILED] Expected success, but got an error:
failed waiting for all runnables to end within grace period of 30s: context deadline exceeded
Ran 290 of 290 Specs in 465.500 seconds
FAIL! -- 290 Passed | 0 Failed | 0 Pending | 0 Skipped
All 290 specs pass — the failure is purely in suite teardown.
Likely root cause
The repo pins two different Ginkgo CLI versions across workflow jobs:
$ grep "ginkgo/v2/ginkgo@v" .github/workflows/ci.yml
go install github.com/onsi/ginkgo/v2/ginkgo@v2.19.1 # unit-and-integration-tests
go install github.com/onsi/ginkgo/v2/ginkgo@v2.23.4 # other job
go.mod has github.com/onsi/ginkgo/v2 v2.23.4. The @v2.19.1 install was added in Aug 2024 and never bumped when the package import was updated. The other @v2.23.4 install was bumped at some point.
.github/workflows/upgrade.yml also has three @v2.19.1 references.
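A rough sketch of how the pins could be audited against go.mod in one pass, rather than by eyeballing grep output. The function names `ginkgo_gomod_version` and `ginkgo_pin_mismatches` are hypothetical helpers, not anything that exists in the repo, and the regex assumes plain `vX.Y.Z` pins with no pre-release suffix.

```shell
# Print the Ginkgo version required by the given go.mod (e.g. v2.23.4).
# Handles both "require ( ... )" blocks and single-line "require" directives.
ginkgo_gomod_version() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "github.com/onsi/ginkgo/v2") { print $(i + 1); exit } }' "$1"
}

# Print every "file:line:pin" whose version differs from the expected one.
# Usage: ginkgo_pin_mismatches <expected-version> <file>...
# Returns 1 if at least one mismatching pin was found, 0 otherwise.
ginkgo_pin_mismatches() {
  expected="$1"
  shift
  mismatches=$(grep -Hno 'ginkgo/v2/ginkgo@v[0-9][0-9.]*' "$@" |
    awk -v suffix="@$expected" '{
      n = length(suffix)
      # Keep only lines that do not end in "@<expected-version>".
      if (substr($0, length($0) - n + 1) != suffix) print
    }')
  if [ -n "$mismatches" ]; then
    printf '%s\n' "$mismatches"
    return 1
  fi
  return 0
}
```

From the repo root, `ginkgo_pin_mismatches "$(ginkgo_gomod_version go.mod)" .github/workflows/*.yml` would be expected to list the @v2.19.1 pins in ci.yml and upgrade.yml, assuming the files are as described above.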
Reproduces on every PR
Recent CI runs across unrelated PRs:
| Time | PR | Result |
|---|---|---|
| 21:36:45 | configureUpdateRunThreshold | failure |
| 21:32:41 | copilot/fix-timedwait-invalid-time | failure |
| 18:48:20 | fix/override-snapshot-transition-race | failure |
| 18:04:22 | copilot/refactor-policyobservedclus | failure |
| 17:54:41 | configureUpdateRunThreshold | failure |
| 17:42:39 | copilot/fix-timedwait-invalid-time | success ← last green |
The Ginkgo CLI version is hardcoded in the workflow, so every PR that triggers this job hits the same failure point.
Proposed fix (needs verification)
Bump the pinned Ginkgo CLI in the unit-test job to match go.mod:
- go install github.com/onsi/ginkgo/v2/ginkgo@v2.19.1
+ go install github.com/onsi/ginkgo/v2/ginkgo@v2.23.4
Same fix in .github/workflows/upgrade.yml (3 occurrences).
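One way to keep the pin from drifting again might be to derive the CLI version from go.mod at install time instead of hardcoding it. The sketch below is only an illustration: `install_matching_ginkgo` and the `DRY_RUN` variable are hypothetical names, and it assumes go.mod carries a direct `github.com/onsi/ginkgo/v2` requirement.

```shell
# Install the Ginkgo CLI at whatever version go.mod requires.
# Usage: install_matching_ginkgo [path/to/go.mod]
# Set DRY_RUN to any non-empty value to print the command instead of running it.
install_matching_ginkgo() {
  gomod="${1:-go.mod}"
  # Find the version that follows the module path on its require line.
  version=$(awk '{ for (i = 1; i < NF; i++) if ($i == "github.com/onsi/ginkgo/v2") { print $(i + 1); exit } }' "$gomod")
  if [ -z "$version" ]; then
    echo "no github.com/onsi/ginkgo/v2 requirement found in $gomod" >&2
    return 1
  fi
  ${DRY_RUN:+echo} go install "github.com/onsi/ginkgo/v2/ginkgo@${version}"
}
```

A workflow step could then call `install_matching_ginkgo` in place of the pinned `go install` line. Whether that is preferable to an explicit pin is a maintainer call, since an explicit pin is easier to read in the workflow file.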
What I'm NOT certain about
I'm filing this rather than sending a PR straight away because the chain "version mismatch warning → AfterSuite teardown timeout" is a plausible correlation, but I haven't directly proven causation. Other possible causes I haven't ruled out:
- The warning is just noise; the real timeout has a different root cause (e.g. recent workapplier shutdown logic regression, slow envtest API-server shutdown, runner resource contention).
- A specific change in pkg/controllers/workapplier that introduced a slow shutdown path coincident with the Ginkgo bump.
A quick way to verify before merging the fix:
- Check Ginkgo CHANGELOG between v2.19.1 and v2.23.4 for changes to AfterSuite / runnable-drain / grace-period semantics.
- See if the workapplier failures started exactly when go.mod bumped Ginkgo (or when 30s grace was set).
- Try the CLI bump on a test branch and watch a few CI runs.
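For the second check, git's pickaxe search (`git log -S`) can show which commits changed the Ginkgo requirement line in go.mod, and those dates can then be compared with the first failing CI run. A minimal sketch, with `ginkgo_bump_commits` as a hypothetical helper name, run from the repo root:

```shell
# List commits that added or removed the given Ginkgo requirement in go.mod.
# Usage: ginkgo_bump_commits <version> [path/to/go.mod]
# Output format: "<short-hash> <YYYY-MM-DD> <subject>", newest first.
ginkgo_bump_commits() {
  git log --date=short --format='%h %ad %s' \
    -S"github.com/onsi/ginkgo/v2 $1" -- "${2:-go.mod}"
}
```

For example, `ginkgo_bump_commits v2.23.4` should print the commit that introduced the v2.23.4 requirement. If that commit predates the first red run by a wide margin, the mismatch alone is less likely to explain the timeout.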
Happy to send the fix PR if a maintainer confirms the diagnosis (or wants to land it speculatively given how broad the impact is right now).