
ci: Ginkgo CLI / runtime version mismatch likely causing AfterSuite timeout in unit-and-integration-tests #695

Description

Summary

The unit-and-integration-tests GitHub Actions job is failing on multiple recent PRs at the same point: the pkg/controllers/workapplier AfterSuite teardown times out after the 30s grace period. The CI logs also show a Ginkgo CLI/runtime version mismatch, which is the most likely cause but is not yet confirmed. I'm filing this so the right person can verify before anything is changed.

Observed in CI

From unit-and-integration-tests job logs (e.g. https://github.com/kubefleet-dev/kubefleet/actions/runs/25395689081/job/74481731598?pr=691):

Ginkgo detected a version mismatch between the Ginkgo CLI and the version of Ginkgo imported by your packages:
  Ginkgo CLI Version:
    2.19.1
  Mismatched package versions found:
    2.23.4 used by workapplier

Then later:

[FAILED] in [AfterSuite] - pkg/controllers/workapplier/suite_test.go:467
[FAILED] Expected success, but got an error:
    failed waiting for all runnables to end within grace period of 30s: context deadline exceeded

Ran 290 of 290 Specs in 465.500 seconds
FAIL! -- 290 Passed | 0 Failed | 0 Pending | 0 Skipped

All 290 specs pass — the failure is purely in suite teardown.
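
The same reading of the log can be checked mechanically when comparing several runs. The sketch below assumes a job log has been saved to a file and that its text matches the excerpts above with no color codes inside the matched phrases; the function names are hypothetical helpers, not something in the repo.

```shell
# Sketch: classify a saved CI job log. Function names are hypothetical.

# Succeeds if Ginkgo printed its CLI/package version-mismatch warning.
log_has_ginkgo_mismatch() {
  grep -q 'Ginkgo detected a version mismatch' "$1"
}

# Succeeds if the run failed in AfterSuite although no spec failed,
# i.e. the summary line reports "0 Failed" next to "FAIL!".
log_is_teardown_only_failure() {
  grep -q '\[FAILED\] in \[AfterSuite\]' "$1" &&
    grep -q 'FAIL! -- [0-9]* Passed | 0 Failed |' "$1"
}
```

Assuming the GitHub CLI is available, a log could be saved with something like `gh run view <run-id> --log > run.log` and then passed to either function.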

Likely root cause

The repo pins two different Ginkgo CLI versions across workflow jobs:

$ grep "ginkgo/v2/ginkgo@v" .github/workflows/ci.yml
go install github.com/onsi/ginkgo/v2/ginkgo@v2.19.1   # unit-and-integration-tests
go install github.com/onsi/ginkgo/v2/ginkgo@v2.23.4   # other job

go.mod requires github.com/onsi/ginkgo/v2 v2.23.4. The @v2.19.1 install was added in Aug 2024 and was not bumped when the package import was updated, whereas the other job's install was bumped to @v2.23.4 at some point.

.github/workflows/upgrade.yml also has three @v2.19.1 references.
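
To list every stale pin in one pass rather than grepping each file by hand, something along these lines might help. It is a minimal sketch: both function names are made up for illustration, and it assumes go.mod names the module as `github.com/onsi/ginkgo/v2 <version>` on a single line.

```shell
# Sketch: compare the Ginkgo version in go.mod with every
# "ginkgo/v2/ginkgo@vX.Y.Z" pin in the given workflow files.
# Both function names are hypothetical helpers.

# Print the Ginkgo library version recorded in a go.mod file.
ginkgo_mod_version() {
  awk '{ for (i = 1; i < NF; i++)
           if ($i == "github.com/onsi/ginkgo/v2") { print $(i + 1); exit } }' "$1"
}

# Usage: ginkgo_pin_mismatches GO_MOD WORKFLOW_FILE...
# Prints "file:line:match" for each pin that differs from go.mod.
# Returns 1 if any mismatching pin was found, 0 otherwise.
ginkgo_pin_mismatches() {
  _gomod="$1"; shift
  _want="$(ginkgo_mod_version "$_gomod")"
  _out="$(grep -Hno 'ginkgo/v2/ginkgo@v[0-9][0-9.]*' "$@" |
            awk -F'@' -v want="$_want" '$NF != want')"
  [ -z "$_out" ] && return 0
  printf '%s\n' "$_out"
  return 1
}
```

Run from the repo root as `ginkgo_pin_mismatches go.mod .github/workflows/*.yml`; a non-zero exit status would also make it usable as a CI guard against the pins drifting again.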

Reproduces on every PR

Recent CI runs across unrelated PRs:

Time      PR                                      Result
21:36:45  configureUpdateRunThreshold             failure
21:32:41  copilot/fix-timedwait-invalid-time      failure
18:48:20  fix/override-snapshot-transition-race   failure
18:04:22  copilot/refactor-policyobservedclus     failure
17:54:41  configureUpdateRunThreshold             failure
17:42:39  copilot/fix-timedwait-invalid-time      success ← last green

The CLI version is hardcoded in the workflow, so every PR that triggers this job hits the same failure point.

Proposed fix (needs verification)

Bump the pinned Ginkgo CLI in the unit-test job to match go.mod:

-          go install github.com/onsi/ginkgo/v2/ginkgo@v2.19.1
+          go install github.com/onsi/ginkgo/v2/ginkgo@v2.23.4

Same fix in .github/workflows/upgrade.yml (3 occurrences).
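
Since the same edit applies to four lines across two files, it could be done in one pass. This is a sketch only: `bump_ginkgo_pins` is a hypothetical helper, and it rewrites every `ginkgo/v2/ginkgo@v...` pin in the files it is given, so the resulting diff should be reviewed before committing.

```shell
# Sketch: rewrite every "ginkgo/v2/ginkgo@vX.Y.Z" pin in the given
# files to a single target version. Hypothetical helper.
# Usage: bump_ginkgo_pins NEW_VERSION FILE...
bump_ginkgo_pins() {
  _new="$1"; shift
  for _f in "$@"; do
    # Write to a temp file and move it into place, because in-place
    # sed (-i) behaves differently on GNU and BSD systems.
    sed "s|\(ginkgo/v2/ginkgo@\)v[0-9][0-9.]*|\1${_new}|g" "$_f" > "$_f.tmp" &&
      mv "$_f.tmp" "$_f"
  done
}
```

For this issue the call would be `bump_ginkgo_pins v2.23.4 .github/workflows/ci.yml .github/workflows/upgrade.yml`. An alternative I have not tried in this repo is to drop the hardcoded version and have the install step read it from go.mod, e.g. `go install github.com/onsi/ginkgo/v2/ginkgo@"$(go list -m -f '{{.Version}}' github.com/onsi/ginkgo/v2)"`, so the CLI cannot drift from the imported package.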

What I'm NOT certain about

I'm filing this rather than sending a PR straight away because the chain "version mismatch warning → AfterSuite teardown timeout" is a plausible correlation, but I haven't proven causation. Other possible causes I haven't ruled out:

  1. The warning is just noise; the real timeout has a different root cause (e.g. recent workapplier shutdown logic regression, slow envtest API-server shutdown, runner resource contention).
  2. A specific change in pkg/controllers/workapplier that introduced a slow shutdown path coincident with the Ginkgo bump.

A quick way to verify before merging the fix:

  • Check the Ginkgo CHANGELOG between v2.19.1 and v2.23.4 for changes to AfterSuite / runnable-drain / grace-period semantics.
  • See if the workapplier failures started exactly when go.mod bumped Ginkgo (or when the 30s grace period was set).
  • Try the CLI bump on a test branch and watch a few CI runs.
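
For the timing check, git's pickaxe search (`git log -S`) can find the earliest commit that changed how often a given string occurs in a file, which here would be the commit that introduced it. A minimal sketch follows; `first_commit_adding` is a hypothetical helper and has to be run inside a clone of the repo.

```shell
# Sketch: print "<short-hash> <date> <subject>" for the earliest commit
# that changed the number of occurrences of STRING in PATH.
# Hypothetical helper; must be run inside the git repository.
# Usage: first_commit_adding STRING PATH
first_commit_adding() {
  git log --reverse --date=short --format='%h %ad %s' -S"$1" -- "$2" |
    head -n 1
}
```

For example, `first_commit_adding 'github.com/onsi/ginkgo/v2 v2.23.4' go.mod` should point at the go.mod bump, and its date can then be compared with the first failing run.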

Happy to send the fix PR if a maintainer confirms the diagnosis (or wants to land it speculatively given how broad the impact is right now).
