Requeue when git API lags behind webhook commit#4842
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a race condition for GitRepo reconciliation when spec.disablePolling: true: if a webhook provides a new commit but the git provider API hasn’t propagated it yet, the controller should requeue (instead of persisting a stale API commit into status and missing the change).
Changes:
- Add requeue logic when the fetched API commit doesn’t match a pending
status.webhookCommit, restoringstatus.committo the pre-reconcile value before requeuing. - Add an integration test that reproduces the “webhook arrives before API consistency” scenario and verifies a job is created once the API “catches up”.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
internal/cmd/controller/gitops/reconciler/gitjob_controller.go |
Adds mismatch detection between API commit and webhook commit to trigger a delayed requeue without losing the pre-reconcile commit. |
integrationtests/gitjob/controller/controller_test.go |
Adds an integration test covering the webhook-vs-stale-API race when polling is disabled. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
When disablePolling is true and a webhook delivers a new commit, the reconciler calls LatestCommit to confirm it against the git server API. If the provider's API has not yet propagated the push, LatestCommit returns the old commit, which was written into Status.Commit and overwrote the webhook value — preventing any deployment job from being created. When the fetched commit equals the previously-deployed commit but differs from Status.WebhookCommit, restore Status.Commit to its pre-reconcile value and requeue after the default interval. Using commit == oldCommit as the guard ensures that if the API has already advanced to some other commit, it is trusted unconditionally — no requeue is issued and an infinite loop is avoided.
ce2fffa to
baa0663
Compare
When disablePolling is true and the git server API never returns the webhook-announced commit, the reconciler would silently requeue every 5 s indefinitely with no user-visible indication of the problem. Track when stale requeuing began in a new GitRepoStatus field WebhookCommitStaleSince. Once the elapsed time exceeds GitPollingStaleTimeout (default 5 min), set the GitPolling condition to False with reason StaleAPI so the problem is visible via kubectl and the UI. The condition clears automatically once the API returns the expected commit. Clear the timestamp on LatestCommit errors so the timer tracks only the "consistently returns old commit" case.
There was a problem hiding this comment.
Pull request overview
This PR fixes a race condition when spec.disablePolling: true where a webhook-delivered commit could be overwritten by a stale LatestCommit() result from the Git provider API, preventing job creation. It also surfaces a persistent webhook/API mismatch as a visible GitPolling condition error after a configurable timeout.
Changes:
- Requeue (without overwriting
status.commit) when the Git provider API still returns the previously-deployed commit while a newerstatus.webhookCommitis pending. - Add
status.webhookCommitStaleSinceand, afterGitPollingStaleTimeout(default 5m), setGitPolling=Falsewith reasonStaleAPIuntil the API catches up. - Add/adjust integration tests to cover API lag, out-of-order webhook arrival, and permanently stale API behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/durations/durations.go | Adds default stale-timeout constant for webhook/API mismatch handling. |
| pkg/apis/fleet.cattle.io/v1alpha1/gitrepo_types.go | Introduces WebhookCommitStaleSince in GitRepoStatus. |
| pkg/apis/fleet.cattle.io/v1alpha1/zz_generated.deepcopy.go | Updates deep-copy generation for the new status field. |
| internal/cmd/controller/gitops/reconciler/gitjob_controller.go | Implements requeue/restore behavior and stale timeout condition surfacing; persists new status field. |
| internal/cmd/controller/gitops/operator.go | Wires timeout configuration into the reconciler. |
| integrationtests/gitjob/controller/suite_test.go | Sets a short stale timeout for faster integration testing. |
| integrationtests/gitjob/controller/controller_test.go | Adds integration coverage for API lag, API-ahead webhook, and permanently stale API cases. |
| charts/fleet-crd/templates/crds.yaml | Updates CRD schema to include webhookCommitStaleSince. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
If DisablePolling is turned back on or WebhookCommit is cleared after the stale tracker was set, the field would persist indefinitely because the code path that clears it was never reached. Clear it proactively whenever the condition it tracks cannot be active.
|
Closing for now since we went for a simpler fix. |
When
disablePolling: true, a webhook delivers a new commit viaStatus.WebhookCommitand the reconciler callsLatestCommitto confirm it against the git server API. If the provider's API hasn't yet propagated the push,LatestCommitreturns the previously-deployed commit. The original code wrote that stale value intoStatus.Commit, overwriting the webhook commit and causingshouldCreateJobto see no diff — so no deployment job was ever created.Status.WebhookCommit, restoreStatus.Committo its pre-reconcile value before requeueing. Without this restore, the status write would persistWebhookCommitasStatus.Commit, making the following reconcile blind to the change even after the API catches up.GitRepoStatus.WebhookCommitStaleSincefield tracks when the reconciler first detected the mismatch. AfterGitPollingStaleTimeout(default 5 min), theGitPollingcondition is set toFalsewith reasonStaleAPI, making the problem visible viakubectl get gitrepoand the Fleet UI. The condition clears automatically once the API returns the expected commit. The timer resets onLatestCommiterrors so it tracks the "consistently returns old commit" case rather than an unreachable API (which is already reported separately).Refers #4837