Skip to content

Requeue when git API lags behind webhook commit#4842

Closed
thardeck wants to merge 3 commits into
mainfrom
implement_4837
Closed

Requeue when git API lags behind webhook commit#4842
thardeck wants to merge 3 commits into
mainfrom
implement_4837

Conversation

@thardeck
Copy link
Copy Markdown
Collaborator

@thardeck thardeck commented Mar 18, 2026

When disablePolling: true, a webhook delivers a new commit via Status.WebhookCommit and the reconciler calls LatestCommit to confirm it against the git server API. If the provider's API hasn't yet propagated the push, LatestCommit returns the previously-deployed commit. The original code wrote that stale value into Status.Commit, overwriting the webhook commit and causing shouldCreateJob to see no diff — so no deployment job was ever created.

  • When the fetched commit differs from Status.WebhookCommit, restore Status.Commit to its pre-reconcile value before requeueing. Without this restore, the status write would persist WebhookCommit as Status.Commit, making the following reconcile blind to the change even after the API catches up.
  • If the git server API never returns the webhook-announced commit (broken API, network partition, or slow propagation beyond any reasonable deadline), the reconciler previously requeued silently every 5 s with no user-visible indication. A new GitRepoStatus.WebhookCommitStaleSince field tracks when the reconciler first detected the mismatch. After GitPollingStaleTimeout (default 5 min), the GitPolling condition is set to False with reason StaleAPI, making the problem visible via kubectl get gitrepo and the Fleet UI. The condition clears automatically once the API returns the expected commit. The timer resets on LatestCommit errors so it tracks the "consistently returns old commit" case rather than an unreachable API (which is already reported separately).

Refers #4837

@thardeck thardeck self-assigned this Mar 18, 2026
@thardeck thardeck requested a review from a team as a code owner March 18, 2026 09:15
Copilot AI review requested due to automatic review settings March 18, 2026 09:15
@thardeck thardeck added this to Fleet Mar 18, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a race condition for GitRepo reconciliation when spec.disablePolling: true: if a webhook provides a new commit but the git provider API hasn’t propagated it yet, the controller should requeue (instead of persisting a stale API commit into status and missing the change).

Changes:

  • Add requeue logic when the fetched API commit doesn’t match a pending status.webhookCommit, restoring status.commit to the pre-reconcile value before requeuing.
  • Add an integration test that reproduces the “webhook arrives before API consistency” scenario and verifies a job is created once the API “catches up”.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
internal/cmd/controller/gitops/reconciler/gitjob_controller.go Adds mismatch detection between API commit and webhook commit to trigger a delayed requeue without losing the pre-reconcile commit.
integrationtests/gitjob/controller/controller_test.go Adds an integration test covering the webhook-vs-stale-API race when polling is disabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread internal/cmd/controller/gitops/reconciler/gitjob_controller.go Outdated
When disablePolling is true and a webhook delivers a new commit, the
reconciler calls LatestCommit to confirm it against the git server API.
If the provider's API has not yet propagated the push, LatestCommit
returns the old commit, which was written into Status.Commit and
overwrote the webhook value — preventing any deployment job from being
created.

When the fetched commit equals the previously-deployed commit but
differs from Status.WebhookCommit, restore Status.Commit to its
pre-reconcile value and requeue after the default interval. Using
commit == oldCommit as the guard ensures that if the API has already
advanced to some other commit, it is trusted unconditionally — no
requeue is issued and an infinite loop is avoided.
@kkaempf kkaempf added this to the v2.14.1 milestone Mar 18, 2026
@thardeck thardeck marked this pull request as draft March 18, 2026 09:58
When disablePolling is true and the git server API never returns the
webhook-announced commit, the reconciler would silently requeue every
5 s indefinitely with no user-visible indication of the problem.

Track when stale requeuing began in a new GitRepoStatus field
WebhookCommitStaleSince. Once the elapsed time exceeds
GitPollingStaleTimeout (default 5 min), set the GitPolling condition to
False with reason StaleAPI so the problem is visible via kubectl and the
UI. The condition clears automatically once the API returns the expected
commit. Clear the timestamp on LatestCommit errors so the timer tracks
only the "consistently returns old commit" case.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition when spec.disablePolling: true where a webhook-delivered commit could be overwritten by a stale LatestCommit() result from the Git provider API, preventing job creation. It also surfaces a persistent webhook/API mismatch as a visible GitPolling condition error after a configurable timeout.

Changes:

  • Requeue (without overwriting status.commit) when the Git provider API still returns the previously-deployed commit while a newer status.webhookCommit is pending.
  • Add status.webhookCommitStaleSince and, after GitPollingStaleTimeout (default 5m), set GitPolling=False with reason StaleAPI until the API catches up.
  • Add/adjust integration tests to cover API lag, out-of-order webhook arrival, and permanently stale API behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/durations/durations.go Adds default stale-timeout constant for webhook/API mismatch handling.
pkg/apis/fleet.cattle.io/v1alpha1/gitrepo_types.go Introduces WebhookCommitStaleSince in GitRepoStatus.
pkg/apis/fleet.cattle.io/v1alpha1/zz_generated.deepcopy.go Updates deep-copy generation for the new status field.
internal/cmd/controller/gitops/reconciler/gitjob_controller.go Implements requeue/restore behavior and stale timeout condition surfacing; persists new status field.
internal/cmd/controller/gitops/operator.go Wires timeout configuration into the reconciler.
integrationtests/gitjob/controller/suite_test.go Sets a short stale timeout for faster integration testing.
integrationtests/gitjob/controller/controller_test.go Adds integration coverage for API lag, API-ahead webhook, and permanently stale API cases.
charts/fleet-crd/templates/crds.yaml Updates CRD schema to include webhookCommitStaleSince.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread internal/cmd/controller/gitops/reconciler/gitjob_controller.go
If DisablePolling is turned back on or WebhookCommit is cleared after the
stale tracker was set, the field would persist indefinitely because the
code path that clears it was never reached. Clear it proactively whenever
the condition it tracks cannot be active.
@thardeck thardeck moved this to 🏗 In progress in Fleet Mar 18, 2026
@thardeck
Copy link
Copy Markdown
Collaborator Author

Closing for now since we went for a simpler fix.

@thardeck thardeck closed this Mar 19, 2026
@github-project-automation github-project-automation Bot moved this from 🏗 In progress to ✅ Done in Fleet Mar 19, 2026
@thardeck thardeck deleted the implement_4837 branch March 24, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants