feat: pause auto-promotion after rollback #6334
Open
jacobboykin wants to merge 18 commits into
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6334      +/-   ##
==========================================
+ Coverage   58.47%   59.00%   +0.52%
==========================================
  Files         501      505       +4
  Lines       42141    43139     +998
==========================================
+ Hits        24642    25454     +812
- Misses      16011    16131     +120
- Partials     1488     1554      +66
Signed-off-by: Rafal Pelczar <rafal@akuity.io>
Reduce the defensive machinery around auto-promotion holds to the two correctness boundaries that actually prevent a spurious deployment, and lean on eventual consistency plus the controller reconcile loop for the rest.

- The Promotion controller is now the single hard "deployment gate": it re-reads the live Stage and aborts a held auto-promotion before any promotion steps run. Removed the redundant Stage-controller early-abort (abortAutoPromotionsBlockedByHolds and its helpers); a held auto-promotion is simply left for the Promotion controller to abort one reconcile later, which is cosmetic, not a missed deployment.
- autoPromoteFreight keeps a single "creation gate" hold check, now documented as load-bearing: abort-by-hold is retryable, so without it the controller would create and abort Promotions in a tight loop while a hold is active.
- Removed the unreachable CreatedAt==nil pending-hold path and its helper. The API server always stamps CreatedAt before creating the Promotion.
- Simplified createPendingAutoPromotionHold: the candidate precondition is checked once up front instead of being re-validated inside every optimistic-lock retry. Kept the ambiguous-create cleanup, which guards a real deploy-stomp, and documented why.
- Deduplicated the hold timestamp comparator into api.AutoPromotionHoldTimesEqual and dropped the unreachable resume State!=Active branch.
- Bounded AutoPromotionHold.Reason at the schema level (MaxLength=1024) and documented the candidates endpoint's RBAC model.

Add regression tests: hold preservation across the broad status patch, the no-hot-loop creation gate, and the resume-auto-promotion CLI validate().

No behavioral change to the no-bad-deploy guarantee; net ~185 fewer lines of production code.

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
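The "creation gate" rationale in the commit message above can be illustrated with a toy model. This is a minimal sketch, not the controller's code: `stageModel`, `reconcile`, and `runReconciles` are hypothetical names, and a reconcile is reduced to two counters. It only shows why a retryable abort without a creation-time check produces one created-then-aborted Promotion per reconcile while a hold is in place.

```go
package main

import "fmt"

// stageModel is a hypothetical stand-in for a Stage with a hold in place.
type stageModel struct {
	held    bool // true while a hold blocks auto-promotion
	created int  // auto-promotions created so far
	aborted int  // auto-promotions aborted by the deployment gate
}

// reconcile models one pass of the Stage reconcile loop.
func (s *stageModel) reconcile(creationGate bool) {
	if creationGate && s.held {
		return // creation gate: do not create a Promotion while held
	}
	s.created++
	if s.held {
		// Deployment gate: the held auto-promotion is aborted before any
		// steps run. The abort is retryable, so the next reconcile would
		// try again.
		s.aborted++
	}
}

// runReconciles runs n reconciles against a held Stage and reports how many
// Promotions were created and aborted.
func runReconciles(n int, creationGate bool) (created, aborted int) {
	s := &stageModel{held: true}
	for i := 0; i < n; i++ {
		s.reconcile(creationGate)
	}
	return s.created, s.aborted
}

func main() {
	c, a := runReconciles(5, true)
	fmt.Println("with creation gate:   ", c, a)
	c, a = runReconciles(5, false)
	fmt.Println("without creation gate:", c, a)
}
```

In this model the deployment gate alone still prevents a bad deploy; the creation gate only removes the churn.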
jessesuen
reviewed
Jun 4, 2026
hairyhum
reviewed
Jun 5, 2026
Address review feedback on the auto-promotion-hold work:

- Remove the now-vestigial RegularStageReconciler.autoPromotionAllowed wrapper; call api.IsAutoPromotionEnabled directly at its sole caller and consolidate its test coverage into pkg/api.
- Replace AutoPromotionHold.Freight (a full FreightReference) with FreightName + Origin, the only fields ever set or read. This drops the always-empty artifact arrays from the Stage CRD schema.
- Apply the same reduction to the REST autoPromotionCandidate type.

Regenerate protobuf, deepcopy, CRDs, swagger, and the Go/TS clients.

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
- Guard abortAutoPromotion against a Promotion deleted between a status patch conflict and the retry's re-read (recovered panic)
- Clear auto-promotion holds when a regular Stage is converted to a control flow Stage, which would otherwise orphan them
- Remove a pending hold with no Promotion name as malformed instead of failing the Stage sync forever on empty-name reads
- Sanitize and log 5xx StatusErrors in the REST error middleware instead of returning internal details to clients verbatim
- Clear the machine abort-reason annotation on user-requested termination so a user abort is never mistaken for a hold abort
- Validate PromotionPolicy label selectors in the ProjectConfig webhook; a malformed selector previously failed every promote in the project
- Use state-aware conflict messages for hold-exists rejections; resuming cannot clear a pending hold, so stop recommending it
- Stop emitting a lone-space line for comment-less messages in the generated API docs (trailing-whitespace source)
- Correct comments that described cache-backed internal client reads as live; scope the auto-promotion pause docs to direct Stage promotion
- Document the broadened meaning of the rollback annotation
- Remove the auto-promotion-hold dev test harness

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
…ning-auto-promotion-holds

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>

# Conflicts:
# api/v1alpha1/generated.pb.go
# pkg/server/rest_router.go
# pkg/webhook/kubernetes/promotion/webhook_test.go
# swagger.json
# ui/src/gen/api/v1alpha1/generated_pb.ts
# ui/src/gen/api/v2/models/index.ts
JSON unmarshaling converts metav1.Time to the local time zone, so deep equality on the round-tripped request is time-zone-dependent: it passed on machines whose UTC offset made time.Parse reuse the Local location, and failed deterministically on UTC CI runners. Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
jacobboykin
commented
Jun 11, 2026
// are validated up front to keep the hold and its Promotion coherent. Status
// writes use c, so callers pass whichever client is allowed to patch Stage
// status (a controller's own client, or the API server's internal client).
func CreatePendingAutoPromotionHold(
This is the intended method for kargo EE to consume. For example:
promo, err := kargo.NewPromotionBuilder(c).Build(ctx, *stage, rollbackFreight.Name)
if err != nil {
	return fmt.Errorf("error building rollback Promotion: %w", err)
}
if err = api.CreatePendingAutoPromotionHold(
	ctx,
	c, // client used for Stage status writes (and Promotion create, unless overridden)
	client.ObjectKeyFromObject(stage),
	promo,
	*rollbackFreight,
	api.AutoPromotionHoldOptions{
		Actor:  api.FormatEventControllerActor(controllerName),
		Reason: "Rollback to last known-good Freight",
	},
); err != nil {
	if exists, ok := errors.AsType[*api.AutoPromotionHoldExistsError](err); ok {
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Relates to #3016
Summary
This PR teaches Kargo to remember when someone intentionally rolls a Stage back while auto-promotion is enabled.
Without this, a rollback can look like it worked and then the Stage controller can immediately auto-promote the Stage back to the newest candidate. That is technically consistent with auto-promotion, but it is surprising UX. The key change here is that a user-directed Promotion directly to a Stage, selecting Freight other than the current auto-promotion candidate, creates an auto-promotion hold for that Freight origin, and auto-promotion stays paused for that origin until someone promotes the current candidate or resumes it.
This PR is specifically about the auto-promotion-hold side of rollback behavior. It gives the UI a concrete way to show "auto-promotion is paused because this looked like a rollback".
Mental Model
There are two hold states:
- Pending: the API has detected rollback intent and created a hold before creating the Promotion, but the Promotion has not succeeded yet. This is deliberately temporary. If the Promotion fails, errors, is aborted, or never gets created, the hold is removed.
- Active: the rollback Promotion succeeded, so the hold is now preserving that rollback. Active holds block auto-promotion for only that origin. Other origins requested by the Stage can keep moving.

Resume only clears an active hold and refreshes the Stage. It does not force a Promotion. The normal Stage reconciler still decides whether there is a valid auto-promotion to create.
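The two states and the resume rule above can be sketched as a small state machine. This is an illustrative model only: `Holds`, `Settle`, `Resume`, and `BlockedByActiveHold` are hypothetical names that do not come from the PR, and holds are reduced to a map from origin to state. How a pending hold affects auto-promotion is not modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// HoldState is a hypothetical enum for the two hold states.
type HoldState string

const (
	HoldPending HoldState = "Pending"
	HoldActive  HoldState = "Active"
)

// Holds maps an origin (for example "Warehouse/name") to its hold state.
type Holds map[string]HoldState

// ErrNoActiveHold is returned when resume is requested without an active hold.
var ErrNoActiveHold = errors.New("no active hold for origin")

// Settle applies the outcome of the rollback Promotion to a pending hold:
// success activates the hold, any other outcome removes it.
func (h Holds) Settle(origin string, succeeded bool) {
	if h[origin] != HoldPending {
		return
	}
	if succeeded {
		h[origin] = HoldActive
		return
	}
	delete(h, origin)
}

// Resume clears an active hold for one origin and leaves all others alone.
// It does not create a Promotion.
func (h Holds) Resume(origin string) error {
	if h[origin] != HoldActive {
		return ErrNoActiveHold
	}
	delete(h, origin)
	return nil
}

// BlockedByActiveHold reports whether an active hold pauses this origin.
func (h Holds) BlockedByActiveHold(origin string) bool {
	return h[origin] == HoldActive
}

func main() {
	holds := Holds{"Warehouse/a": HoldPending, "Warehouse/b": HoldPending}
	holds.Settle("Warehouse/a", true)  // rollback succeeded
	holds.Settle("Warehouse/b", false) // rollback failed
	fmt.Println(holds.BlockedByActiveHold("Warehouse/a"), holds.BlockedByActiveHold("Warehouse/b"))
	fmt.Println(holds.Resume("Warehouse/b"))
	fmt.Println(holds.Resume("Warehouse/a"), len(holds))
}
```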
How It Works
- Pending entry to stage.status.autoPromotionHolds[origin] before creating the non-auto Promotion.
- Active

The exact-hold matching is intentional. Activate and clear paths compare the live hold against the observed hold identity before mutating status, including Freight, origin, Promotion name/UID, and created timestamp. That keeps stale caches or older Promotions from clearing newer rollback intent.
UI and UX
- Roll back and pause auto-promotion action.
- Pause pending / Auto-paused tags for the selected origin. These are status labels only
- Resume action button next to Promote when the Stage has holds.
- Resume button open the same holds popover.
- Resume action. Pending rows are disabled until the rollback Promotion settles.

The multi-origin behavior is intentionally origin-scoped: resuming one origin does not resume the others.
API/CLI
This adds REST support for:
- GET /v1beta1/projects/{project}/stages/{stage}/auto-promotion/candidates
- POST /v1beta1/projects/{project}/stages/{stage}/auto-promotion/resume

It also adds kargo resume-auto-promotion --stage ... --origin Warehouse/name.
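For a client that calls these endpoints, the two paths differ only in their last segment. The helper below is a hypothetical sketch (`autoPromotionPath` is not part of Kargo's client) that fills in the `{project}` and `{stage}` placeholders listed above. The request body, authentication, and response shapes are not described in this PR summary, so they are deliberately left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// apiPrefix is the version prefix shown in the endpoint list above.
const apiPrefix = "/v1beta1"

// autoPromotionPath builds the path for an auto-promotion endpoint on a Stage.
// action is expected to be "candidates" (GET) or "resume" (POST).
func autoPromotionPath(project, stage, action string) string {
	return fmt.Sprintf(
		"%s/projects/%s/stages/%s/auto-promotion/%s",
		apiPrefix,
		url.PathEscape(project), // escape names so they stay one path segment
		url.PathEscape(stage),
		action,
	)
}

func main() {
	fmt.Println("GET ", autoPromotionPath("demo", "prod", "candidates"))
	fmt.Println("POST", autoPromotionPath("demo", "prod", "resume"))
}
```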