feat: pause auto-promotion after rollback #6334

Open
jacobboykin wants to merge 18 commits into main from jacobboykin/stage-pinning-auto-promotion-holds

@jacobboykin jacobboykin commented May 22, 2026

Member

Relates to #3016

Summary

This PR teaches Kargo to remember when someone intentionally rolls a Stage back while auto-promotion is enabled.

Without this, a rollback can appear to succeed, only for the Stage controller to immediately auto-promote the Stage back to the newest candidate. That is technically consistent with auto-promotion, but it is a surprising user experience. The key change: when a user promotes directly to a Stage and selects Freight other than the current auto-promotion candidate, Kargo creates an auto-promotion hold for that Freight's origin. Auto-promotion stays paused for that origin until someone promotes the current candidate or explicitly resumes auto-promotion.

This PR is specifically about the auto-promotion-hold side of rollback behavior. It gives the UI a concrete way to show "auto-promotion is paused because this looked like a rollback".

Mental Model

There are two hold states:

  • Pending: the API has detected rollback intent and created a hold before creating the Promotion, but the Promotion has not succeeded yet. This is deliberately temporary. If the Promotion fails, errors, is aborted, or never gets created, the hold is removed.
  • Active: the rollback Promotion succeeded, so the hold is now preserving that rollback. Active holds block auto-promotion for only that origin. Other origins requested by the Stage can keep moving.

Resume only clears an active hold and refreshes the Stage. It does not force a Promotion. The normal Stage reconciler still decides whether there is a valid auto-promotion to create.
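
The two states and their transitions can be sketched as a small state function. This is a minimal illustration, not the code in this PR: `HoldState`, `PromotionOutcome`, and `NextHoldState` are hypothetical names, and the brief tolerance for a missing Promotion is left out.

```go
package main

import "fmt"

// HoldState is a hypothetical stand-in for the two hold states described above.
type HoldState string

const (
	HoldPending HoldState = "Pending"
	HoldActive  HoldState = "Active"
)

// PromotionOutcome is a hypothetical summary of the linked rollback Promotion.
type PromotionOutcome string

const (
	OutcomeRunning   PromotionOutcome = "Running"
	OutcomeSucceeded PromotionOutcome = "Succeeded"
	OutcomeFailed    PromotionOutcome = "Failed"
	OutcomeErrored   PromotionOutcome = "Errored"
	OutcomeAborted   PromotionOutcome = "Aborted"
)

// NextHoldState returns the state the hold should move to and whether the
// hold is kept at all. An Active hold is never changed here; only a resume
// or a successful promotion of the current candidate clears it.
func NextHoldState(current HoldState, outcome PromotionOutcome) (HoldState, bool) {
	if current == HoldActive {
		return HoldActive, true
	}
	switch outcome {
	case OutcomeSucceeded:
		// The rollback landed, so the hold starts preserving it.
		return HoldActive, true
	case OutcomeFailed, OutcomeErrored, OutcomeAborted:
		// The rollback never took effect, so the hold is removed.
		return "", false
	default:
		// Still in flight: the hold stays Pending.
		return HoldPending, true
	}
}

func main() {
	state, kept := NextHoldState(HoldPending, OutcomeSucceeded)
	fmt.Println(state, kept)
	state, kept = NextHoldState(HoldPending, OutcomeAborted)
	fmt.Println(state, kept)
}
```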

How It Works

  • The promote API looks up the current auto-promotion candidate for the selected Freight origin.
  • If the user selected Freight other than the current auto-promotion candidate, it writes a Pending entry to stage.status.autoPromotionHolds[origin] before creating the non-auto Promotion.
  • The pending hold records the selected Freight, origin, actor/reason, Promotion name, and eventually the Promotion UID.
  • The Stage controller watches the linked Promotion:
    • success turns the hold Active
    • failure/error/abort removes the hold
    • a missing Promotion is tolerated briefly, then the hold is abandoned
  • The auto-promotion path checks live Stage status before creating auto Promotions. Holds block newer auto Promotions for that origin, and pending auto Promotions that lose the race to a hold are aborted with a clear message.
  • If the user promotes the current candidate while a hold is active, the API annotates that Promotion. When it succeeds, the controller clears that exact hold.

The exact-hold matching is intentional. Activate and clear paths compare the live hold against the observed hold identity before mutating status, including Freight, origin, Promotion name/UID, and created timestamp. That keeps stale caches or older Promotions from clearing newer rollback intent.
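
A rough sketch of that compare-before-mutate idea, under assumptions: `HoldIdentity`, `SameHold`, and `ClearIfSame` are hypothetical names, and the real types in this PR may be shaped differently. The point is only that a clear (or activate) is skipped unless the live hold is the same hold the caller observed.

```go
package main

import (
	"fmt"
	"time"
)

// HoldIdentity is a hypothetical grouping of the identity fields listed above.
type HoldIdentity struct {
	FreightName   string
	Origin        string
	PromotionName string
	PromotionUID  string
	CreatedAt     time.Time
}

// SameHold reports whether the live hold is the one the caller observed.
// Timestamps are compared as instants, not as struct values.
func SameHold(live, observed HoldIdentity) bool {
	return live.FreightName == observed.FreightName &&
		live.Origin == observed.Origin &&
		live.PromotionName == observed.PromotionName &&
		live.PromotionUID == observed.PromotionUID &&
		live.CreatedAt.Equal(observed.CreatedAt)
}

// ClearIfSame removes the hold for the observed origin only when the live
// hold still matches. A stale observation leaves newer rollback intent alone.
func ClearIfSame(holds map[string]HoldIdentity, observed HoldIdentity) bool {
	live, ok := holds[observed.Origin]
	if !ok || !SameHold(live, observed) {
		return false
	}
	delete(holds, observed.Origin)
	return true
}

func main() {
	created := time.Date(2026, time.June, 3, 22, 2, 0, 0, time.UTC)
	newer := HoldIdentity{
		FreightName: "freight-b", Origin: "Warehouse/frontend",
		PromotionName: "promo-2", PromotionUID: "uid-2", CreatedAt: created,
	}
	holds := map[string]HoldIdentity{newer.Origin: newer}

	// An older Promotion's view of the hold must not clear the newer one.
	stale := newer
	stale.PromotionName, stale.PromotionUID = "promo-1", "uid-1"
	fmt.Println(ClearIfSame(holds, stale), len(holds))
	fmt.Println(ClearIfSame(holds, newer), len(holds))
}
```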

UI and UX

  • The Promote drawer warns before promoting Freight other than the current auto-promotion candidate.
  • For rollback Promotions, the optional reason field is placed in the drawer footer directly above the Roll back and pause auto-promotion action.
  • DAG Freight tiles show passive Pause pending / Auto-paused tags for the selected origin. These are status labels only.
  • DAG Stage nodes show an auto-promotion-hold icon button in the node header when any origin is held.
  • List view shows a Resume action button next to Promote when the Stage has holds.
  • The DAG header button and list Resume button open the same holds popover.
  • Each origin row links to the held Freight and the current auto-promotion candidate.
  • Active rows expose a per-origin Resume action. Pending rows are disabled until the rollback Promotion settles.

The multi-origin behavior is intentionally origin-scoped: resuming one origin does not resume the others.
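
To make the origin scoping concrete, here is a small sketch that models holds as a map keyed by origin. `Holds`, `Blocks`, and `Resume` are hypothetical names chosen for illustration; the map only mirrors the shape of `stage.status.autoPromotionHolds[origin]` mentioned above.

```go
package main

import "fmt"

// Holds maps an origin (for example "Warehouse/frontend") to the name of
// the Freight being held. Hypothetical type, for illustration only.
type Holds map[string]string

// Blocks reports whether auto-promotion is paused for the given origin.
func (h Holds) Blocks(origin string) bool {
	_, held := h[origin]
	return held
}

// Resume clears the hold for one origin and reports whether one existed.
// Holds on other origins are left untouched.
func (h Holds) Resume(origin string) bool {
	if !h.Blocks(origin) {
		return false
	}
	delete(h, origin)
	return true
}

func main() {
	holds := Holds{
		"Warehouse/frontend": "freight-a",
		"Warehouse/backend":  "freight-b",
	}
	holds.Resume("Warehouse/frontend")
	fmt.Println(holds.Blocks("Warehouse/frontend"), holds.Blocks("Warehouse/backend"))
}
```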

API/CLI

This adds REST support for:

  • GET /v1beta1/projects/{project}/stages/{stage}/auto-promotion/candidates
  • POST /v1beta1/projects/{project}/stages/{stage}/auto-promotion/resume

It also adds kargo resume-auto-promotion --stage ... --origin Warehouse/name.
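
For client authors, a small sketch of building the two routes above. The helper names are hypothetical, and this description does not show how the origin is passed to the resume endpoint, so the sketch stops at the path and makes no assumption about the request body or query parameters.

```go
package main

import (
	"fmt"
	"net/url"
)

// candidatesPath returns the GET route for a Stage's auto-promotion candidates.
func candidatesPath(project, stage string) string {
	return fmt.Sprintf(
		"/v1beta1/projects/%s/stages/%s/auto-promotion/candidates",
		url.PathEscape(project), url.PathEscape(stage),
	)
}

// resumePath returns the POST route for resuming auto-promotion on a Stage.
func resumePath(project, stage string) string {
	return fmt.Sprintf(
		"/v1beta1/projects/%s/stages/%s/auto-promotion/resume",
		url.PathEscape(project), url.PathEscape(stage),
	)
}

func main() {
	fmt.Println(candidatesPath("demo", "prod"))
	fmt.Println(resumePath("demo", "prod"))
}
```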

netlify Bot commented May 22, 2026

Deploy Preview for docs-kargo-io ready!

Name Link
🔨 Latest commit 0b79858
🔍 Latest deploy log https://app.netlify.com/projects/docs-kargo-io/deploys/6a2acc0efbf660000855802a
😎 Deploy Preview https://deploy-preview-6334.docs.kargo.io
@kargo-governance-bot kargo-governance-bot Bot added needs/area Issue or PR needs to be labeled to indicate what parts of the code base are affected needs/kind Issue or PR needs to be labeled to clarify its nature needs/priority Priority has not yet been determined; a good signal that maintainers aren't fully committed labels May 22, 2026
@jacobboykin jacobboykin changed the title Pause auto-promotion after rollback feat: pause auto-promotion after rollback May 22, 2026
@jacobboykin jacobboykin force-pushed the jacobboykin/stage-pinning-auto-promotion-holds branch from 1491789 to 18a0ed0 Compare May 22, 2026 03:28
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
@jacobboykin jacobboykin force-pushed the jacobboykin/stage-pinning-auto-promotion-holds branch from c60a74e to 19e84b8 Compare June 3, 2026 22:02
codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 73.63265% with 323 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.00%. Comparing base (8296b8e) to head (0b79858).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/controller/stages/regular_stages.go | 70.56% | 64 Missing and 24 partials ⚠️ |
| pkg/controller/promotions/promotions.go | 61.25% | 49 Missing and 25 partials ⚠️ |
| ...i/cmd/resumeautopromotion/resume_auto_promotion.go | 20.77% | 61 Missing ⚠️ |
| pkg/server/promote_to_stage_v1alpha1.go | 83.51% | 21 Missing and 9 partials ⚠️ |
| pkg/server/auto_promotion_v1alpha1.go | 79.57% | 19 Missing and 10 partials ⚠️ |
| pkg/api/auto_promotion.go | 91.62% | 14 Missing and 5 partials ⚠️ |
| pkg/cli/cmd/promote/promote.go | 37.50% | 10 Missing ⚠️ |
| api/v1alpha1/request.go | 0.00% | 7 Missing ⚠️ |
| pkg/cli/client/error.go | 77.77% | 1 Missing and 1 partial ⚠️ |
| pkg/server/rest_router.go | 87.50% | 2 Missing ⚠️ |

... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6334      +/-   ##
==========================================
+ Coverage   58.47%   59.00%   +0.52%     
==========================================
  Files         501      505       +4     
  Lines       42141    43139     +998     
==========================================
+ Hits        24642    25454     +812     
- Misses      16011    16131     +120     
- Partials     1488     1554      +66     

rpelczar and others added 2 commits June 4, 2026 02:06
Signed-off-by: Rafal Pelczar <rafal@akuity.io>
Reduce the defensive machinery around auto-promotion holds to the two
correctness boundaries that actually prevent a spurious deployment, and
lean on eventual consistency plus the controller reconcile loop for the
rest.

- The Promotion controller is now the single hard "deployment gate": it
  re-reads the live Stage and aborts a held auto-promotion before any
  promotion steps run. Removed the redundant Stage-controller early-abort
  (abortAutoPromotionsBlockedByHolds and its helpers); a held
  auto-promotion is simply left for the Promotion controller to abort one
  reconcile later, which is cosmetic, not a missed deployment.
- autoPromoteFreight keeps a single "creation gate" hold check, now
  documented as load-bearing: abort-by-hold is retryable, so without it
  the controller would create and abort Promotions in a tight loop while
  a hold is active.
- Removed the unreachable CreatedAt==nil pending-hold path and its helper.
  The API server always stamps CreatedAt before creating the Promotion.
- Simplified createPendingAutoPromotionHold: the candidate precondition is
  checked once up front instead of being re-validated inside every
  optimistic-lock retry. Kept the ambiguous-create cleanup, which guards a
  real deploy-stomp, and documented why.
- Deduplicated the hold timestamp comparator into
  api.AutoPromotionHoldTimesEqual and dropped the unreachable resume
  State!=Active branch.
- Bounded AutoPromotionHold.Reason at the schema level (MaxLength=1024) and
  documented the candidates endpoint's RBAC model.

Add regression tests: hold preservation across the broad status patch, the
no-hot-loop creation gate, and the resume-auto-promotion CLI validate().

No behavioral change to the no-bad-deploy guarantee; net ~185 fewer lines
of production code.
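
A toy simulation of why the creation gate is load-bearing. This is not the controller code: `simulate` is a hypothetical function that only counts what would happen over a number of Stage reconciles while a hold is active, with and without the creation-gate check.

```go
package main

import "fmt"

// simulate counts auto Promotions created and aborted over a number of
// Stage reconciles. The deployment gate (in the Promotion controller)
// always aborts a held auto-promotion before any steps run; the creation
// gate decides whether the Stage controller creates one in the first place.
func simulate(reconciles int, holdActive, creationGate bool) (created, aborted int) {
	for i := 0; i < reconciles; i++ {
		if creationGate && holdActive {
			// Creation gate: skip creating an auto Promotion while held.
			continue
		}
		created++
		if holdActive {
			// Deployment gate: the Promotion is aborted, which is
			// retryable, so the next reconcile tries again.
			aborted++
		}
	}
	return created, aborted
}

func main() {
	c, a := simulate(5, true, false)
	fmt.Println("without creation gate:", c, "created,", a, "aborted")
	c, a = simulate(5, true, true)
	fmt.Println("with creation gate:", c, "created,", a, "aborted")
}
```

In both runs nothing is deployed, since the deployment gate holds either way. The difference is only the churn of Promotions created and aborted on every reconcile.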

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Comment thread pkg/api/auto_promotion.go Outdated
Comment thread api/v1alpha1/stage_types.go Outdated
Address review feedback on the auto-promotion-hold work:

- Remove the now-vestigial RegularStageReconciler.autoPromotionAllowed
  wrapper; call api.IsAutoPromotionEnabled directly at its sole caller and
  consolidate its test coverage into pkg/api.
- Replace AutoPromotionHold.Freight (a full FreightReference) with
  FreightName + Origin, the only fields ever set or read. This drops the
  always-empty artifact arrays from the Stage CRD schema.
- Apply the same reduction to the REST autoPromotionCandidate type.

Regenerate protobuf, deepcopy, CRDs, swagger, and the Go/TS clients.

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
- Guard abortAutoPromotion against a Promotion deleted between a status
  patch conflict and the retry's re-read (recovered panic)
- Clear auto-promotion holds when a regular Stage is converted to a
  control flow Stage, which would otherwise orphan them
- Remove a pending hold with no Promotion name as malformed instead of
  failing the Stage sync forever on empty-name reads
- Sanitize and log 5xx StatusErrors in the REST error middleware instead
  of returning internal details to clients verbatim
- Clear the machine abort-reason annotation on user-requested termination
  so a user abort is never mistaken for a hold abort
- Validate PromotionPolicy label selectors in the ProjectConfig webhook;
  a malformed selector previously failed every promote in the project
- Use state-aware conflict messages for hold-exists rejections; resuming
  cannot clear a pending hold, so stop recommending it
- Stop emitting a lone-space line for comment-less messages in the
  generated API docs (trailing-whitespace source)
- Correct comments that described cache-backed internal client reads as
  live; scope the auto-promotion pause docs to direct Stage promotion
- Document the broadened meaning of the rollback annotation
- Remove the auto-promotion-hold dev test harness

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
@jacobboykin jacobboykin force-pushed the jacobboykin/stage-pinning-auto-promotion-holds branch from 149d153 to 03b0705 Compare June 11, 2026 01:25
…ning-auto-promotion-holds

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>

# Conflicts:
#	api/v1alpha1/generated.pb.go
#	pkg/server/rest_router.go
#	pkg/webhook/kubernetes/promotion/webhook_test.go
#	swagger.json
#	ui/src/gen/api/v1alpha1/generated_pb.ts
#	ui/src/gen/api/v2/models/index.ts
@jacobboykin jacobboykin force-pushed the jacobboykin/stage-pinning-auto-promotion-holds branch from 1d5dd27 to 76aa755 Compare June 11, 2026 01:52
JSON unmarshaling converts metav1.Time to the local time zone, so deep
equality on the round-tripped request is time-zone-dependent: it passed
on machines whose UTC offset made time.Parse reuse the Local location,
and failed deterministically on UTC CI runners.
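
A standalone illustration of the underlying pitfall, using only the standard library rather than `metav1.Time`: two `time.Time` values for the same instant in different locations are not deeply equal, while `Equal` compares the instant. The helper names are hypothetical, and a fixed zone is used so the result does not depend on the machine's time zone.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// sampleTimes returns the same instant expressed in UTC and in a fixed
// UTC-5 zone, standing in for a value before and after a round trip that
// changes its location.
func sampleTimes() (time.Time, time.Time) {
	utc := time.Date(2026, time.June, 11, 1, 52, 0, 0, time.UTC)
	shifted := utc.In(time.FixedZone("UTC-5", -5*60*60))
	return utc, shifted
}

// deepEqual compares the struct values, including the location pointer.
func deepEqual(a, b time.Time) bool { return reflect.DeepEqual(a, b) }

// instantEqual compares only the instant in time.
func instantEqual(a, b time.Time) bool { return a.Equal(b) }

func main() {
	a, b := sampleTimes()
	fmt.Println("deep equal:", deepEqual(a, b))
	fmt.Println("same instant:", instantEqual(a, b))
}
```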

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
Comment thread pkg/api/auto_promotion.go
// are validated up front to keep the hold and its Promotion coherent. Status
// writes use c, so callers pass whichever client is allowed to patch Stage
// status (a controller's own client, or the API server's internal client).
func CreatePendingAutoPromotionHold(

Member Author

This is the intended function for Kargo EE to consume. For example:

 promo, err := kargo.NewPromotionBuilder(c).Build(ctx, *stage, rollbackFreight.Name)
 if err != nil {
     return fmt.Errorf("error building rollback Promotion: %w", err)
 }

 if err = api.CreatePendingAutoPromotionHold(
     ctx,
     c, // client used for Stage status writes (and Promotion create, unless overridden)
     client.ObjectKeyFromObject(stage),
     promo,
     *rollbackFreight,
     api.AutoPromotionHoldOptions{
         Actor:  api.FormatEventControllerActor(controllerName),
         Reason: "Rollback to last known-good Freight",
     },
 ); err != nil {
     if exists, ok := errors.AsType[*api.AutoPromotionHoldExistsError](err); ok {

Signed-off-by: Jacob Boykin <jacob.boykin@akuity.io>
@jacobboykin jacobboykin marked this pull request as ready for review June 11, 2026 14:54
@jacobboykin jacobboykin requested review from a team as code owners June 11, 2026 14:54
Labels

needs/area Issue or PR needs to be labeled to indicate what parts of the code base are affected needs/kind Issue or PR needs to be labeled to clarify its nature needs/priority Priority has not yet been determined; a good signal that maintainers aren't fully committed

4 participants