feat(RELEASE-2120): add managed pipeline retry with mitigations#1683
feat(RELEASE-2120): add managed pipeline retry with mitigations#1683seanconroy2021 wants to merge 1 commit into
Conversation
|
🤖 Finished Review · ✅ Success · Started 8:54 PM UTC · Completed 9:15 PM UTC |
PR Reviewer Guide 🔍(Review updated until commit d1fbb50)Warning
Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Warning
Inline suggestions were posted as code suggestions. |
ReviewFindingsHigh
Medium
Labels: PR renames an exported Go type (ManagedPipelineAttempt to PipelineAttempt), which is a breaking change for external consumers. Previous runReviewFindingsHigh
Medium
Previous run (2)ReviewFindingsMedium
Low
Previous run (3)ReviewFindingsMedium
Low
Labels: PR introduces a new retry feature for managed pipelines with OOM/timeout mitigations. Previous run (4)ReviewFindingsMedium
Low
Previous run (5)ReviewFindingsMedium
Low
Info
Previous run (6)ReviewFindingsHigh
Medium
Low
Info
Previous run (7)ReviewFindingsMedium
Low
|
|
/retest |
a155f18 to
6c427aa
Compare
|
🤖 Finished Review · ✅ Success · Started 9:32 AM UTC · Completed 9:50 AM UTC |
|
🤖 Review · Started 10:47 AM UTC |
PR Code Suggestions ✨Warning
Inline suggestions were posted as code suggestions. |
7e0aa0a to
e813d0a
Compare
|
🤖 Review · Started 10:59 AM UTC |
PR Code Suggestions ✨Warning
No code suggestions found for the PR. |
|
🤖 Finished Review · ✅ Success · Started 10:59 AM UTC · Completed 11:13 AM UTC |
happybhati
left a comment
There was a problem hiding this comment.
Great work, LGTM
e2e happypath failing seems unrelated.
|
/retest |
I don't think it has been updated. Have deleted the CodeCov comment (normally triggers a new one). Will fix anything I am missing. |
ae49387 to
b19e140
Compare
PR Code Suggestions ✨Warning
Inline suggestions were posted as code suggestions. |
| func ApplyPipelineTimeoutMitigation(current *tektonv1.TimeoutFields, mitigation *v1alpha1.TimeoutIncrement) *tektonv1.TimeoutFields { | ||
| if mitigation == nil { | ||
| return current | ||
| } | ||
| adjustedTimeouts := copyTimeoutFields(current) | ||
| newPipelineTimeout := addCappedDuration(adjustedTimeouts.Pipeline, mitigation) | ||
| adjustedTimeouts.Pipeline = &metav1.Duration{Duration: newPipelineTimeout} | ||
|
|
||
| return adjustedTimeouts | ||
| } |
There was a problem hiding this comment.
Suggestion: Call enforcePipelineCeiling to ensure the mitigated pipeline timeout remains valid. [possible issue, importance: 8]
| func ApplyPipelineTimeoutMitigation(current *tektonv1.TimeoutFields, mitigation *v1alpha1.TimeoutIncrement) *tektonv1.TimeoutFields { | |
| if mitigation == nil { | |
| return current | |
| } | |
| adjustedTimeouts := copyTimeoutFields(current) | |
| newPipelineTimeout := addCappedDuration(adjustedTimeouts.Pipeline, mitigation) | |
| adjustedTimeouts.Pipeline = &metav1.Duration{Duration: newPipelineTimeout} | |
| return adjustedTimeouts | |
| } | |
| func ApplyPipelineTimeoutMitigation(current *tektonv1.TimeoutFields, mitigation *v1alpha1.TimeoutIncrement) *tektonv1.TimeoutFields { | |
| if mitigation == nil { | |
| return current | |
| } | |
| adjustedTimeouts := copyTimeoutFields(current) | |
| newPipelineTimeout := addCappedDuration(adjustedTimeouts.Pipeline, mitigation) | |
| adjustedTimeouts.Pipeline = &metav1.Duration{Duration: newPipelineTimeout} | |
| enforcePipelineCeiling(adjustedTimeouts) | |
| return adjustedTimeouts | |
| } |
|
🤖 Review · ❌ Terminated · Started 5:09 PM UTC · Ended 5:15 PM UTC |
|
🤖 Finished Review · ❌ Failure · Started 5:09 PM UTC · Completed 5:15 PM UTC |
happybhati
left a comment
There was a problem hiding this comment.
Hey @seanconroy2021 is the codecov report valid?
#1683 (comment)
I think it's correct. I am missing tests for computeRetryOverrides and retryManagedPipeline. Adding them back in now :) |
b19e140 to
d1fbb50
Compare
Have gone through the codecov and added in any missing tests |
|
🤖 Review · ❌ Terminated · Started 7:55 AM UTC · Ended 8:11 AM UTC |
PR Code Suggestions ✨Warning
Inline suggestions were posted as code suggestions. |
| cleaned, err := adapter.loader.GetReleasePipelineRunAttempt(adapter.ctx, adapter.client, adapter.release, 0) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| Expect(cleaned).NotTo(BeNil()) | ||
| Expect(cleaned.Finalizers).To(HaveLen(0)) | ||
| Expect(k8sClient.Delete(ctx, firstPipelineRun)).To(Succeed()) |
There was a problem hiding this comment.
Suggestion: Use the newly fetched cleaned object for deletion instead of the original pointer. [possible issue, importance: 6]
| cleaned, err := adapter.loader.GetReleasePipelineRunAttempt(adapter.ctx, adapter.client, adapter.release, 0) | |
| Expect(err).NotTo(HaveOccurred()) | |
| Expect(cleaned).NotTo(BeNil()) | |
| Expect(cleaned.Finalizers).To(HaveLen(0)) | |
| Expect(k8sClient.Delete(ctx, firstPipelineRun)).To(Succeed()) | |
| cleaned, err := adapter.loader.GetReleasePipelineRunAttempt(adapter.ctx, adapter.client, adapter.release, 0) | |
| Expect(err).NotTo(HaveOccurred()) | |
| Expect(cleaned).NotTo(BeNil()) | |
| Expect(cleaned.Finalizers).To(HaveLen(0)) | |
| Expect(k8sClient.Delete(ctx, cleaned)).To(Succeed()) |
| pipelineRun, err := a.createManagedPipelineRun(resources, taskRunSpecs, timeouts) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| return a.cleanupPipelineResources(pipelineRun, roleBindings...) | ||
| a.logger.Info(fmt.Sprintf("Created %s Release PipelineRun (retry)", metadata.ManagedPipelineType), | ||
| "PipelineRun.Name", pipelineRun.Name, "PipelineRun.Namespace", pipelineRun.Namespace, | ||
| "Attempt", len(a.release.Status.ManagedPipelineAttempts)) | ||
|
|
||
| return a.registerManagedProcessingData(pipelineRun, tenantRoleBinding) |
There was a problem hiding this comment.
Suggestion: Add a check to reuse an existing PipelineRun for the next attempt before creating a new one in retryManagedPipeline(). [possible issue, importance: 9]
| pipelineRun, err := a.createManagedPipelineRun(resources, taskRunSpecs, timeouts) | |
| if err != nil { | |
| return err | |
| } | |
| return a.cleanupPipelineResources(pipelineRun, roleBindings...) | |
| a.logger.Info(fmt.Sprintf("Created %s Release PipelineRun (retry)", metadata.ManagedPipelineType), | |
| "PipelineRun.Name", pipelineRun.Name, "PipelineRun.Namespace", pipelineRun.Namespace, | |
| "Attempt", len(a.release.Status.ManagedPipelineAttempts)) | |
| return a.registerManagedProcessingData(pipelineRun, tenantRoleBinding) | |
| nextAttempt := len(a.release.Status.ManagedPipelineAttempts) | |
| existing, getErr := a.loader.GetReleasePipelineRunAttempt(a.ctx, a.client, a.release, nextAttempt) | |
| if getErr != nil && !errors.IsNotFound(getErr) { | |
| return getErr | |
| } | |
| if existing != nil { | |
| a.logger.Info(fmt.Sprintf("Reusing %s Release PipelineRun (retry)", metadata.ManagedPipelineType), | |
| "PipelineRun.Name", existing.Name, "PipelineRun.Namespace", existing.Namespace, | |
| "Attempt", nextAttempt) | |
| return a.registerManagedProcessingData(existing, tenantRoleBinding) | |
| } | |
| pipelineRun, err := a.createManagedPipelineRun(resources, taskRunSpecs, timeouts) | |
| if err != nil { | |
| return err | |
| } | |
| a.logger.Info(fmt.Sprintf("Created %s Release PipelineRun (retry)", metadata.ManagedPipelineType), | |
| "PipelineRun.Name", pipelineRun.Name, "PipelineRun.Namespace", pipelineRun.Namespace, | |
| "Attempt", nextAttempt) | |
| return a.registerManagedProcessingData(pipelineRun, tenantRoleBinding) |
Superseded by updated review
|
🤖 Finished Review · ✅ Success · Started 7:55 AM UTC · Completed 8:11 AM UTC |
d1fbb50 to
deb8caa
Compare
|
🤖 Review · ❌ Terminated · Started 8:29 AM UTC · Ended 8:43 AM UTC |
|
🤖 Finished Review · ❌ Failure · Started 8:29 AM UTC · Completed 8:43 AM UTC |
|
Scenario: konflux-e2e-tests
Inspecting Test ArtifactsTo inspect your test artifacts, follow these steps:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/konflux-test-storage/konflux-team/release-service:konflux-e2e-tests-zc87pTest results analysis🚨 Failed to provision a cluster, see the log for more details: Click to view logs�[37mDEBU�[0m running 'mapt aws kind create'
�[37mDEBU�[0m context initialized for mapte90bd553
�[37mDEBU�[0m checking stack spotOption-kind-konflux-e2e-tests-zc87p
�[37mDEBU�[0m managing stack spotOption-kind-konflux-e2e-tests-zc87p
�[36mINFO�[0m Updating (spotOption-kind-konflux-e2e-tests-zc87p):
�[36mINFO�[0m
�[36mINFO�[0m + pulumi:pulumi:Stack kind-konflux-e2e-tests-zc87p-spotOption-kind-konflux-e2e-tests-zc87p creating (0s)
�[37mDEBU�[0m Based on prices for instance types [m5a.4xlarge m6in.4xlarge m6a.4xlarge m6i.4xlarge m6id.4xlarge m5.4xlarge m5ad.4xlarge m4.4xlarge] is az ap-south-1a, current price is 0.27 with a score of 9
�[37mDEBU�[0m Spot data: {[m5a.4xlarge m6in.4xlarge m6a.4xlarge m6i.4xlarge m6id.4xlarge m5.4xlarge m5ad.4xlarge m4.4xlarge] 0.4932 ap-south-1 ap-south-1a 0}
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + rh:qe:aws:bso bso-bso creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + pulumi:pulumi:Stack kind-konflux-e2e-tests-zc87p-spotOption-kind-konflux-e2e-tests-zc87p created (3s)
�[36mINFO�[0m + rh:qe:aws:bso bso-bso created
�[36mINFO�[0m Outputs:
�[36mINFO�[0m az : "ap-south-1a"
�[36mINFO�[0m instancetypes: [
�[36mINFO�[0m [0]: "m5a.4xlarge"
�[36mINFO�[0m [1]: "m6in.4xlarge"
�[36mINFO�[0m [2]: "m6a.4xlarge"
�[36mINFO�[0m [3]: "m6i.4xlarge"
�[36mINFO�[0m [4]: "m6id.4xlarge"
�[36mINFO�[0m [5]: "m5.4xlarge"
�[36mINFO�[0m [6]: "m5ad.4xlarge"
�[36mINFO�[0m [7]: "m4.4xlarge"
�[36mINFO�[0m ]
�[36mINFO�[0m max : 0.4932
�[36mINFO�[0m region : "ap-south-1"
�[36mINFO�[0m score : 0
�[36mINFO�[0m
�[36mINFO�[0m Resources:
�[36mINFO�[0m + 2 created
�[36mINFO�[0m
�[36mINFO�[0m Duration: 4s
�[36mINFO�[0m
�[37mDEBU�[0m managing stack stackKind-kind-konflux-e2e-tests-zc87p
�[36mINFO�[0m Updating (stackKind-kind-konflux-e2e-tests-zc87p):
�[36mINFO�[0m
�[36mINFO�[0m + pulumi:pulumi:Stack kind-konflux-e2e-tests-zc87p-stackKind-kind-konflux-e2e-tests-zc87p creating (0s)
�[36mINFO�[0m @ updating..............
�[36mINFO�[0m + aws:ec2:Vpc vpc-main-akd-net creating (0s)
�[36mINFO�[0m + aws:ec2:Eip main-akd-lbeip creating (0s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + tls:index:PrivateKey main-akd-pk creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + tls:index:PrivateKey main-akd-pk created (0.87s)
�[36mINFO�[0m + aws:ec2:Eip main-akd-lbeip created (2s)
�[36mINFO�[0m + aws:ec2:KeyPair main-akd-pk creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:ec2:Vpc vpc-main-akd-net created (3s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-9443 creating (0s)
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-6443 creating (0s)
�[36mINFO�[0m + aws:ec2:SecurityGroup main-akd-sg creating (0s)
�[36mINFO�[0m + aws:ec2:KeyPair main-akd-pk created (1s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-22 creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-9443 created (3s)
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-6443 created (3s)
�[36mINFO�[0m + aws:ec2:Subnet subnet-publicmain-akd-net0 creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-22 created (2s)
�[36mINFO�[0m + aws:ec2:InternetGateway igw-main-akd-net creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:ec2:SecurityGroup default-main-akd-net-main-akd-net creating (0s)
�[36mINFO�[0m + aws:ec2:SecurityGroup main-akd-sg created (5s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:ec2:Subnet subnet-publicmain-akd-net0 created (2s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-8888 creating (0s)
�[36mINFO�[0m + aws:ec2:InternetGateway igw-main-akd-net created (2s)
�[37mDEBU�[0m Requesting a spot instance of types: m5a.4xlarge, m6in.4xlarge, m6a.4xlarge, m6i.4xlarge, m6id.4xlarge, m5.4xlarge, m5ad.4xlarge, m4.4xlarge at ap-south-1a paying: 0.493200
�[36mINFO�[0m + aws:lb:LoadBalancer main-akd-lb creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:ec2:LaunchTemplate main-akd-lt creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:lb:TargetGroup main-akd-tg-8888 created (2s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:ec2:RouteTable routeTable-publicmain-akd-net0 creating (0s)
�[36mINFO�[0m + aws:ec2:SecurityGroup default-main-akd-net-main-akd-net created (5s)
�[36mINFO�[0m @ updating......
�[36mINFO�[0m + aws:ec2:RouteTable routeTable-publicmain-akd-net0 created (2s)
�[36mINFO�[0m + aws:ec2:RouteTableAssociation routeTableAssociation-publicmain-akd-net0 creating (0s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:ec2:RouteTableAssociation routeTableAssociation-publicmain-akd-net0 created (1s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + aws:ec2:LaunchTemplate main-akd-lt created (7s)
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + aws:autoscaling:Group main-akd-asg creating (0s)
�[36mINFO�[0m @ updating................
�[36mINFO�[0m + aws:autoscaling:Group main-akd-asg created (13s)
�[36mINFO�[0m @ updating...................................................................................................................................................
�[36mINFO�[0m + aws:lb:LoadBalancer main-akd-lb created (167s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-8888 creating (0s)
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-6443 creating (0s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-8888 created (2s)
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-22 creating (0s)
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (0s)
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-6443 created (2s)
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-9443 creating (0s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-22 created (2s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + aws:lb:Listener main-akd-listener-9443 created (3s)
�[36mINFO�[0m @ updating...............
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (16s) Dial 1/inf failed: retrying
�[36mINFO�[0m @ updating...................
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (31s) Dial 2/inf failed: retrying
�[36mINFO�[0m @ updating..................
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (46s) Dial 3/inf failed: retrying
�[36mINFO�[0m @ updating..................
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (62s) Dial 4/inf failed: retrying
�[36mINFO�[0m @ updating..................
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (77s) Dial 5/inf failed: retrying
�[36mINFO�[0m @ updating.......
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s)
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) status: done
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) extended_status: done
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) boot_status_code: enabled-by-generator
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) last_update: Thu, 01 Jan 1970 00:00:53 +0000
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) detail: DataSourceEc2Local
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) errors: []
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd creating (81s) recoverable_errors: {}
�[36mINFO�[0m + command:remote:Command main-kind-readiness-akd-cmd created (81s) recoverable_errors: {}
�[36mINFO�[0m @ updating....
�[36mINFO�[0m + command:remote:Command main-kubeconfig-akd-cmd creating (0s)
�[36mINFO�[0m @ updating......
�[36mINFO�[0m + command:remote:Command main-kubeconfig-akd-cmd created (2s)
�[36mINFO�[0m @ updating.....
�[36mINFO�[0m + pulumi:pulumi:Stack kind-konflux-e2e-tests-zc87p-stackKind-kind-konflux-e... the content is too long - please download the artifact to see the full content
OCI Artifact Browser URL |
Retry managed PipelineRuns that fail with OOM or timeout. Generic errors are not retried. *Add EnsureManagedPipelineProcessingIsCompleted operation for retry or mark release as failed *Keep managed condition as Progressing during retries *Apply memory and timeout mitigations on each retry using the failed PipelineRun specs as the base *Bump pipeline timeouts when task timeout grows *Add attempt label for PipelineRun lookup *Handle cleanup across multiple PipelineRun attempts *Rename ManagedPipelineAttempt to PipelineAttempt *Add GetRoleBindingFromPipelineAttempt and GetReleasePipelineRunAttempt to loader Assisted-by: Claude Signed-off-by: Sean Conroy <sconroy@redhat.com>
deb8caa to
98c225b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1683 +/- ##
==========================================
+ Coverage 87.43% 87.46% +0.03%
==========================================
Files 34 35 +1
Lines 3566 3846 +280
==========================================
+ Hits 3118 3364 +246
- Misses 285 303 +18
- Partials 163 179 +16
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
|
🤖 Finished Review · ❌ Failure · Started 9:50 AM UTC · Completed 10:07 AM UTC |
Retry managed PipelineRuns that fail with OOM or timeout. Generic errors are not retried.
Assisted-by: Claude