Skip to content

fix(KONFLUX-13012): clean up orphaned IRs before creating new ones#734

Open
swickersh wants to merge 1 commit into
konflux-ci:mainfrom
swickersh:konflux-13012
Open

fix(KONFLUX-13012): clean up orphaned IRs before creating new ones#734
swickersh wants to merge 1 commit into
konflux-ci:mainfrom
swickersh:konflux-13012

Conversation

@swickersh

@swickersh swickersh commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

When the managed task is retried (e.g. due to timeout), the previous InternalRequest and its PipelineRun on the internal cluster continue running. The retry creates a new InternalRequest, leading to concurrent operations that cause duplicate Pulp pushes and CGW "Record already present" errors.

Before creating a new InternalRequest, the script now deletes any existing InternalRequests with the same PipelineRun UID label. Combined with the finalizer change in internal-services(PR 709), this ensures the associated PipelineRun is cancelled before the new one starts.

Assisted-by: Cursor AI

@codecov-commenter

codecov-commenter commented Apr 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.65%. Comparing base (fbf0f92) to head (ec666a3).
⚠️ Report is 17 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #734      +/-   ##
==========================================
+ Coverage   95.37%   95.65%   +0.27%     
==========================================
  Files          63       72       +9     
  Lines        6318     7111     +793     
==========================================
+ Hits         6026     6802     +776     
- Misses        292      309      +17     
Flag Coverage Δ
unit-tests 95.65% <ø> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 14 files with indirect coverage changes


Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fbf0f92...ec666a3. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@swickersh

Copy link
Copy Markdown
Contributor Author

/ok-to-test

@swickersh swickersh force-pushed the konflux-13012 branch 3 times, most recently from 659455f to 0fc6ae5 Compare May 1, 2026 12:38
@swickersh

Copy link
Copy Markdown
Contributor Author

/retest

Comment thread utils/internal-request Outdated
@swickersh swickersh marked this pull request as ready for review May 5, 2026 15:18
@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Clean up orphaned InternalRequests before creating new ones

🐞 Bug fix 🧪 Tests

Grey Divider

Walkthroughs

Description
• Clean up orphaned InternalRequests before creating new ones
• Prevents duplicate Pulp pushes and CGW errors on task retry
• Deletes existing IRs with matching PipelineRun UID label
• Includes comprehensive test coverage for cleanup logic
Diagram
flowchart LR
  A["Task Retry Triggered"] --> B["Extract PipelineRun UID Label"]
  B --> C["Query Existing InternalRequests"]
  C --> D{"IRs Found?"}
  D -->|Yes| E["Delete Orphaned IRs"]
  E --> F["Wait 5 Seconds"]
  F --> G["Create New InternalRequest"]
  D -->|No| G
  G --> H["PipelineRun Proceeds"]
Loading

Grey Divider

File Changes

1. utils/internal-request 🐞 Bug fix +31/-0

Add orphaned InternalRequest cleanup logic

• Extracts pipelinerun-uid label value from provided labels
• Queries existing InternalRequests matching the PipelineRun UID
• Deletes all found orphaned InternalRequests with cleanup confirmation
• Waits 5 seconds to allow PipelineRun cancellation to propagate

utils/internal-request


2. utils/tests/test_internal_request.py 🧪 Tests +130/-0

Add comprehensive tests for InternalRequest cleanup

• Tests cleanup of existing InternalRequests when PipelineRun UID label present
• Tests skipping cleanup when no existing InternalRequests found
• Tests skipping cleanup when PipelineRun UID label is absent
• Mocks kubectl and sleep commands to verify correct cleanup sequence

utils/tests/test_internal_request.py


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented May 5, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (1)

Context used

Grey Divider


Action required

1. Cleanup ignores delete failures 🐞 Bug ☼ Reliability
Description
The orphan cleanup suppresses kubectl get/delete errors and continues to create a new
InternalRequest even if old InternalRequests were not actually deleted (e.g., delete times out at
60s), so the previous internal PipelineRun may keep running concurrently and the duplicate-work bug
can still occur.
Code

utils/internal-request[R283-292]

+if [ -n "$PIPELINERUN_UID_VALUE" ]; then
+    EXISTING_IRS=$(kubectl get internalrequest -l "${PIPELINERUN_UID_LABEL}=${PIPELINERUN_UID_VALUE}" \
+        --no-headers -o custom-columns=":metadata.name" 2>/dev/null || true)
+
+    if [ -n "$EXISTING_IRS" ]; then
+        echo "Found existing InternalRequests from prior attempts. Cleaning up..."
+        for IR_NAME in $EXISTING_IRS; do
+            echo "Deleting InternalRequest ${IR_NAME}..."
+            kubectl delete internalrequest "$IR_NAME" --wait=true --timeout=60s 2>/dev/null || true
+        done
Evidence
The script explicitly ignores failures from both listing and deleting existing InternalRequests
(stderr redirected and || true), then unconditionally creates a new InternalRequest afterwards. If
deletion doesn’t complete (timeout, API/RBAC issues, finalizer taking longer than 60s), the old IR
can remain and continue running while the new one starts.

utils/internal-request[283-300]
utils/internal-request[267-271]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`utils/internal-request` attempts to prevent concurrent internal executions by deleting prior InternalRequests with the same PipelineRun UID label, but it currently suppresses `kubectl` errors/timeouts and proceeds to create a new InternalRequest anyway. This defeats the purpose of the cleanup whenever deletion is slow or fails.

### Issue Context
- `kubectl delete ... --wait=true --timeout=60s` can legitimately exceed 60s when finalizers/controllers are involved.
- Because errors are silenced, operators also lose the signal that cleanup failed.

### Fix Focus Areas
- utils/internal-request[283-300]

### Suggested fix
- Do **not** redirect stderr to `/dev/null` and do **not** `|| true` the delete path.
- Use a more robust wait strategy:
 - Option A: `kubectl delete internalrequest -l "${PIPELINERUN_UID_LABEL}=${PIPELINERUN_UID_VALUE}" --wait=true --timeout=<larger>` and fail if non-zero.
 - Option B: Retry deletion/poll until no matching IRs remain (bounded by overall script timeout).
- If cleanup fails, `exit 1` so Tekton retries *without* starting a concurrent IR.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. sleep 5 without PipelineRun check 📎 Requirement gap ☼ Reliability
Description
After deleting prior InternalRequest CRs, the script only waits a fixed sleep 5 and then creates
a new InternalRequest without verifying the prior associated PipelineRun(s) have actually
stopped. This can allow the old PipelineRun to keep running in parallel with the new attempt if
cancellation propagation takes longer than 5 seconds.
Code

utils/internal-request[R293-300]

+        echo "Cleanup complete. Waiting for PipelineRun cancellation to propagate..."
+        sleep 5
+    fi
+fi
+
# Create InternalRequest using kubectl
RESOURCE=$(echo "$PAYLOAD" | kubectl create -f - -o json)
INTERNAL_REQUEST_NAME=$(echo "$RESOURCE" | jq -r '.metadata.name')
Evidence
PR Compliance ID 2 requires that PipelineRun(s) associated with prior InternalRequest(s) are
stopped on retry; however the new logic deletes the old InternalRequest(s) and proceeds after a
fixed sleep without confirming termination/cancellation of the associated PipelineRun(s).

Stop the PipelineRun associated with a prior InternalRequest when a retry occurs
utils/internal-request[293-300]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The retry cleanup deletes prior `InternalRequest` CRs but does not verify the associated `PipelineRun`(s) are actually stopped before creating a new `InternalRequest`.

## Issue Context
A fixed `sleep 5` is not a reliable guarantee that cancellation has completed; if cancellation is delayed, the old `PipelineRun` may still run concurrently with the new attempt.

## Fix Focus Areas
- utils/internal-request[283-300]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Advisory comments

3. Fragile IR list parsing 🐞 Bug ⚙ Maintainability
Description
The cleanup collects InternalRequest names via human-oriented kubectl output and then word-splits
it, so any unexpected tokens in the output will be treated as resource names and passed to kubectl
delete.
Code

utils/internal-request[R284-290]

+    EXISTING_IRS=$(kubectl get internalrequest -l "${PIPELINERUN_UID_LABEL}=${PIPELINERUN_UID_VALUE}" \
+        --no-headers -o custom-columns=":metadata.name" 2>/dev/null || true)
+
+    if [ -n "$EXISTING_IRS" ]; then
+        echo "Found existing InternalRequests from prior attempts. Cleaning up..."
+        for IR_NAME in $EXISTING_IRS; do
+            echo "Deleting InternalRequest ${IR_NAME}..."
Evidence
The cleanup uses kubectl get ... -o custom-columns and iterates with `for IR_NAME in
$EXISTING_IRS`, which is less robust than using machine-readable output. Elsewhere in this repo,
label-based IR queries use -o json + jq to avoid output parsing issues.

utils/internal-request[284-291]
utils/wait-for-internal-request[90-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The cleanup path parses InternalRequest names from `kubectl get ... -o custom-columns` and then relies on shell word-splitting. This is less robust than extracting names from structured output.

### Issue Context
`utils/wait-for-internal-request` already uses `kubectl ... -o json | jq ...` for label selectors, which is safer and consistent.

### Fix Focus Areas
- utils/internal-request[284-291]

### Suggested fix
Replace the custom-columns listing with a structured query, e.g.:
- `kubectl get internalrequest -l "${PIPELINERUN_UID_LABEL}=${PIPELINERUN_UID_VALUE}" -o json | jq -r '.items[].metadata.name'`
Then iterate line-by-line safely (e.g., `while IFS= read -r IR_NAME; do ...; done`).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread utils/internal-request Outdated
Comment thread utils/internal-request Outdated
@swickersh swickersh force-pushed the konflux-13012 branch 2 times, most recently from 47504be to 7f6eb7e Compare May 6, 2026 16:21
Comment thread utils/internal-request
seanconroy2021
seanconroy2021 previously approved these changes May 7, 2026

@johnbieren johnbieren left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about two managed tasks that both create internalRequests running in parallel for the same plr? Shouldn't we also use the taskRunID label?

@swickersh

Copy link
Copy Markdown
Contributor Author

What about two managed tasks that both create internalRequests running in parallel for the same plr? Shouldn't we also use the taskRunID label?

@johnbieren really good catch!
After your comment, I started to use taskrun-uid as the additional discriminator, but the problem is that every task calling internal-request would need to be updated to pass internal-services.appstudio.openshift.io/taskrun-uid=$(context.taskRun.uid), which is a larger refactor touching all the managed tasks in release-service-catalog.

Instead, I'm trying to have the internal-request script automatically stamp a internal-services.appstudio.openshift.io/pipeline-name label (from the --pipeline argument it already receives) onto every IR it creates. The cleanup selector then becomes pipelinerun-uid=X,pipeline-name=push-disk-images. Since each parallel task in a release pipeline calls a different internal pipeline, this scopes the cleanup to only IRs from this specific task and no changes needed in the calling tasks.

I'll run some tests to make sure this works.

@johnbieren

Copy link
Copy Markdown
Collaborator

What about two managed tasks that both create internalRequests running in parallel for the same plr? Shouldn't we also use the taskRunID label?

@johnbieren really good catch! After your comment, I started to use taskrun-uid as the additional discriminator, but the problem is that every task calling internal-request would need to be updated to pass internal-services.appstudio.openshift.io/taskrun-uid=$(context.taskRun.uid), which is a larger refactor touching all the managed tasks in release-service-catalog.

Instead, I'm trying to have the internal-request script automatically stamp a internal-services.appstudio.openshift.io/pipeline-name label (from the --pipeline argument it already receives) onto every IR it creates. The cleanup selector then becomes pipelinerun-uid=X,pipeline-name=push-disk-images. Since each parallel task in a release pipeline calls a different internal pipeline, this scopes the cleanup to only IRs from this specific task and no changes needed in the calling tasks.

I'll run some tests to make sure this works.

love it, good idea

@swickersh

Copy link
Copy Markdown
Contributor Author

What about two managed tasks that both create internalRequests running in parallel for the same plr? Shouldn't we also use the taskRunID label?

@johnbieren really good catch! After your comment, I started to use taskrun-uid as the additional discriminator, but the problem is that every task calling internal-request would need to be updated to pass internal-services.appstudio.openshift.io/taskrun-uid=$(context.taskRun.uid), which is a larger refactor touching all the managed tasks in release-service-catalog.
Instead, I'm trying to have the internal-request script automatically stamp a internal-services.appstudio.openshift.io/pipeline-name label (from the --pipeline argument it already receives) onto every IR it creates. The cleanup selector then becomes pipelinerun-uid=X,pipeline-name=push-disk-images. Since each parallel task in a release pipeline calls a different internal pipeline, this scopes the cleanup to only IRs from this specific task and no changes needed in the calling tasks.
I'll run some tests to make sure this works.

love it, good idea

@johnbieren
Manually tested the cleanup logic against prd-rh02. Created a fake orphaned IR with the matching labels, then ran the script to simulate a retry:

# Create orphaned IR with pipelinerun-uid and pipeline-name labels
$ oc create -f - <<EOF
apiVersion: appstudio.redhat.com/v1alpha1
kind: InternalRequest
metadata:
  generateName: push-disk-images-orphan-
  namespace: rhtap-releng-tenant
  labels:
    internal-services.appstudio.openshift.io/pipelinerun-uid: "test-plr-uid-12345"
    internal-services.appstudio.openshift.io/pipeline-name: "push-disk-images"
spec:
  pipeline:
    pipelineRef:
      resolver: git
      params:
        - name: url
          value: https://github.com/konflux-ci/release-service-catalog
        - name: revision
          value: development
        - name: pathInRepo
          value: pipelines/internal/push-disk-images/push-disk-images.yaml
  params: {}
EOF
internalrequest.appstudio.redhat.com/push-disk-images-orphan-dplwv created
# Simulate retry - script detects and cleans up the orphan before creating a new IR
$ internal-request --pipeline "push-disk-images" \
  -p taskGitUrl=https://github.com/konflux-ci/release-service-catalog \
  -p taskGitRevision=development \
  -l internal-services.appstudio.openshift.io/pipelinerun-uid=test-plr-uid-12345 \
  -s false
Found existing InternalRequests from prior attempts. Cleaning up...
Deleting InternalRequest push-disk-images-orphan-dplwv...
internalrequest.appstudio.redhat.com "push-disk-images-orphan-dplwv" deleted from rhtap-releng-tenant namespace
Cleanup complete. Waiting for PipelineRun cancellation to propagate...
InternalRequest 'push-disk-images-6ld6b' created.

The orphan was deleted and a fresh IR created as expected.

johnbieren
johnbieren previously approved these changes May 14, 2026
ach912
ach912 previously approved these changes May 17, 2026
@ach912

ach912 commented May 21, 2026

Copy link
Copy Markdown

@swickersh the Qodo AI review flagged potential bugs or security concerns in this PR.
Please acknowledge each finding by adding a comment with the following format:

## Qodo Response

**<Finding title>** — <Your explanation (false positive / fixed in commit / accepted risk)>
**<Another finding>** — <Your response>

The approval dashboard will pick up this section and show it to the approver alongside the review.
Replace the <…> placeholders with the actual finding titles from the Qodo review and your responses.

1 similar comment
@ach912

ach912 commented May 21, 2026

Copy link
Copy Markdown

@swickersh the Qodo AI review flagged potential bugs or security concerns in this PR.
Please acknowledge each finding by adding a comment with the following format:

## Qodo Response

**<Finding title>** — <Your explanation (false positive / fixed in commit / accepted risk)>
**<Another finding>** — <Your response>

The approval dashboard will pick up this section and show it to the approver alongside the review.
Replace the <…> placeholders with the actual finding titles from the Qodo review and your responses.

@davidmogar

Copy link
Copy Markdown
Contributor

Production Approval Record

Field Value
Action APPROVED
Reviewer @davidmogar
Timestamp 2026-05-21T12:29:10.426Z

@johnbieren johnbieren added this pull request to the merge queue May 21, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 21, 2026
@johnbieren johnbieren added this pull request to the merge queue May 21, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 21, 2026
@swickersh

Copy link
Copy Markdown
Contributor Author

Qodo Response

Cleanup ignores delete failures — Fixed in current code. Removed 2>/dev/null || true from both the kubectl get and kubectl delete calls, and changed set -e to set -eo pipefail at the top of the script. If kubectl delete ... --wait=true --timeout=60s fails or times out, the script now exits non-zero, preventing a new IR from being created while an old one may still be running.

sleep 5 without PipelineRun check — Accepted limitation with rationale. The associated PipelineRun runs on a separate internal cluster and is cancelled by the internal-services controller when it observes the IR deletion. That propagation is cross-cluster and asynchronous — there is no way to poll for it from within this script without privileged cross-cluster access that the managed task context does not have. The sleep 5 is a best-effort grace period to let propagation start; this is explicitly called out in a code comment. Fully reliable synchronous cancellation would require infrastructure outside the scope of this script. The companion finalizer change in internal-services#709 ensures the IR is removed from the API server before the new one is created.

Fragile IR list parsing — Fixed in current code. Replaced kubectl get ... -o custom-columns with -o json | jq -r '.items[].metadata.name' for structured machine-readable output, and replaced the word-splitting for IR_NAME in $EXISTING_IRS loop with while IFS= read -r IR_NAME; do ...; done, consistent with the pattern already used in utils/wait-for-internal-request.

cc @ach912

@swickersh

Copy link
Copy Markdown
Contributor Author

@johnbieren since we reverted konflux-ci/internal-services#709 should we wait to merge this?
afaict this wouldn't harm anything. But I don't want to break prod..

@johnbieren

Copy link
Copy Markdown
Collaborator

@swickersh gitlint is saying you are at 73 chars instead of 72 (idk why it passes on the PR but not in merge queue)

@johnbieren

Copy link
Copy Markdown
Collaborator

@johnbieren since we reverted konflux-ci/internal-services#709 should we wait to merge this? afaict this wouldn't harm anything. But I don't want to break prod..

🤷‍♂️ you tell me. If you think it will break things, can you move it to draft? If you think it is a good change, we can merge it. It will need to be followed by a catalog update which then has to go through the promotion process

@swickersh

Copy link
Copy Markdown
Contributor Author

@johnbieren

@swickersh gitlint is saying you are at 73 chars instead of 72 (idk why it passes on the PR but not in merge queue)

This is because the merge queue commit appends 7 chars (a space plus (# 734)) to the original commit.

We'd need logic in the gitlint command to limit 65 chars in the original commit title.
I'll edit the commit title for now

🤷‍♂️ you tell me. If you think it will break things, can you move it to draft? If you think it is a good change, we can merge it. It will need to be followed by a catalog update which then has to go through the promotion process

yea, sorry. That's on me to check closer. I checked the logic and walked through the scenario with Cursor but it steered me wrong on the 709 change that broke stage so I'm overly cautious now.

This is what cursor is telling me which seems to check out. It seems fine to merge prior to the finalizer change landing in internal-services

PR 734 deletes the orphaned IR via kubectl delete --wait=true, but without the finalizer from #709, the internal-services controller has nothing to hook into — the IR is just gone, and the associated internal PipelineRun continues running to natural completion. So duplicate concurrent runs can still occur until #709 lands.

This is not a regression: without 734, the orphaned IR and its PipelineRun would also keep running — there was never any cancellation mechanism in the current codebase. The kubectl delete here is intentionally setting up the hook point: once #709 is re-implemented, the controller will intercept that deletion via the finalizer and cancel the internal PipelineRun before the new IR is created.

@johnbieren

Copy link
Copy Markdown
Collaborator

@johnbieren

@swickersh gitlint is saying you are at 73 chars instead of 72 (idk why it passes on the PR but not in merge queue)

This is because the merge queue commit appends 7 chars (a space plus (# 734)) to the original commit.

We'd need logic in the gitlint command to limit 65 chars in the original commit title. I'll edit the commit title for now

🤷‍♂️ you tell me. If you think it will break things, can you move it to draft? If you think it is a good change, we can merge it. It will need to be followed by a catalog update which then has to go through the promotion process

yea, sorry. That's on me to check closer. I checked the logic and walked through the scenario with Cursor but it steered me wrong on the 709 change that broke stage so I'm overly cautious now.

This is what cursor is telling me which seems to check out. It seems fine to merge prior to the finalizer change landing in internal-services

PR 734 deletes the orphaned IR via kubectl delete --wait=true, but without the finalizer from #709, the internal-services controller has nothing to hook into — the IR is just gone, and the associated internal PipelineRun continues running to natural completion. So duplicate concurrent runs can still occur until #709 lands.

This is not a regression: without 734, the orphaned IR and its PipelineRun would also keep running — there was never any cancellation mechanism in the current codebase. The kubectl delete here is intentionally setting up the hook point: once #709 is re-implemented, the controller will intercept that deletion via the finalizer and cancel the internal PipelineRun before the new IR is created.

good catch about the gitlint thing. how annoying...

@swickersh

Copy link
Copy Markdown
Contributor Author

/retest

@swickersh swickersh added this pull request to the merge queue Jul 1, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 1, 2026
When the managed task is retried (e.g. due to timeout), the previous
InternalRequest and its PipelineRun on the internal cluster continue
running. The retry creates a new InternalRequest, leading to concurrent
operations that cause duplicate Pulp pushes and CGW "Record already
present" errors.

Before creating a new InternalRequest, the script now deletes any
existing InternalRequests with the same PipelineRun UID label. Combined
with the finalizer change in internal-services(PR 709), this ensures
the associated PipelineRun is cancelled before the new one starts.

Assisted-by: Cursor AI
Signed-off-by: Scott Wickersham <swickers@redhat.com>
@swickersh swickersh dismissed stale reviews from davidmogar, johnbieren, and ach912 via ec666a3 July 2, 2026 13:23
@swickersh

Copy link
Copy Markdown
Contributor Author

Merge ci failed due to python docstring requirements that were added.

@swickersh swickersh added this pull request to the merge queue Jul 2, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants