105 changes: 103 additions & 2 deletions .claude/agents/debug-package.md
@@ -182,9 +182,9 @@ Release `.status.conditions[].type == "Released"` values:

**Symptoms**: No snapshot found for the commit SHA. PipelineRun may be missing too (GC'd) or may show a failed status.

**Root cause**: On-push pipeline never ran (Pipelines-as-Code misconfiguration, webhook failure) or it failed (build error, resource limits). Old PipelineRuns get garbage-collected.
**Root cause**: On-push pipeline never ran (Pipelines-as-Code misconfiguration, webhook failure) or it failed (build error, resource limits, transient image pull errors). Old PipelineRuns get garbage-collected.

**Remediation**: Trigger a new build by making a new commit that touches the package file:
**Remediation**: If the original PipelineRun failed due to a transient error (e.g. registry 503, image pull backoff), retry it for the original commit — see [Retrying a Failed On-Push Pipeline](#retrying-a-failed-on-push-pipeline) below. If there was never a PipelineRun at all, trigger a new build by making a new commit that touches the package file:
```bash
hack/onboard.sh <package>
```
@@ -268,6 +268,106 @@ EOF
jq -r '.[]' pulp_pkgs.json | grep -i "<pkg-name-with-dots>"
```

## Retrying a Failed On-Push Pipeline

When an on-push PipelineRun fails due to a transient error (registry 503, image pull backoff, OOM on infra containers), you need to retry it **for the original commit**. The Konflux UI "Rerun" button does NOT work correctly — it re-resolves PaC template variables (`{{revision}}`) against the current HEAD of main, so `identify-packages` diffs the wrong commit pair and builds the wrong (or no) packages.

### Why "Rerun" uses the wrong commit

The on-push template (`.tekton/calunga-v2-index-main-push.yaml`) uses:
```yaml
- name: revision
  value: '{{revision}}'  # resolved from push webhook payload
- name: prev-packages-ref
  value: 'HEAD^'         # parent of the checked-out commit
```

On rerun, PaC resolves `{{revision}}` to the latest commit on main, not the original. Since `HEAD^` is relative to the checked-out revision, the entire diff window shifts.
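
The shift can be reproduced in a throwaway repository. The sketch below is illustrative only: the file names and commit messages are made up, and `git diff --name-only` against the parent commit stands in for whatever comparison `identify-packages` actually performs.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the files changed between a revision and its parent, i.e. the
# diff window selected by prev-packages-ref: 'HEAD^'.
changed_files() {
  local repo="$1" rev="$2"
  git -C "$repo" diff --name-only "${rev}^" "$rev"
}

# Hypothetical helper: create one file and commit it.
commit_file() {
  local repo="$1" file="$2"
  echo "$file" > "$repo/$file"
  git -C "$repo" add "$file"
  git -C "$repo" commit -q -m "add $file"
}

repo="$(mktemp -d)"
git -C "$repo" init -q
git -C "$repo" config user.email "demo@example.com"
git -C "$repo" config user.name "demo"
git -C "$repo" config commit.gpgsign false

commit_file "$repo" base.txt
commit_file "$repo" pkg-a.txt
original="$(git -C "$repo" rev-parse HEAD)"   # the commit whose build failed
commit_file "$repo" pkg-b.txt                 # main has moved on since then

original_window="$(changed_files "$repo" "$original")"
rerun_window="$(changed_files "$repo" HEAD)"
echo "original commit diff: $original_window"
echo "rerun (HEAD) diff:    $rerun_window"

rm -rf "$repo"
```

The two windows list different files, which is why a rerun against the current HEAD of main builds a different package set than the failed run would have.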

### Correct retry procedure

Extract the params and metadata from the original PipelineRun (from kubearchive if GC'd), combine with the **current** pipeline definition from the repo, and create a new run.

**Important**: Do NOT reuse the `pipelineSpec` from the archived PipelineRun. It contains old task bundle references (pinned by SHA digest) that may no longer be in the EC trusted task list, causing `required_untrusted_task_found` violations. Always use the current `.tekton/build-pipeline.yaml`.

```bash
FAILED_PLR="<pipelinerun-name>"

# 1. Fetch the archived PipelineRun
# Try live cluster first, fall back to kubearchive
oc get pipelinerun "$FAILED_PLR" -n calunga-tenant -o json > /tmp/archived-plr.json 2>/dev/null \
  || kubectl ka get pipelineruns.v1.tekton.dev "$FAILED_PLR" -n calunga-tenant -o json \
  | jq '.items[0]' > /tmp/archived-plr.json

# 2. Verify the commit
jq -r '{
  revision: (.spec.params[] | select(.name == "revision") | .value),
  title: .metadata.annotations["pipelinesascode.tekton.dev/sha-title"]
}' /tmp/archived-plr.json

# 3. Convert the current pipeline definition to JSON
python3 -c "
import yaml, json, sys
with open('.tekton/build-pipeline.yaml') as f:
    pipeline = yaml.safe_load(f)
json.dump(pipeline['spec'], sys.stdout)
" > /tmp/current-pipeline-spec.json

# 4. Build the retry PipelineRun (archived params + current pipeline spec)
jq -n \
  --slurpfile archived /tmp/archived-plr.json \
  --slurpfile pipelineSpec /tmp/current-pipeline-spec.json \
  '{
    apiVersion: $archived[0].apiVersion,
    kind: $archived[0].kind,
    metadata: {
      generateName: "calunga-v2-index-main-on-push-retry-",
      namespace: $archived[0].metadata.namespace,
      annotations: (
        $archived[0].metadata.annotations | with_entries(
          select(.key | test("^(build\\.appstudio|test\\.appstudio|pipelinesascode\\.tekton\\.dev/(cancel-in-progress|max-keep-runs|on-cel-expression|original-prname|repository|sha|sha-title|sha-url|event-type|branch|source-branch|source-repo-url|repo-url|url-org|url-repository|git-provider|installation-id))"))
        )
        # Ensure ignore-supersession is set (older PLRs predate this annotation)
        + {"test.appstudio.openshift.io/ignore-supersession": "true"}
      ),
      labels: (
        $archived[0].metadata.labels | with_entries(
          select(.key | test("^(appstudio|pipelines\\.appstudio|pipelinesascode\\.tekton\\.dev/(original-prname|repository|sha|event-type|url-org|url-repository|cancel-in-progress))|tekton\\.dev/pipeline"))
        )
      )
    },
    spec: {
      params: $archived[0].spec.params,
      pipelineSpec: $pipelineSpec[0],
      taskRunTemplate: $archived[0].spec.taskRunTemplate,
      taskRunSpecs: $archived[0].spec.taskRunSpecs,
      workspaces: [
        {
          name: "git-auth",
          secret: {
            secretName: "git-auth-dummy"
          }
        }
      ]
    }
  }' > /tmp/retry-plr.json

# 5. Apply
oc create -f /tmp/retry-plr.json -n calunga-tenant

# 6. Verify it started
oc get pipelinerun -n calunga-tenant -l "pipelinesascode.tekton.dev/sha=$(jq -r '.spec.params[] | select(.name == "revision") | .value' /tmp/archived-plr.json)" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[0].reason}{"\n"}{end}'
```
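
Before applying the generated manifest, it can be worth sanity-checking it. The following is a minimal sketch, not part of the original procedure: `check_retry_plr` is a hypothetical helper, the sample JSON is invented, and the full annotation keys `pipelinesascode.tekton.dev/check-run-id` and `pipelinesascode.tekton.dev/git-auth-secret` are assumed from the PaC prefix used in the filter above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeed only if the PipelineRun in file $1 targets revision $2, carries
# ignore-supersession, and no longer has the stale per-run PaC annotations.
check_retry_plr() {
  local file="$1" expected_sha="$2"
  jq -e --arg sha "$expected_sha" '
    (.metadata.annotations // {}) as $a
    | ([.spec.params[] | select(.name == "revision") | .value] == [$sha])
      and ($a["test.appstudio.openshift.io/ignore-supersession"] == "true")
      and ($a | has("pipelinesascode.tekton.dev/check-run-id") | not)
      and ($a | has("pipelinesascode.tekton.dev/git-auth-secret") | not)
  ' "$file" > /dev/null
}

# Invented sample manifests; in practice pass /tmp/retry-plr.json instead.
good="$(mktemp)"
bad="$(mktemp)"
cat > "$good" <<'EOF'
{"metadata":{"annotations":{
   "test.appstudio.openshift.io/ignore-supersession":"true",
   "pipelinesascode.tekton.dev/sha":"abc123"}},
 "spec":{"params":[{"name":"revision","value":"abc123"}]}}
EOF
cat > "$bad" <<'EOF'
{"metadata":{"annotations":{
   "pipelinesascode.tekton.dev/check-run-id":"42"}},
 "spec":{"params":[{"name":"revision","value":"abc123"}]}}
EOF

for f in "$good" "$bad"; do
  if check_retry_plr "$f" abc123; then
    echo "$f: ready to apply"
  else
    echo "$f: do not apply"
  fi
done
```

Against the real files, the expected revision would come from the archived run, for example `check_retry_plr /tmp/retry-plr.json "$(jq -r '.spec.params[] | select(.name == "revision") | .value' /tmp/archived-plr.json)"`.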

### Key details

- **Current pipelineSpec**: The retry uses the pipeline definition from the current `.tekton/build-pipeline.yaml`, NOT the inlined spec from the archived run. Archived specs contain stale task bundle digests that fail EC trusted-task checks.
- **git-auth-dummy**: PaC creates ephemeral `pac-gitauth-*` secrets per run — they're deleted after the run. The `git-auth-dummy` secret (empty credentials) works because the repo is public.
- **Stripped annotations**: The `check-run-id` and `git-auth-secret` PaC annotations are intentionally removed to prevent PaC from trying to update a stale GitHub check or use a deleted secret. The remaining PaC annotations (`sha`, `repository`, `original-prname`, etc.) are kept so the Integration Service can create a Snapshot for the correct commit.
- **ignore-supersession**: The retry always injects `test.appstudio.openshift.io/ignore-supersession: "true"`. Archived PLRs from before commit `adbf127f` won't have this annotation, and without it the new snapshot can get superseded ("Released in newer Snapshot"), requiring a manual Release CR.
- **Alternative**: If you have webhook admin access on the GitHub repo, you can redeliver the original push webhook from Settings > Webhooks > Recent Deliveries. This is simpler but requires elevated access.
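
To see which task bundle references have drifted between an archived run and the current pipeline, the two specs can be compared directly. This is a sketch under assumptions: it presumes tasks are referenced through the bundles resolver with a `taskRef` param named `bundle`, `list_bundles` is a hypothetical helper, and the sample specs and `quay.io/example/...` references are invented.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Read a pipeline spec (JSON) on stdin and print "task<TAB>bundle-ref" lines.
# Assumption: each task uses taskRef.params with an entry named "bundle".
list_bundles() {
  jq -r '.tasks[]
         | .name as $task
         | (.taskRef.params // [])[]
         | select(.name == "bundle")
         | "\($task)\t\(.value)"' | sort
}

# Invented sample data. With real files, feed
#   jq '.spec.pipelineSpec' /tmp/archived-plr.json   and
#   /tmp/current-pipeline-spec.json
# into list_bundles instead.
archived_spec='{"tasks":[{"name":"git-clone","taskRef":{"resolver":"bundles","params":[{"name":"bundle","value":"quay.io/example/git-clone@sha256:aaa"}]}}]}'
current_spec='{"tasks":[{"name":"git-clone","taskRef":{"resolver":"bundles","params":[{"name":"bundle","value":"quay.io/example/git-clone@sha256:bbb"}]}}]}'

# Lines that appear only in the archived spec are references that have changed.
stale="$(comm -23 <(list_bundles <<<"$archived_spec") <(list_bundles <<<"$current_spec"))"
echo "$stale"
```

Any line printed is a bundle reference the archived run pinned that the current pipeline no longer uses. Whether it would actually fail the EC trusted-task check still depends on the trusted task list itself.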

## Bulk Operations

### Find all packages missing from Pulp
@@ -374,6 +474,7 @@ done
- **Snapshots may have multiple releases.** Always query as an array and iterate: `jq '[.items[] | select(...)]'` then `jq -c '.[]' | while read -r rel`.
- **Timed-out releases often succeed eventually.** A release stuck at "Progressing" for 10+ minutes usually finishes — it's just slow, not broken. Check back later.
- **PipelineRuns are garbage-collected.** After ~5 days, old PipelineRuns are deleted. The snapshot still exists and references the build, but you can't inspect the PipelineRun directly.
- **Do NOT use the Konflux UI "Rerun" button for on-push pipelines.** It re-resolves `{{revision}}` to the latest commit on main, causing `identify-packages` to diff the wrong commits. Use the manual retry procedure instead.
- **Batch onboarding causes "Released in newer Snapshot".** When many packages are committed in quick succession, only the latest snapshot gets auto-released. All earlier snapshots (each containing a unique package build) need manual Release CRs. This has been mitigated by adding `test.appstudio.openshift.io/ignore-supersession: "true"` to the on-push PipelineRun annotation (commit `adbf127f`), but older snapshots from before the fix may still be affected.
- **Name normalization is real.** Always check Pulp with both dash and dot variants. Common prefixes: `backports`, `jaraco`, `zope`, `ruamel`.
- **Release CR template requires `releasePlan: calunga`.** This references the ReleasePlan CR in the namespace. The `gracePeriodDays: 7` field controls how long the release artifacts are retained.
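
For the name-normalization point above, a small helper can generate both spellings to search for. This is a minimal sketch: `name_variants` is a hypothetical function, and it only swaps every separator at once, so mixed dash/dot names are not enumerated.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the lowercase all-dash and all-dot spellings of a package name,
# one per line, without duplicates.
name_variants() {
  local name dashed
  name="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  dashed="${name//./-}"
  printf '%s\n' "$dashed" "${dashed//-/.}" | awk '!seen[$0]++'
}

name_variants "Backports.Zoneinfo"
name_variants "requests"

# Usage sketch against the package list used elsewhere in this document:
#   jq -r '.[]' pulp_pkgs.json | grep -i -F -f <(name_variants "<package>")
```

`grep -F` treats the variants as fixed strings, so the dot is matched literally rather than as a regex wildcard.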
2 changes: 1 addition & 1 deletion .tekton/build-pipeline.yaml
@@ -67,7 +67,7 @@ spec:
name: prev-packages-ref
type: string
- default: "1"
descripton: Git clone depth
description: Git clone depth
name: git-clone-depth
type: string
- name: enable-cache-proxy