| description |
|---|
| Analyze Jenkins job failures (stage-testing) using AI |
You are helping the user analyze failures from a Jenkins job run (typically stage-testing jobs used in z-stream release testing).
Note: Image-consistency-check has been migrated from Jenkins to Prow. For Prow job failures, use `/ci:analyze-failures` instead.
The user has provided a Jenkins job URL: {{args}}
Jenkins jobs used in OAR z-stream release workflow:
- stage-testing (Stage-Pipeline): Runs E2E tests for optional operators shipped with OpenShift
Note: image-consistency-check has been migrated to Prow and is no longer a Jenkins job.
Each job has its own custom console log format. This command fetches the raw console log and analyzes it based on the job type.
Expected URL patterns:
https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/{job_name}/{build_number}/
Extract:
- Base URL: e.g., https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com
- Job name: e.g., zstreams/Stage-Pipeline
- Build number: e.g., 3436
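A minimal parsing sketch (the variable names and sed expressions are illustrative, not part of the job itself):

```bash
# Hypothetical helper: split a Jenkins build URL into its components.
JENKINS_URL="https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/zstreams/job/Stage-Pipeline/3436/"

BASE_URL=$(echo "$JENKINS_URL" | sed -E 's|(https://[^/]+)/.*|\1|')
# Drop the base and leading "job/", drop the trailing build number, collapse "/job/" separators
JOB_PATH=$(echo "$JENKINS_URL" | sed -E 's|^https://[^/]+/job/||; s|/[0-9]+/?$||; s|/job/|/|g')
BUILD_NUMBER=$(echo "$JENKINS_URL" | grep -oE '[0-9]+/?$' | tr -d '/')

echo "$BASE_URL"       # https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com
echo "$JOB_PATH"       # zstreams/Stage-Pipeline
echo "$BUILD_NUMBER"   # 3436
```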
Jenkins provides a JSON API to get job parameters (more reliable than parsing console logs).
Construct the API URL:
For simple job names:
{base_url}/job/{job_name}/{build_number}/api/json
For folder-based job names (e.g., zstreams/Stage-Pipeline):
{base_url}/job/zstreams/job/Stage-Pipeline/{build_number}/api/json
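Continuing the sketch above, the folder-aware API URL can be built by re-inserting `job/` before each path segment (assumed variables carried over from the previous sketch):

```bash
# zstreams/Stage-Pipeline → job/zstreams/job/Stage-Pipeline
API_URL="${BASE_URL}/job/$(echo "$JOB_PATH" | sed 's|/|/job/|g')/${BUILD_NUMBER}/api/json"
echo "$API_URL"
# https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/zstreams/job/Stage-Pipeline/3436/api/json
```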
Fetch job metadata and parameters:
curl -s "{api_url}"Extract key parameters from JSON response:
The response contains actions array with a hudson.model.ParametersAction object containing all build parameters:
```json
{
"actions": [
{
"_class": "hudson.model.ParametersAction",
"parameters": [
{
"_class": "hudson.model.StringParameterValue",
"name": "VERSION",
"value": "v4.20"
},
{
"_class": "hudson.model.StringParameterValue",
"name": "PAYLOAD_URL",
"value": "quay.io/openshift-release-dev/ocp-release:4.20.1-x86_64"
},
{
"_class": "hudson.model.StringParameterValue",
"name": "SHIPMENT_MR_ID",
"value": "189"
}
]
}
],
"result": "SUCCESS" | "FAILURE" | "UNSTABLE" | "ABORTED",
"building": false,
"duration": 5668839,
"timestamp": 1761310656601
}
```

Key fields to extract:
- `result`: Job final status (SUCCESS, FAILURE, UNSTABLE, ABORTED)
- `building`: Whether the job is still running
- `duration`: Job execution time in milliseconds
- `timestamp`: Job start time
- `parameters`: All build parameters (PAYLOAD_URL, SHIPMENT_MR_ID, VERSION, etc.)
Parse parameters using jq (if available):
# Get PAYLOAD_URL
curl -s "{api_url}" | jq -r '.actions[] | select(._class=="hudson.model.ParametersAction") | .parameters[] | select(.name=="PAYLOAD_URL") | .value'
# Get SHIPMENT_MR_ID
curl -s "{api_url}" | jq -r '.actions[] | select(._class=="hudson.model.ParametersAction") | .parameters[] | select(.name=="SHIPMENT_MR_ID") | .value'
# Get job result
curl -s "{api_url}" | jq -r '.result'Alternative: Parse using grep/sed (if jq not available):
# Get all parameters
curl -s "{api_url}" | grep -o '"name":"[^"]*","value":"[^"]*"'Jenkins console logs are publicly accessible via /consoleText endpoint (no authentication required).
Construct the console text URL:
For simple job names:
{base_url}/job/{job_name}/{build_number}/consoleText
For folder-based job names (e.g., zstreams/Stage-Pipeline):
{base_url}/job/zstreams/job/Stage-Pipeline/{build_number}/consoleText
Fetch the log:
curl -s "{console_text_url}"Token Management:
- Console logs can be 5K-20K+ lines
- Focus on last 1000-2000 lines for most relevant information:
curl -s "{console_text_url}" | tail -n 2000
- For errors/failures specifically:
curl -s "{console_text_url}" | grep -i "error\|fail\|exception" | tail -n 500
This job verifies images from release payload against shipment data:
- Iterates all images from release payload (PAYLOAD_URL parameter)
- Compares with shipment data (YAML files in GitLab MR specified by SHIPMENT_MR_ID)
- Multi-arch images are compared using the default arch `linux/amd64`
- Reports mismatches if images cannot be found in shipment data
Key console log sections:
#Mismatch between payload and Advisory
CRITICAL - Lists images not found in shipment data or registries.
Example format:
```
#Mismatch between payload and Advisory
--------------------------------------
Image    status    Justification
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07f75fe...    FAIL    Couldn't find this image...
pull_spec: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07f75fe... != None
digest: sha256:07f75fe65ee24835d18df82cb11c2bd28424837e0bea3f9534a4ebd40706c409
src_commit ID: 1db726a940d5ec150fd185a215f1368990653082
tag: v4.20.0-202510211040.p0.g1db726a.assembly.stream.el9
```
How the script works (important context):
- Uses `oc image info --filter-by-os linux/amd64` to compare images
- Comparison is based on digest, listdigest, or vcs-ref
- First checks if the image exists in registry.redhat.io (already shipped)
- Then queries the pyxis.engineering.redhat.com API to check if the image is signed/shipped
- Only reports FAIL if the image is NOT found in: shipment MR AND registry.redhat.io AND pyxis
MANDATORY: Validate each reported image failure
For each image in the "#Mismatch between payload and Advisory" section, you MUST:
1. Search the console log for the image SHA (first 8-12 chars of the digest):

```
# The script logs: #checking quay.io/.../sha256:07f75fe...
# Search for this pattern to see if the image was processed
```

2. Check the image processing result:
- Look for lines AFTER "#checking {image_sha}"
- Check for:
  - `[INFO]` log level = successful processing
  - `[ERROR]` or `[WARNING]` = issues during processing
  - Errors like: `manifest unknown`, `unauthorized`, `connection refused`
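A minimal search sketch for steps 1-2 (the grep patterns and variable names are illustrative; the SHA comes from the mismatch list):

```bash
CONSOLE_LOG=/tmp/console.txt
SHA=07f75fe6   # first 8-12 chars of the digest being validated

curl -s "{console_text_url}" -o "$CONSOLE_LOG"
# Show the "#checking ..." line for this image plus the 20 lines that follow it
grep -A 20 "#checking .*${SHA}" "$CONSOLE_LOG"
# Count INFO vs ERROR/WARNING lines that mention this SHA
grep "${SHA}" "$CONSOLE_LOG" | grep -c "\[INFO\]"
grep "${SHA}" "$CONSOLE_LOG" | grep -cE "\[(ERROR|WARNING)\]"
```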
3. CRITICAL: Check for multi-arch manifest parsing errors:

Multi-arch images can fail manifest parsing with transient errors. This is the MOST COMMON false positive.

Detection pattern:
- Search the console for `[WARNING] root: OpenshiftImage: oc image info` with ANY error (Return code: 1)
- Common error patterns include (but are not limited to):
  - `error: unable to read image ... unable to retrieve source image ... manifest #N from manifest list`
  - `error unmarshalling content: invalid character 'H' looking for beginning of value`
  - `error: unauthorized: authentication required`
  - `error: manifest unknown`
- Any other `oc image info` failure on multi-arch images
Verification workflow (MANDATORY):
For each WARNING message with an `oc image info` error found:

a. Extract the multi-arch image SHA from the warning:

```
# Example: [WARNING] root: OpenshiftImage: oc image info quay.io/.../sha256:4b7314bc...
# Extract: sha256:4b7314bc...
```

b. Check if this multi-arch image has arch-specific variants:

```bash
# Run locally to get the manifest list
oc image info quay.io/redhat-user-workloads/ocp-art-tenant/art-images@sha256:4b7314bc... --filter-by-os linux/amd64
```

c. Extract the linux/amd64 image digest from the manifest:

```
# Output will show:
# Name: quay.io/.../sha256:4b7314bc...
# Digest: sha256:07f75fe6... ← This is the arch-specific image
```

d. Compare with the mismatched image SHA:
- If the arch-specific digest (e.g., `sha256:07f75fe6...`) matches the image in the mismatch list
- Then the mismatch is confirmed to be caused by the multi-arch manifest parsing error
- Categorization: Transient infrastructure issue → RETRY
How this manifests:
- Script tries: `oc image info quay.io/.../sha256:4b7314bc...` (multi-arch image from shipment MR)
- Gets error: `error unmarshalling content: invalid character 'H'`
- Script fallback: `oc image info --filter-by-os linux/amd64 quay.io/.../sha256:4b7314bc...`
- Extracts arch-specific digest: `sha256:07f75fe6...`
- This arch-specific image appears in the mismatch list because the parent check failed
Example from real failure:
```
[WARNING] root: OpenshiftImage: oc image info quay.io/redhat-user-workloads/ocp-art-tenant/art-images@sha256:4b7314bc...
Return code: 1
error: unable to read image sha256:4b7314bc...: unable to retrieve source image sha256:4b7314bc...: error unmarshalling content: invalid character 'H' looking for beginning of value

# Verify by checking the arch-specific variant:
$ oc image info quay.io/redhat-user-workloads/ocp-art-tenant/art-images@sha256:4b7314bc... --filter-by-os linux/amd64
Name:   quay.io/redhat-user-workloads/ocp-art-tenant/art-images@sha256:4b7314bc...
Digest: sha256:07f75fe65ee24835d18df82cb11c2bd28424837e0bea3f9534a4ebd40706c409

# Then in mismatch list:
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:07f75fe...    FAIL    Couldn't find this image...

# ✅ MATCH CONFIRMED: sha256:07f75fe... is the linux/amd64 variant of sha256:4b7314bc...
```

Categorization (only if SHA match confirmed):
- This is a transient infrastructure issue, NOT a payload failure
- Recommendation: RETRY JOB (not escalate to ART)
- The image exists in shipment MR but manifest retrieval failed temporarily
Important: If the arch-specific SHA does NOT match the mismatch image, this is NOT a multi-arch manifest parsing issue. Continue investigating other root causes.
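A minimal verification sketch for steps a-d (placeholder SHAs; parsing the `Digest:` line of `oc image info` text output with awk is an assumption based on the output shown above):

```bash
MULTIARCH_SHA="sha256:4b7314bc..."   # from the [WARNING] oc image info line
MISMATCH_SHA="sha256:07f75fe6..."    # from the "#Mismatch between payload and Advisory" section

# Resolve the linux/amd64 digest of the multi-arch image
AMD64_DIGEST=$(oc image info "quay.io/redhat-user-workloads/ocp-art-tenant/art-images@${MULTIARCH_SHA}" \
  --filter-by-os linux/amd64 | awk '/^Digest:/ {print $2; exit}')

if [ "$AMD64_DIGEST" = "$MISMATCH_SHA" ]; then
  echo "MATCH: mismatch caused by multi-arch manifest parsing error → RETRY"
else
  echo "NO MATCH: investigate other root causes"
fi
```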
4. Understand other FALSE POSITIVE scenarios:
- Already shipped: Image found in registry.redhat.io but not in the MR
  - Script should say: "This image isn't attached to advisories, but can be found in registry.redhat.io"
  - This is NOT a failure - the image was shipped in a previous release
- Successful check: Found `#checking ...sha256:07f75fe` followed by `[INFO]` with no errors
  - The image was processed successfully
  - The mismatch report is a script error
5. Identify GENUINE FAILURES:
- Missing from shipment MR: Image not in YAML files AND not in registry.redhat.io
  - Justification: "Couldn't find this image on either advisory or registry.redhat.io"
  - Pull errors or oc command failures for this specific SHA
  - AND no multi-arch manifest parsing warnings for this image
- Network/pull errors: Errors pulling/checking the image
  - Can be transient (retry) or a genuinely missing image
  - Check whether the error is multi-arch manifest parsing (retry) vs a genuine pull failure (investigate)
6. Root cause categories:
- Shipment data issue: ART forgot to include image in MR → ESCALATE TO ART
- Already shipped: Image in previous release, not needed in MR → SAFE TO IGNORE
- Network/transient: Temporary infrastructure issue → RETRY JOB
- Script error: False positive from checker → FILE BUG FOR SCRIPT
Analysis Output Template:
## Jenkins Job Analysis: image-consistency-check
**Job Information**:
- URL: {jenkins_url}
- Console Log: {console_text_url}
- Build: #{build_number}
- Status: {SUCCESS|FAILURE|UNSTABLE} {emoji}
- Payload: {payload_url}
- Advisories/MR: {errata_numbers or shipment_mr_id}
- Version: {version}
**Analysis Results**:
### 1. Image Consistency Check
{Parse #Image Check Result section}
- Status: {✅ PASS | ❌ FAIL}
- Details: {list inconsistencies if any, otherwise "No image inconsistencies found"}
**Overall Assessment**:
**Critical Issues** ({count}):
{List blockers that MUST be resolved}
**Warnings** ({count}):
{List non-blocking issues for awareness}
**Recommendation**: {✅ PASS | ❌ FAIL | 🔄 RETRY | ⚠️ NEEDS REVIEW}
**Rationale**: {Explain why job should pass/fail/retry}
- {Reason 1}
- {Reason 2}
**Special cases**:
- If multi-arch manifest parsing errors detected → 🔄 RETRY (not escalate to ART)
- If images already shipped in registry.redhat.io → ✅ PASS (safe to ignore)
- If genuine missing images (not in MR AND not in registry.redhat.io AND no manifest errors) → ❌ FAIL (escalate to ART)
**Next Actions**:
1. {Action if needed}
2. {Action if needed}

This job orchestrates E2E testing for optional operators shipped with OpenShift:
Pipeline Flow:
- Verify Metadata - Validates metadata advisory (may be skipped for Konflux flow)
- Install Cluster - Triggers `Flexy-install` child job to provision OCP cluster
- Create Test Run - Prepares test execution environment
- Non-admin and non-destructive tests - Runs 3 parallel `Runner` child jobs
- Normal ginkgo tests - Triggers `ginkgo-test` child job
- Admin tests - Runs 3 parallel `Runner` child jobs
- Destructive tests - Runs 1 `Runner` child job
- Disruptive ginkgo tests - Triggers `ginkgo-test` child job
- Check Results - Validates test results and sends failure notifications
- Cleanup - Triggers `Flexy-destroy` child job to tear down cluster
Key console log patterns:
Pipeline stages (look for):
[Pipeline] stage
[Pipeline] { (Stage Name)
Each stage shows:
- START marker: `[Pipeline] stage`
- Stage name: `[Pipeline] { (Install Cluster)`
- End marker: `[Pipeline] // stage`
- May be skipped: `Stage "Verify Metadata" skipped due to when conditional`
Child jobs (critical for analysis):
Scheduling project: ocp-common » Flexy-install
Starting building: ocp-common » Flexy-install #360950
Build ocp-common » Flexy-install ocpqe-jenkins-bot-360950 completed: SUCCESS
Format:
- Scheduling: When job is queued
- Starting building: When job execution begins (includes build number)
- Build completed: Final status (SUCCESS/FAILURE/UNSTABLE)
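A minimal sketch for extracting child job records from the parent console (grep patterns are illustrative):

```bash
# Map child builds: job name, build number, and final status
curl -s "{console_text_url}" | grep "Starting building: ocp-common"
curl -s "{console_text_url}" | grep -E "completed: (SUCCESS|FAILURE|UNSTABLE|ABORTED)"
```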
Child Job Types:
- `Flexy-install` - OCP cluster installation (1 job)
- `Runner` - Cucumber test runner (7 parallel jobs total)
  - IMPORTANT: Runner jobs use Cucumber tests from the verification-tests or cucushift repositories
  - NOT from openshift-tests-private (which is for Ginkgo tests only)
- `ginkgo-test` - Ginkgo test runner (2 jobs)
  - IMPORTANT: ginkgo-test jobs use Ginkgo tests from the openshift-tests-private repository
  - NOT from verification-tests/cucushift (those are for Cucumber tests)
- `Flexy-destroy` - Cluster cleanup (1 job)
Job Parameters (search in first 300 lines):
PULL_SPEC=quay.io/openshift-release-dev/ocp-release:4.15.59-x86_64
- Indicates which OpenShift release payload is being tested
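A quick way to confirm the tested payload (illustrative; the value may also come from the job parameters API shown earlier):

```bash
curl -s "{console_text_url}" | head -n 300 | grep -E "PULL_SPEC|PAYLOAD_URL"
# e.g. PULL_SPEC=quay.io/openshift-release-dev/ocp-release:4.15.59-x86_64
```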
How to Analyze Failures:
When stage-pipeline fails, follow these steps:
1. Check Bundle Image Availability (ALWAYS RUN THIS FIRST):
Stage testing failures may be caused by missing operator bundle images in registry.stage.redhat.io.
Extract OCP version from job parameters:
```bash
# Get PULL_SPEC or PAYLOAD_URL parameter (e.g., "quay.io/openshift-release-dev/ocp-release:4.20.1-x86_64")
# Extract major.minor version (e.g., "4.20")
VERSION=$(curl -s "{api_url}" | jq -r '.actions[] | select(._class=="hudson.model.ParametersAction") | .parameters[] | select(.name=="PULL_SPEC" or .name=="PAYLOAD_URL") | .value' | grep -oE '[0-9]+\.[0-9]+' | head -1)
```
Run bundle availability check:
```bash
# From release-tests repository root
./tools/check_stage_bundle_images.sh ${VERSION}
```
Include in analysis output:
If bundles are missing (exit code 1):
## Bundle Image Availability Check

**Version**: {version}
**Status**: ❌ Missing bundles detected

**Missing Operator Bundles** ({count}):
- {operator-name-1}
- {operator-name-2}
- {operator-name-3}
...

**Report**: {/tmp/stage-bundle-check-YYYYMMDD-HHMMSS/MISSING_REPORT.txt}

**Impact**: These operator bundle images are missing from registry.stage.redhat.io. This may cause operator installation failures if tests require these operators.

**Action Required**:
- Share MISSING_REPORT.txt with ART team
- File ticket to mirror missing bundles to stage registry
If all bundles available (exit code 0):
## Bundle Image Availability Check

**Version**: {version}
**Status**: ✅ All latest bundles available

All {count} operator bundle images are properly mirrored to stage registry.
IMPORTANT:
- Do NOT attempt automatic correlation between missing bundles and test failures
- Report missing bundles as a separate finding
- Continue with normal test failure analysis below
- Let user determine if missing bundles are related to specific test failures
2. Identify Failed Child Job:
- Search for `completed: FAILURE` or `completed: UNSTABLE`
- Example: `Build ocp-common » Runner #1116884 completed: FAILURE`
- Note the child job name and build number
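For example (illustrative grep):

```bash
curl -s "{console_text_url}" | grep -E "completed: (FAILURE|UNSTABLE)"
# e.g. Build ocp-common » Runner #1116884 completed: FAILURE
```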
3. Determine Stage Where Failure Occurred:
- Find the stage containing the failed child job
- Stages: "Install Cluster", "non-admin and non-destructive tests", "normal ginkgo tests", "admin tests", "destructive tests", "disruptive ginkgo tests"
4. Check Child Job Logs (CRITICAL for root cause):
- Child job URLs follow the pattern: https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/{job_type}/{build_number}/consoleText
- For `Flexy-install`/`Flexy-destroy`: Infrastructure/cluster provisioning issues
- For `Runner`: Cucumber test failures - MUST check the child job console for the failed scenario
- For `ginkgo-test`: Ginkgo E2E test failures - MUST check the child job console for the failed test
5. Analyze Cucumber Test Failures (Runner jobs):

Child Job Console Analysis:
- Fetch the child job console: `curl -s "https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/Runner/{build_number}/consoleText"`
- Search for failure markers:

```
Scenario: <scenario_name>   # Cucumber scenario that failed
  expected: <expected_value>
  got: <actual_value>

Failing Scenarios:
cucumber features/<feature_file>.feature:<line_number> # <scenario_name>
```
Identify Failed Test File:
- Failed scenario format: `cucumber features/logging/elasticsearch.feature:45`
- Extract the feature file path and scenario name
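A minimal sketch for pulling the failing scenarios out of a Runner child console (the build number is a placeholder):

```bash
RUNNER_BUILD=1116884   # replace with the actual child build number
curl -s "https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/Runner/${RUNNER_BUILD}/consoleText" \
  | grep -A 20 "Failing Scenarios:"
# Output lines look like: cucumber features/logging/elasticsearch.feature:45 # <scenario_name>
```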
Locate Test Source Code:
- Cucumber tests come from TWO repositories:
- https://github.com/openshift/verification-tests/ (public test repo)
- https://github.com/openshift/cucushift (internal QE repo - may need auth)
- The feature file path from the error corresponds to `features/<path>` in the repo
- Example: `features/logging/elasticsearch.feature` → clone the repo and find this file
Analyze Test Code:
- Use Read tool to examine the feature file at the failed line number
- Look for the scenario definition (starts with `Scenario:`)
- Examine step definitions to understand what the test is validating
- Common patterns:
- Resource creation/deletion failures
- Assertion failures (expected vs actual values)
- Timeout waiting for resource state
- API call failures
Assessment:
- If test code has issues (race condition, flaky selector, hardcoded wait): Test Automation Issue
- If error indicates product functionality broken: Product Bug
- If timeout or resource not ready: Infrastructure or Timing Issue
6. Analyze Ginkgo Test Failures (ginkgo-test jobs):

Child Job Console Analysis:
- Fetch the child job console: `curl -s "https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/ginkgo-test/{build_number}/consoleText"`
- Search for failure markers (similar to Prow Ginkgo failures):

```
• Failure [<duration>]
[FAILED] <test_name>
Expected <condition>
error: <error_message>
```
Identify Failed Test:
- Extract test name and error message
- Look for the case ID pattern in the test name: `-{digits}-` (e.g., `OCP-12345` or a test with `-64296-`)
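A minimal sketch for collecting failed Ginkgo test names and their case IDs (build number and regex are illustrative):

```bash
GINKGO_BUILD=123456   # replace with the actual child build number
curl -s "https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/ginkgo-test/${GINKGO_BUILD}/consoleText" \
  | grep "\[FAILED\]" > /tmp/failed-ginkgo-tests.txt

# Extract numeric case IDs such as -64296- from the failed test names
grep -oE '\-[0-9]{4,6}\-' /tmp/failed-ginkgo-tests.txt | tr -d '-' | sort -u
```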
Locate Test Source Code:
- Repository: https://github.com/openshift/openshift-tests-private/
- CRITICAL: Check out the correct release branch for the z-stream version
  - For 4.19.z releases: `git checkout release-4.19`
  - For 4.20.z releases: `git checkout release-4.20`
  - Pattern: `release-{major.minor}` (use the z-stream version's major.minor)
- Setup:

```bash
# Check if repo exists
if [ -d "../openshift-tests-private" ]; then
  cd ../openshift-tests-private && git fetch origin
else
  cd .. && git clone git@github.com:openshift/openshift-tests-private.git
fi
# Checkout correct branch (extract version from PULL_SPEC)
# e.g., PULL_SPEC=quay.io/.../4.15.59-x86_64 → release-4.15
git checkout release-{X.Y} && git pull
```
Search for Test Code:
- If a case ID is found (e.g., `64296`): search for the `-{case_id}-` pattern: `grep -r "\-64296\-" test/extended/`
- If no case ID: search by test description keywords
- Test files are in the `test/extended/` directory
Analyze Test Code:
- Use Read tool to examine the test file
- Understand test logic: what is being validated?
- Check for common issues:
- Hardcoded timeouts that are too short
- Missing retries for eventually-consistent operations
- Improper error handling
- Race conditions in resource creation/validation
Assessment:
- If test code has clear issues: Test Automation Issue → File OCPQE bug
- If test correctly validates product behavior: Product Bug → File OCPBUGS issue
- If uncertain: Needs Review → Analyze both test and product behavior
7. Common Failure Patterns:
- Flexy-install fails: Cluster installation timeout, quota exceeded, infrastructure issues
- Runner (Cucumber) fails: Operator functionality tests - check feature file for scenario details
- ginkgo-test fails: E2E test failure - check test code for validation logic
- Flexy-destroy fails: Cleanup timeout - NOT A BLOCKER (cluster cleanup issues don't affect release approval)
8. Categorize Failure Type:
- Infrastructure: Cluster installation, network, quota issues → RETRY
- Product Bug: Operator/feature functionality broken → FAIL (file OCPBUGS)
- Test Automation Issue: Flaky test, improper assertions, race conditions → File OCPQE bug
- Test Flake: Intermittent failure with no clear root cause → NEEDS REVIEW (check history)
- Known Issue: Already tracked bug → Check if waivable
- Cleanup Failure: Flexy-destroy job failed → SAFE TO IGNORE (not a release blocker)
9. Live Cluster Access for Deep Troubleshooting (OPTIONAL but highly valuable):
Check if Cluster is Still Running:
- Stage-pipeline may keep cluster alive after test failures for debugging
- The cluster is only destroyed at the end by the `Flexy-destroy` job
- If `Flexy-destroy` hasn't run or has failed, the cluster may still be accessible
Obtain Kubeconfig:
- The kubeconfig is stored as a Jenkins artifact in the `Flexy-install` job
- URL pattern: https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/Flexy-install/{build_number}/artifact/workdir/install-dir/auth/kubeconfig
- Extract the `Flexy-install` build number from the stage-pipeline console (e.g., `#360923`)
- Download the kubeconfig:

```bash
# Extract Flexy-install build number from stage-pipeline log
# Look for: "Build ocp-common » Flexy-install #360923 completed: SUCCESS"
BUILD_NUM=360923   # Replace with actual build number
curl -s "https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/ocp-common/job/Flexy-install/${BUILD_NUM}/artifact/workdir/install-dir/auth/kubeconfig" -o /tmp/kubeconfig-${BUILD_NUM}
```
Verify Cluster Accessibility:
```bash
export KUBECONFIG=/tmp/kubeconfig-${BUILD_NUM}
# Test cluster connectivity
oc cluster-info
oc get nodes
oc get clusterversion
```
If Cluster is Active - Deep Troubleshooting:
When cluster is accessible, you can perform deep analysis:
For Cucumber Test Failures (Runner jobs):
CRITICAL: Check CatalogSource Health First:
Stage-testing failures often occur because CatalogSource is not READY or operators are not in the index image. ALWAYS check this first:
```bash
# 1. Check CatalogSource connection state (MUST be READY)
oc get catalogsource -n openshift-marketplace -o yaml

# Or use yq to extract connection state directly:
oc get catalogsource redhat-operators -n openshift-marketplace -o yaml | yq -y '.status.connectionState.lastObservedState'
# Expected output: READY

# Check all catalogsources:
for catsrc in $(oc get catalogsource -n openshift-marketplace -o name); do
  echo "Checking $catsrc"
  oc get $catsrc -n openshift-marketplace -o jsonpath='{.metadata.name}: {.status.connectionState.lastObservedState}{"\n"}'
done
# Expected output for all catalogsources: READY

# If NOT READY, check the full status:
oc get catalogsource <catalogsource-name> -n openshift-marketplace -o yaml
```
Interpret CatalogSource Status:
```
# Look for these fields in catalogsource YAML:
status:
  connectionState:
    lastObservedState: READY   # ← MUST be READY
    address: ...               # gRPC connection address

# Common non-READY states and causes:
# - CONNECTING → Still initializing, wait a bit
# - TRANSIENT_FAILURE → Network/registry issues, may recover
# - "" (empty) → CatalogSource pod not running or failed

# If state is not READY, check catalogsource pod:
oc get pods -n openshift-marketplace | grep <catalogsource-name>
oc logs <catalogsource-pod> -n openshift-marketplace
```
Check Subscription Status for Operator Availability:
```bash
# 2. Check subscription status
oc get subscription <operator-subscription> -n openshift-operators -o yaml

# Look for these CRITICAL error patterns in subscription status:
# Under .status.conditions[] array:

# ❌ "constraints not satisfiable: no operators found"
#    → Operator is NOT in the index image (catalogsource)
#    → This is NOT a product bug - operator not included in this release
#    → Action: Verify operator should be in this release's index image

# ❌ "install plan not approved"
#    → Manual approval required (configuration issue)

# ❌ "no package found"
#    → Wrong package name or operator doesn't exist in catalog
```
If "constraints not satisfiable: no operators found":
```bash
# 3. Verify what operators are available in the catalogsource
oc get packagemanifests -n openshift-marketplace | grep <operator-name>

# Alternative method if oc get packagemanifest doesn't work:
# Use opm command to list packages from index image directly
# Tag pattern: ocp-{major.minor} (e.g., ocp-4.15, ocp-4.19)
# Prerequisite: opm command must be available in PATH
opm --skip-tls-verify alpha list packages quay.io/openshift-art/stage-fbc-fragments:ocp-4.15

# Example: Check if specific operator package exists
opm --skip-tls-verify alpha list packages quay.io/openshift-art/stage-fbc-fragments:ocp-4.19 | grep <operator-package-name>

# If operator is NOT listed:
# - Operator is not included in this release's index image
# - This is EXPECTED for optional operators not shipping in this z-stream
# - **NOT a test failure** - test should be skipped or marked as expected
# - Action: Confirm with ART if operator should be in this index

# 4. Check which catalogsource provides the operator (if packagemanifest works)
oc get packagemanifest <operator-name> -n openshift-marketplace -o yaml
# Look at: .status.catalogSource and .status.catalogSourceNamespace
```
After confirming CatalogSource is READY and operator is available:
- Check operator pod status:

```bash
# Find namespace where operator is deployed (e.g., openshift-logging, openshift-storage)
oc get pods -A | grep -i <operator-name>

# Check pod status and events
oc describe pod <pod-name> -n <namespace>
oc logs <pod-name> -n <namespace>

# Check operator resource status
oc get <operator-crd> -n <namespace>
oc describe <operator-crd> <resource-name> -n <namespace>
```

- Verify operator installation:

```bash
oc get clusterserviceversion -n openshift-operators
oc get installplan -n openshift-operators
oc describe subscription <operator-subscription> -n openshift-operators
```

- Check cluster events:

```bash
oc get events -n <namespace> --sort-by='.lastTimestamp'
```
Root Cause Categories for Operator Failures:
- CatalogSource not READY → Infrastructure/index image/registry issue → RETRY or escalate
- Operator not in index ("constraints not satisfiable") → NOT A BUG - verify expected for this release
- InstallPlan pending → OLM configuration issue or approval required
- Operator pod CrashLoopBackOff → Product bug → File OCPBUGS
- Timeout waiting for operator → May be resource constraints or slow deployment
For Ginkgo Test Failures (ginkgo-test jobs):
- Check test namespaces (tests often create temporary namespaces):

```bash
# List recent namespaces
oc get namespaces --sort-by=.metadata.creationTimestamp

# Check test resources in a specific namespace
oc get all -n <test-namespace>
oc get events -n <test-namespace>
```

- Examine failed resources mentioned in the test error:

```bash
# If the test mentions a specific resource (pod, deployment, etc.)
oc get <resource-type> <resource-name> -n <namespace> -o yaml
oc describe <resource-type> <resource-name> -n <namespace>
```

- Check cluster-wide resources if the test involves cluster operators:

```bash
oc get co   # Cluster operators
oc get nodes
oc describe co <operator-name>
```
Common Debugging Patterns:
- Pod CrashLoopBackOff: Check pod logs for stack traces
oc logs <pod-name> -n <namespace> --previous
- ImagePullBackOff: Verify image pullspec and registry access
oc describe pod <pod-name> -n <namespace> | grep -A 5 "Events:"
- Resource Not Ready: Check conditions and events
oc get <resource> -n <namespace> -o jsonpath='{.status.conditions}'
When to Use Live Cluster Access:
- ✅ Test failed with unclear error message → Check actual resource state
- ✅ Timeout waiting for resource → Verify resource exists and check conditions
- ✅ Operator behavior issue → Check operator pods and logs
- ✅ Need to verify if issue is environmental vs product bug
- ❌ Clear test automation issue → No need to check cluster
- ❌ Infrastructure failure during cluster install → Cluster not accessible
Important Notes:
- Cluster may be torn down shortly after job completion
- Kubeconfig access requires no authentication (public Jenkins artifacts)
- Always clean up local kubeconfig files after analysis: `rm /tmp/kubeconfig-*`
- Include cluster findings in the analysis report if the cluster was accessed
10. Extract Error Context and Provide Actionable Analysis:
- ALWAYS fetch and analyze child job console logs for Runner and ginkgo-test failures
- Include failed test file path and scenario/test name in analysis
- If child job console not accessible, note the job URL and recommend manual check
For Cucumber (Runner job failures):
- Provide feature file path and failed scenario
- Repository: verification-tests or cucushift
- DO NOT search in openshift-tests-private (that's for Ginkgo tests only)
For Ginkgo (ginkgo-test job failures):
- Provide the test file path, case ID, and error message
- Repository: openshift-tests-private
- DO NOT search in verification-tests/cucushift (those are for Cucumber tests only)
- Include an assessment of whether this is a test automation issue vs a product bug
- If applicable: provide instructions for accessing the live cluster for deep troubleshooting
Analysis Output Template:
## Jenkins Job Analysis: stage-testing
**Job Information**:
- URL: {jenkins_url}
- Console Log: {console_text_url}
- Build: #{build_number}
- Status: {SUCCESS|FAILURE|UNSTABLE} {emoji}
- Payload: {pull_spec}
- Version: {version}
- Metadata Advisory: {metadata_ad or "N/A (Konflux flow)"}
**Pipeline Analysis**:
### Stage Execution Summary
{Parse console log for all stages}
| Stage Name | Status | Notes |
|------------|--------|-------|
| {stage1} | {✅|❌|⏭️} | {brief note} |
| {stage2} | {✅|❌|⏭️} | {brief note} |
...
### Failed Stages Detail
{For each failed stage:}
#### Stage: {name}
**Status**: ❌ FAILED
**Error Messages**: {Extract relevant error lines from console log}
**Analysis**:
{Interpret what went wrong in this stage}
**Child Jobs** (if any):
- {child_job_url}: {status}
### Child Job Results
{If child jobs detected:}
- Job 1: {url} - {✅ SUCCESS | ❌ FAILURE}
- Job 2: {url} - {✅ SUCCESS | ❌ FAILURE}
...
**Overall Assessment**:
**Root Cause**: {Identify primary reason for failure}
**Failure Category**: {Infrastructure | Product Bug | Test Automation | Configuration}
**Recommendation**: {✅ PASS | ❌ FAIL | 🔄 RETRY}
**Rationale**: {Why you're making this recommendation}
**Next Actions**:
1. {Recommended action}
2. {Recommended action}
Scan console log for common error patterns:
Infrastructure Issues:
- Network timeouts
- Connection refused
- DNS resolution failures
- Quota/resource exhaustion
- Platform API errors
Product Bugs:
- Component crashes
- Assertion failures
- Unexpected behavior
- Regression indicators
Configuration Issues:
- Missing environment variables
- Invalid parameters
- Permission denied
- Missing credentials
Test Automation Issues:
- Test framework errors
- Flaky test indicators
- Race conditions
Group similar errors and identify the most frequent patterns.
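A minimal grouping sketch (the normalization `sed` is illustrative; it collapses hashes and numbers so repeated errors count as one pattern):

```bash
curl -s "{console_text_url}" \
  | grep -iE "error|fail|exception" \
  | sed -E 's/[0-9a-f]{12,}/<hash>/g; s/[0-9]+/<n>/g' \
  | sort | uniq -c | sort -rn | head -20
```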
Present findings in a clear, actionable format:
# Jenkins Job Failure Analysis Summary
## Job Details
- **Type**: stage-testing
- **Build**: #{build_number}
- **Status**: {status}
- **URL**: {jenkins_url}
- **Console Log**: {console_text_url}
## Quick Assessment
{One-paragraph executive summary}
## Detailed Findings
{Include job-specific analysis from step 4}
## Error Patterns
{If errors found, group and present:}
### Pattern 1: {Error type}
- **Occurrences**: {count}
- **Example**: {error message excerpt}
- **Impact**: {High | Medium | Low}
- **Likely Cause**: {hypothesis}
### Pattern 2: {Error type}
...
## Decision Matrix
| Criterion | Status | Details |
|-----------|--------|---------|
| Critical failures | {✅|❌} | {summary} |
| Known issues | {✅|⚠️|❌} | {summary} |
| Infrastructure | {✅|⚠️|❌} | {summary} |
| Blocking bugs | {✅|❌} | {summary} |
## Final Recommendation
{Large, clear emoji: ✅ PASS | ❌ FAIL | ⚠️ NEEDS REVIEW | 🔄 RETRY}
**Rationale**:
{Clear explanation of recommendation}
**Confidence Level**: {High | Medium | Low}
## Required Actions
{If FAIL or NEEDS REVIEW:}
1. **Immediate**: {action}
2. **Before retry**: {action}
3. **Long-term**: {action}
{If PASS:}
No critical actions required. Monitor for:
- {Thing to watch}
## Related Resources
- Console log: {console_text_url}
- Job page: {jenkins_url}
{If bugs mentioned:}
- Related bugs: {bugzilla/jira links}
Console Log Access:
- Publicly accessible, no authentication needed
- Use `/consoleText` for plain text (easier to parse)
- Can be very large - be selective in what you analyze
Token Management Strategy:
- Don't load entire 20K line logs
- Use `tail -n 2000` for recent output
- Use `grep` to filter for ERROR/FAIL/Exception
- Focus on relevant sections based on job type
Job-Specific Behavior:
- Each job has unique output format - don't expect standardization
- Parse console text directly - no JUnit XML available
- Look for job-type-specific markers and sections
Context:
- These jobs are part of OAR z-stream release workflow
- Triggered by commands like: `oar -r 4.19.1 stage-testing`
- Failures may block release approval
Decision Guidelines:
- PASS: No critical issues, or only known/waivable issues
- FAIL: Critical blockers (image inconsistency, unverified CVE bugs, genuine product bugs)
- NEEDS REVIEW: Unclear root cause, needs human judgment
- RETRY: Infrastructure/transient failures
Example:

/ci:analyze-jenkins-failures https://jenkins-csb-openshift-qe-mastern.dno.corp.redhat.com/job/zstreams/job/Stage-Pipeline/1413/

The command will:
- Fetch console log from public endpoint
- Detect it's a stage-testing job
- Parse the structured output sections
- Provide analysis and recommendation