CNV-74265: Hot cluster CI update#4099

Open
galkremer1 wants to merge 42 commits into main from CNV-74265-hot-cluster-ci

Conversation

@galkremer1

@galkremer1 galkremer1 commented Jun 17, 2026

Member

📝 Description

Continues CNV-74265 / PR #3713: move KubeVirt plugin E2E from Prow (new AWS cluster per run) to a persistent IBM Cloud ROKS hot cluster with GitHub Actions + ARC.

This PR rebases Scott's hot-cluster POC onto current main and hardens it for upstream use:

  • Playwright gating tests (replacing Cypress)
  • Workflow cleanup: hot-cluster-e2e.yml (PR + manual entry) and hot-cluster-e2e-run.yml (build plugin, provision test stack, run tests); removed duplicate poc-e2e-ci-test*.yml
  • Secrets: use org IC_KEY for IBM Cloud (replacing IBM_CLOUD_API_KEY)
  • Hardening: secure oc install via install-oc-client.sh, HCO InstallPlan polling, fail-fast runner checks, per-run plugin image tags

Cluster setup/teardown workflows (ibmc-cluster-*.yml), ARC, ci-env-controller, and ci-test-stack Helm charts are included from the POC. See ci-scripts/README.md for architecture and testing steps.

🔗 Links

Jira ticket: https://redhat.atlassian.net/browse/CNV-74265

Related: #3713

🎥 Demo

WIP — first green run pending IBM Cloud IAM completion.

Latest test runs (2026-06-24):

See testing update comment for IAM details and next steps.

Test plan

  • Push branch and open PR against main
  • Run IBM Cloud Hot Cluster Setup if no hot cluster / ARC exists — blocked on Classic infrastructure Super User for IC_KEY
  • Confirm ci-env-controller is running on the cluster (oc get deploy -n ci-env)
  • Run Hot Cluster E2E via manual dispatch
  • Verify Playwright artifacts uploaded and test stack cleaned up

Summary by CodeRabbit

  • New Features

    • Added Hot Cluster E2E GitHub Actions workflows with configurable Playwright test dispatch, richer summaries, and automated diagnostics/log collection.
    • Introduced a ConfigMap-triggered CI test environment controller (Helm-based) that provisions per-run stacks, reports status, and performs TTL-based cleanup/teardown.
    • Added IBM Cloud hot-cluster idle detection with automatic teardown plus manual override.
  • Infrastructure & Automation

    • Added composite actions to request/release CI test environments.
    • Implemented enhanced IBM Cloud cluster setup and teardown, including optional cleanup of offline runners.
  • Tests

    • Added Cypress coverage for virtualization tab/YAML visibility.
  • Documentation

    • Updated hot-cluster and CI environment documentation, including ghost runner cleanup guidance.
  • Chores

    • Refreshed ignore rules and CI build/script behavior for efficiency.

@openshift-ci-robot

openshift-ci-robot commented Jun 17, 2026

Collaborator

@galkremer1: This pull request references CNV-74265 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

📝 Description

Continues CNV-74265 / PR #3713: move KubeVirt plugin E2E from Prow (new AWS cluster per run) to a persistent IBM Cloud ROKS hot cluster with GitHub Actions + ARC.

This PR rebases Scott's hot-cluster POC onto current main and hardens it for upstream use:

  • Playwright gating tests (replacing Cypress)
  • Workflow cleanup: hot-cluster-e2e.yml (PR + manual entry) and hot-cluster-e2e-run.yml (build plugin, provision test stack, run tests); removed duplicate poc-e2e-ci-test*.yml
  • Secrets: use org IC_KEY for IBM Cloud (replacing IBM_CLOUD_API_KEY)
  • Hardening: secure oc install via install-oc-client.sh, HCO InstallPlan polling, fail-fast runner checks, per-run plugin image tags

Cluster setup/teardown workflows (ibmc-cluster-*.yml), ARC, ci-env-controller, and ci-test-stack Helm charts are included from the POC. See docs/HOT_CLUSTER_CI_CONTINUATION.md for architecture and testing steps.

🔗 Links

Jira ticket: https://redhat.atlassian.net/browse/CNV-74265

Related: #3713

🎥 Demo

WIP — pending first green run on kubevirt-ui/kubevirt-plugin after:

  1. IBM Cloud Hot Cluster Setup (if cluster/ARC not ready)
  2. Hot Cluster E2E manual dispatch or PR trigger

Test plan

  • Push branch and open PR against main
  • Run IBM Cloud Hot Cluster Setup if no hot cluster / ARC exists
  • Confirm ci-env-controller is running on the cluster (oc get deploy -n ci-env)
  • Run Hot Cluster E2E via manual dispatch
  • Verify Playwright artifacts uploaded and test stack cleaned up

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: galkremer1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label ("This issue is something we want to fix") Jun 17, 2026
@coderabbitai

coderabbitai Bot commented Jun 17, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR adds CI environment controller assets, new Helm charts, composite actions for requesting and releasing environments, hot-cluster GitHub workflows, IBM Cloud teardown automation, and supporting scripts, docs, and test coverage.

Changes

CI Environment Controller and Test Stack

| Layer | File(s) | Summary |
|---|---|---|
| Controller charts and install flow | ci-scripts/helm/ci-env-controller/..., ci-scripts/helm/ci-test-stack/..., ci-scripts/ci-env/install-ci-env-controller.sh, ci-scripts/images/setup-ci-env-runner-image.sh, ci-scripts/images/ci-env-runner/Dockerfile | Helm metadata, values, helpers, NOTES, Deployments, Services, Routes, ConfigMaps, ServiceAccounts, RBAC templates, and the install path define the controller and test-stack contracts. |
| Controller runtime | ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh | The controller reconciles ConfigMaps, provisions and tears down test stacks, and reaps stale environments. |
| Environment request/release actions | .github/actions/ci-env-request/action.yml, .github/actions/ci-env-release/action.yml | Composite actions create and release CI environments and surface the resulting bridge and console endpoints. |
| Environment coverage | cypress/tests/gating/poc-check-tab-yaml.cy.ts, cypress/tests/poc-gating.cy.ts, ci-scripts/test-cleanup.sh | Cypress coverage and cleanup scripts exercise the new test flow. |

Hot Cluster Workflows

| Layer | File(s) | Summary |
|---|---|---|
| E2E workflow entry and execution | .github/workflows/hot-cluster-e2e.yml, .github/workflows/hot-cluster-e2e-run.yml | The entry workflow validates cluster access and delegates to a reusable run workflow that builds the plugin image, provisions CI environments, runs Playwright, and collects artifacts. |
| Cluster teardown and auto-teardown | .github/workflows/ibmc-cluster-auto-teardown.yml, .github/workflows/ibmc-cluster-teardown.yml | Idle detection, teardown dispatch, cluster deletion, Helm cleanup, and ghost-runner removal are added to the IBM Cloud lifecycle. |
| Cluster setup and diagnostics | .github/workflows/ibmc-cluster-setup.yml, ci-scripts/log-ibmcloud-iam-diagnostics.sh | Cluster provisioning now logs IAM diagnostics, changes flavor and cluster creation handling, and installs the CI environment controller. |

Docs, Helpers, and Local Tooling

| Layer | File(s) | Summary |
|---|---|---|
| Docs, ignores, and helper logic | .dockerignore, .gitignore, .prettierignore, .prettierignorecode, .vscode/settings.json, Dockerfile, ci-scripts/README.md, ci-scripts/POC_OUTLINE.md, ci-scripts/_cluster-helpers.sh, ci-scripts/arc/README.md, ci-scripts/arc/arc-runner-scale-set.pod.yaml, ci-scripts/arc/install-runner-scale-set.sh, ci-scripts/check-cluster-health.sh, ci-scripts/check-roks-cluster-state.sh, ci-scripts/resolve-console-image.sh, ci-scripts/start-console.sh, ci-scripts/start-plugin-container.sh, ci-scripts/nginx-9080.conf, ci-scripts/nginx-9443.conf, start-console.sh | Repository ignores, editor settings, documentation, helper functions, console/image resolution, nginx configs, and local container scripts are updated together. |

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • adamviktora
  • sjd78
  • metalice

@github-actions

github-actions Bot commented Jun 17, 2026

Contributor

Jira Validation Passed

CNV-74265

Check Status Details
Story Points Story points: 8
Fix Version Fix version matches target branch main (expected: 4.23)
Component Component "CNV User Interface" is set
Activity Type Activity Type is set

@github-actions github-actions Bot added the do-not-merge/jira-invalid label ("Automated label for Jira integration") Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
ci-scripts/images/setup-arc-runner-image.sh (1)

121-127: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Restore IMAGE_REF= stdout contract

Line 126 is commented out, so the script no longer emits the IMAGE_REF= line promised at the top of the file. This can break callers that parse stdout.

Proposed fix
 if [[ -n "${ARC_RUNNER_IMAGE_FILE:-}" ]]; then
   printf '%s\n' "${IMAGE_REF}" > "${ARC_RUNNER_IMAGE_FILE}"
   echo "Wrote ${ARC_RUNNER_IMAGE_FILE}"
 fi
-#echo "IMAGE_REF=${IMAGE_REF}"
+echo "IMAGE_REF=${IMAGE_REF}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/images/setup-arc-runner-image.sh` around lines 121 - 127, The echo
statement that outputs the IMAGE_REF variable to stdout is currently commented
out on the line showing `#echo` "IMAGE_REF=${IMAGE_REF}". Uncomment this line to
restore the stdout contract that callers depend on for parsing the image
reference.
ci-scripts/images/setup-ci-env-runner-image.sh (1)

118-124: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing IMAGE_REF= output breaks controller install flow

install-ci-env-controller.sh explicitly greps this script’s stdout for IMAGE_REF= and exits on empty output, so Line 123 being commented out is a hard failure path. Also, the optional file-output variable name is inconsistent with the script header.

Proposed fix
-# TODO: Better handling of passing the fqdn image name to the caller
-if [[ -n "${CI_ENV_RUNNER_IMAGE_FILE:-}" ]]; then
-  printf '%s\n' "${IMAGE_REF}" > "${CI_ENV_RUNNER_IMAGE_FILE}"
-  echo "Wrote ${CI_ENV_RUNNER_IMAGE_FILE}"
-fi
-#echo "IMAGE_REF=${IMAGE_REF}"
+# TODO: Better handling of passing the fqdn image name to the caller
+echo "IMAGE_REF=${IMAGE_REF}"
+CI_ENV_IMAGE_FILE="${CI_ENV_CONTROLLER_IMAGE_FILE:-${CI_ENV_RUNNER_IMAGE_FILE:-}}"
+if [[ -n "${CI_ENV_IMAGE_FILE}" ]]; then
+  printf '%s\n' "${IMAGE_REF}" > "${CI_ENV_IMAGE_FILE}"
+  echo "Wrote ${CI_ENV_IMAGE_FILE}"
+fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/images/setup-ci-env-runner-image.sh` around lines 118 - 124,
Uncomment the line that echoes the IMAGE_REF variable (the `echo
"IMAGE_REF=${IMAGE_REF}"` line) since the install-ci-env-controller.sh script
explicitly greps stdout for the IMAGE_REF= output and will fail if this line
remains commented out. Additionally, review the variable name
CI_ENV_RUNNER_IMAGE_FILE used in the optional file-output block and verify it
matches the naming convention documented in the script header for consistency.
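
For illustration, a minimal sketch of the stdout contract this comment describes, assuming the caller takes the last `IMAGE_REF=` line. `build_image` is a hypothetical stand-in for setup-ci-env-runner-image.sh, `parse_image_ref` is a hypothetical helper, and the image name is invented; the real install-ci-env-controller.sh may parse the output differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for setup-ci-env-runner-image.sh: prints build
# chatter and then the IMAGE_REF= line that callers parse.
build_image() {
  echo "Building image..."
  echo "IMAGE_REF=registry.example.com/ci/ci-env-runner:abc123"
}

# Extract the value of the last IMAGE_REF= line from stdin.
# Prints nothing and returns 1 when no such line exists.
parse_image_ref() {
  local ref
  ref="$(sed -n 's/^IMAGE_REF=//p' | tail -n 1)"
  [ -n "${ref}" ] || return 1
  printf '%s\n' "${ref}"
}

if image_ref="$(build_image | parse_image_ref)"; then
  echo "Using image: ${image_ref}"
else
  echo "ERROR: script did not emit IMAGE_REF=" >&2
fi
```

If the `echo "IMAGE_REF=..."` line is commented out in the producer, `parse_image_ref` returns 1 and the caller takes the error branch, which is the failure mode flagged here.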
.github/workflows/hot-cluster-e2e.yml (1)

93-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid broad secrets: inherit in reusable workflow calls.

This passes all available secrets to the called workflow. Limit to explicit secrets required by hot-cluster-e2e-run.yml to enforce least privilege.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/hot-cluster-e2e.yml around lines 93 - 97, The `secrets:
inherit` statement passes all available secrets to the hot-cluster-e2e-run.yml
reusable workflow, violating the principle of least privilege. Replace `secrets:
inherit` with an explicit mapping that only includes the specific secrets
required by hot-cluster-e2e-run.yml. Use the format `secrets:` followed by
individual secret declarations (e.g., `SECRET_NAME: ${{ secrets.SECRET_NAME }}`)
for each secret actually needed by the called workflow, rather than inheriting
all secrets indiscriminately.

Source: Linters/SAST tools

🟠 Major comments (37)
cypress/tests/gating/poc-check-tab-yaml.cy.ts-107-232 (1)

107-232: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Remove commented-out test code.

Lines 107-232 contain 126 lines of commented-out test code for Templates, InstanceTypes, Bootable volumes, MigrationPolicies, and Checkups. Committed commented-out code is a maintainability issue:

  • It clutters the codebase and reduces readability
  • It creates confusion about whether the code should be used
  • Version control already preserves deleted code if it's needed later

If these tests are work-in-progress, track them in a Jira issue instead. If they're no longer needed, remove them.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cypress/tests/gating/poc-check-tab-yaml.cy.ts` around lines 107 - 232, Remove
all the commented-out test code from the file. Delete the entire block of
commented code that includes the describe blocks for 'Check Templates page',
'Check InstanceTypes tabs', 'Check Bootable volumes page', 'Check
MigrationPolicies page', and 'Check Checkups tabs'. This commented-out code
clutters the codebase and reduces maintainability. If these tests are needed in
the future, they can be recovered from version control history.
cypress/tests/gating/poc-check-tab-yaml.cy.ts-18-18 (1)

18-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace arbitrary time-based waits with condition-based waits.

Lines 18 and 30 use cy.wait() with hardcoded timeouts (15 and 10 seconds), which is an anti-pattern in Cypress tests. According to the coding guidelines, you should use intercept() and wait() to handle network requests with appropriate timeouts.

Instead of arbitrary waits, prefer:

  • Waiting for specific DOM conditions (cy.get(...).should('exist'))
  • Intercepting network requests and waiting for them to complete
  • Using built-in Cypress retry-ability with assertions
♻️ Recommended approach

For line 18, instead of:

- cy.wait(15 * SECOND);

Wait for a specific condition that indicates the VM has processed the start action:

cy.get(sel.vmStatusOnOverview, { timeout: 15 * SECOND }).should('be.visible');

For line 30, the subsequent test already checks the status, so this wait might be redundant or could be replaced with a more specific assertion.

Also applies to: 30-30

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cypress/tests/gating/poc-check-tab-yaml.cy.ts` at line 18, Replace the
arbitrary time-based waits using cy.wait(15 * SECOND) at line 18 and cy.wait(10
* SECOND) at line 30 with condition-based waits that verify specific DOM
elements or network requests. For line 18, replace the cy.wait() call with
cy.get(sel.vmStatusOnOverview, { timeout: 15 * SECOND }).should('be.visible') to
wait for the VM status element to actually appear rather than waiting
arbitrarily. For line 30, either remove the wait if the subsequent test already
validates the expected state, or replace it with a specific DOM or network
condition assertion that confirms the necessary precondition has been met.

Source: Coding guidelines

ci-scripts/install-oc-client.sh-22-24 (1)

22-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prefer OpenShift version fields before masterKubeVersion when deriving OPENSHIFT_VERSION.

At Line 22, prioritizing masterKubeVersion can resolve to a Kubernetes version (e.g., 1.xx), which then builds an invalid OCP client URL at Line 46 (stable-1.xx). Use .openshiftVersion/.version first, then fallback.

Also applies to: 46-46

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/install-oc-client.sh` around lines 22 - 24, The order of field
prioritization in the fallback chain for deriving OPENSHIFT_VERSION is
incorrect. Currently, masterKubeVersion is being checked first, which can
resolve to a Kubernetes version (e.g., 1.xx) instead of an OpenShift version,
causing an invalid OCP client URL to be constructed later (e.g., stable-1.xx at
line 46). Reorder the fallback chain to prioritize openshiftVersion and version
first, then fall back to masterKubeVersion as the last option. This ensures that
OpenShift-specific version fields are used before the generic Kubernetes version
field.
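
A rough sketch of the suggested fallback order, assuming the three fields have already been read from the cluster JSON into shell variables. `pick_openshift_version` and the sample values are hypothetical, not taken from install-oc-client.sh.

```shell
#!/usr/bin/env bash
# Print the major.minor of the first candidate that starts with a
# major.minor version. Callers pass OpenShift-specific fields first and
# masterKubeVersion last. Returns 1 when no candidate qualifies.
pick_openshift_version() {
  local candidate mm
  for candidate in "$@"; do
    mm="$(printf '%s\n' "${candidate}" | sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')"
    if [ -n "${mm}" ]; then
      printf '%s\n' "${mm}"
      return 0
    fi
  done
  return 1
}

# Hypothetical field values as they might be read from the cluster JSON.
openshift_version="4.16.23"
version=""
master_kube_version="1.29.8"

if OPENSHIFT_VERSION="$(pick_openshift_version "${openshift_version}" "${version}" "${master_kube_version}")"; then
  echo "channel: stable-${OPENSHIFT_VERSION}"
fi
```

With only `masterKubeVersion` populated, this still yields a Kubernetes-style version, so a caller may want to reject such a result instead of building a URL from it.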
ci-scripts/check-roks-cluster-state.sh-26-30 (1)

26-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t swallow ibmcloud errors into {} in the polling loop.

Falling back to "{}" masks auth/cluster-not-found/API failures and turns hard errors into long timeouts. Fail fast on command failure (or after a small retry budget) and surface stderr.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/check-roks-cluster-state.sh` around lines 26 - 30, The ibmcloud oc
cluster get command on the CLUSTER_JSON assignment is currently suppressing
errors by falling back to an empty JSON object with "|| echo "{}"". Remove this
fallback mechanism so that authentication failures, cluster-not-found errors,
and API failures cause the script to fail immediately rather than masking the
actual problem. Let the command's stderr be visible to the user and either fail
fast on command failure or implement a small retry budget before giving up,
ensuring that hard errors are surfaced instead of being hidden in a long timeout
loop.
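
One way to get a small retry budget without hiding errors, sketched under assumptions: `retry` is a hypothetical helper, and the `ibmcloud` invocation in the trailing comment uses flags assumed from context, not copied from check-roks-cluster-state.sh.

```shell
#!/usr/bin/env bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Only the stdout of the successful attempt is printed; stderr of every
# attempt is left visible so auth or not-found errors are not masked.
retry() {
  local attempts="$1" delay="$2" n=1 out
  shift 2
  while true; do
    if out="$("$@")"; then
      printf '%s\n' "${out}"
      return 0
    fi
    if [ "${n}" -ge "${attempts}" ]; then
      echo "ERROR: '$*' failed after ${n} attempt(s)" >&2
      return 1
    fi
    echo "WARN: attempt ${n}/${attempts} failed, retrying in ${delay}s" >&2
    n=$((n + 1))
    sleep "${delay}"
  done
}

# Demo with a command that always fails.
always_fails() { echo "simulated auth failure" >&2; return 1; }
retry 2 0 always_fails || echo "giving up instead of polling until timeout"

# Assumed usage in the polling loop (flags not verified against the script):
# CLUSTER_JSON="$(retry 3 5 ibmcloud oc cluster get --cluster "${CLUSTER_NAME}" --output json)" || exit 1
```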
ci-scripts/check-roks-cluster-state.sh-15-16 (1)

15-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate MAX_WAIT/INTERVAL as positive integers before the loop.

INTERVAL=0 causes a non-progressing loop (ELAPSED never increases), and non-numeric values can break sleep/arithmetic paths.

Also applies to: 45-46

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/check-roks-cluster-state.sh` around lines 15 - 16, The MAX_WAIT
and INTERVAL variables are set with default values but lack validation before
being used in the loop, which can cause issues like infinite loops when INTERVAL
is zero or script failures with non-numeric values. Add validation logic after
the variable assignments (at lines 15-16 and also at lines 45-46 as indicated)
to ensure both MAX_WAIT and INTERVAL are positive integers, exiting with an
error message if validation fails. This validation should occur before any loop
or arithmetic operations that depend on these variables.
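
A minimal sketch of such a validation, assuming it runs right after the defaults are assigned. `require_positive_int` is a hypothetical helper and the default values shown are placeholders, not the script's real defaults.

```shell
#!/usr/bin/env bash
# Return 0 if $2 is a positive integer (digits only, no leading zero);
# otherwise print an error naming the variable $1 and return 1.
require_positive_int() {
  local name="$1" value="$2"
  case "${value}" in
    ''|*[!0-9]*|0*)
      echo "ERROR: ${name} must be a positive integer, got '${value}'" >&2
      return 1
      ;;
  esac
  return 0
}

# Placeholder defaults for illustration only.
MAX_WAIT="${MAX_WAIT:-3600}"
INTERVAL="${INTERVAL:-30}"
require_positive_int MAX_WAIT "${MAX_WAIT}" || exit 1
require_positive_int INTERVAL "${INTERVAL}" || exit 1
echo "polling every ${INTERVAL}s for up to ${MAX_WAIT}s"
```

Rejecting a leading zero also keeps values such as `010` from being read as octal in shell arithmetic.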
ci-scripts/_cluster-helpers.sh-35-43 (1)

35-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the route’s targetPort when rewriting internal service URLs.

The current rewrite picks service.spec.ports[0].port, which can mismatch the route backend port. If the route targets a non-first port, the rewritten URL is wrong and downstream CLI downloads fail.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/_cluster-helpers.sh` around lines 35 - 43, The current
implementation extracts the service name from the route but ignores the route's
targetPort specification, instead always using the first port from the service
spec. Modify the jq query that extracts route_info to also capture the route's
targetPort from the spec.to object, then use this extracted targetPort value in
the final URL construction instead of querying the service for
spec.ports[0].port. This ensures the rewritten URL targets the correct backend
port that the route actually specifies.
ci-scripts/arc/arc-runner-rbac.yaml-27-45 (1)

27-45: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Reduce runner RBAC blast radius (cluster-wide write access is too broad).

Granting cluster-scoped patch on virtualmachines plus broad cluster reads to the PR runner SA materially increases impact if a runner job is abused. Scope write permissions to the ephemeral test namespace via namespaced Role/RoleBinding, and keep cluster-wide access strictly read-only and minimal.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/arc/arc-runner-rbac.yaml` around lines 27 - 45, Remove the
cluster-wide write access by eliminating the 'patch' verb from the
virtualmachines resource rule in this ClusterRole, retaining only the read-only
verbs 'get', 'list', and 'watch'. Create a separate namespaced Role bound to the
test namespace that grants the 'patch' verb on virtualmachines resources,
ensuring Cypress cy.exec runs can still patch virtualmachines but only within
the ephemeral test namespace, thus limiting the RBAC blast radius if the runner
is compromised.
ci-scripts/arc/arc-runner-scale-set.pod.yaml-23-23 (1)

23-23: ⚠️ Potential issue | 🟠 Major

Pin runner image to an immutable tag or digest instead of latest.

Using ghcr.io/actions/actions-runner:latest makes CI behavior non-deterministic and can pull unreviewed upstream changes between runs. Replace with a specific version tag or image digest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/arc/arc-runner-scale-set.pod.yaml` at line 23, Replace the image
tag "latest" in the ghcr.io/actions/actions-runner image specification with a
specific version tag or an immutable image digest. Instead of using the "latest"
tag which pulls unpredictable versions, pin the image to a concrete version
(e.g., a specific release tag like "v2.315.0") or use the full image digest hash
to ensure reproducible and deterministic CI behavior across runs.
ci-scripts/install-oc-client.sh-48-54 (1)

48-54: ⚠️ Potential issue | 🟠 Major

Download and verify the oc archive checksum before extraction.

The script lacks integrity validation. OpenShift's mirrors provide sha256sum.txt and sha256sum.txt.gpg files in the same directory. Download the checksum file, verify its GPG signature with the Red Hat release key, then validate the archive hash before extraction:

curl -fsSL -o "${ARCHIVE}" "${URL}"
curl -fsSL -o "${CHECKSUMS}" "${URL%/*}/sha256sum.txt"
curl -fsSL -o "${CHECKSUMS}.gpg" "${URL%/*}/sha256sum.txt.gpg"
gpg --verify "${CHECKSUMS}.gpg" "${CHECKSUMS}"
grep "openshift-client-linux.tar.gz" "${CHECKSUMS}" | sha256sum -c -
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/install-oc-client.sh` around lines 48 - 54, The script downloads
the oc archive but does not verify its integrity before extraction, creating a
security risk. After the successful curl download of ARCHIVE from URL, add steps
to download the sha256sum.txt and sha256sum.txt.gpg files from the same URL
directory, verify the GPG signature of the checksums file using the Red Hat
release key, and then validate the archive hash against the checksums before
proceeding with the tar extraction. Only allow the tar extraction and install
commands to execute if the checksum verification succeeds.
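
The hash comparison step alone could look roughly like the sketch below. It assumes GNU `sha256sum` is available, covers only the checksum check (not the GPG signature verification), and uses a hypothetical `verify_sha256` helper plus a throwaway demo file in place of the real archive.

```shell
#!/usr/bin/env bash
# Verify that the SHA-256 of file $1 matches the entry for name $2 in
# checksum list $3 (lines of "<hash>  <name>", as written by sha256sum).
verify_sha256() {
  local file="$1" name="$2" sums="$3" expected actual
  # Exact match on the name column; a leading "*" marks binary mode.
  expected="$(awk -v n="${name}" '$2 == n || $2 == ("*" n) { print $1; exit }' "${sums}")"
  if [ -z "${expected}" ]; then
    echo "ERROR: no checksum entry for ${name}" >&2
    return 1
  fi
  actual="$(sha256sum "${file}" | awk '{ print $1 }')"
  if [ "${actual}" != "${expected}" ]; then
    echo "ERROR: checksum mismatch for ${file}" >&2
    return 1
  fi
}

# Demo: a local file saved under a different name than the list entry.
tmpdir="$(mktemp -d)"
printf 'demo archive\n' > "${tmpdir}/download.tmp"
( cd "${tmpdir}" && sha256sum download.tmp \
    | sed 's/download.tmp/openshift-client-linux.tar.gz/' > sha256sum.txt )
verify_sha256 "${tmpdir}/download.tmp" "openshift-client-linux.tar.gz" \
  "${tmpdir}/sha256sum.txt" && echo "checksum OK"
rm -rf "${tmpdir}"
```

Comparing the hashes directly lets the archive live at any local path, whereas `sha256sum -c` looks for a file with the name recorded in the list. Matching the name column exactly also avoids the substring and regex matches that a plain `grep` can produce.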
ci-scripts/images/ci-env-runner/helm/ci-test-stack-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Symlinked chart path is causing CI YAML/Helm parse failures

The new symlinked chart location is very likely what triggered the repeated parser errors in CI (invalid YAML at multiple template lines). This duplicates Helm templates under a path that build-time YAML checks appear to scan as plain YAML. Please either exclude this path from the YAML parse step or avoid exposing the chart tree here at repo level.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/images/ci-env-runner/helm/ci-test-stack` at line 1, The symbolic
link pointing from ci-env-runner/helm/ci-test-stack to
../../../helm/ci-test-stack is being picked up and scanned by the CI build-time
YAML validation checks as plain YAML files, which is causing parse failures on
Helm template syntax. Either remove this symlink entirely and use direct file
references where needed, or configure the YAML parsing step in your CI pipeline
to explicitly exclude the ci-scripts/images/ci-env-runner/helm directory from
validation checks so symlinked Helm charts are not treated as raw YAML.

Source: Pipeline failures

.github/workflows/ibmc-cluster-setup.yml-130-138 (1)

130-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid direct ${{ inputs.* }} expansion inside shell commands.

Direct template expansion in run: allows shell-level injection via crafted workflow inputs. Bind inputs via env: and use shell variables instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-setup.yml around lines 130 - 138, The
workflow is directly expanding workflow inputs using ${{ inputs.* }} syntax
within the shell command, which creates a potential security vulnerability for
shell injection. Move the input parameters (worker_count, worker_flavor, and
openshift_version) into environment variables using the env: section above the
run: step, then reference those environment variables using shell variable
syntax (e.g., $WORKER_COUNT instead of ${{ inputs.worker_count }}) within the
ibmcloud oc cluster create classic command to safely pass the inputs to the
shell.

Source: Linters/SAST tools
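
To make the risk concrete, here is a small shell simulation of the two approaches. It is an illustration only: the `malicious` value is invented, and the string built in `rendered` merely imitates how an expression is pasted into the script text before the shell parses it.

```shell
#!/usr/bin/env bash
# A crafted input value (hypothetical).
malicious='2"; echo INJECTED; echo "'

# Unsafe: imitates template expansion, where the value becomes part of the
# script text, so the embedded quote and semicolons are parsed as code.
rendered="echo \"workers=${malicious}\""
unsafe_output="$(sh -c "${rendered}")"

# Safe: the value travels as an environment variable and stays plain data.
safe_output="$(WORKER_COUNT="${malicious}" sh -c 'printf "workers=%s\n" "${WORKER_COUNT}"')"

printf 'unsafe:\n%s\n' "${unsafe_output}"
printf 'safe:\n%s\n' "${safe_output}"
```

In the unsafe case the injected `echo INJECTED` runs as its own command; in the safe case the whole value is printed verbatim as part of the `workers=` line.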

.github/workflows/ibmc-cluster-auto-teardown.yml-153-153 (1)

153-153: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove template expansion from inline JavaScript.

Injecting ${{ env.CLUSTER_NAME }} directly into JS source introduces script-injection risk. Read process.env.CLUSTER_NAME inside the script instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-auto-teardown.yml at line 153, The inline
JavaScript code at line 153 in ibmc-cluster-auto-teardown.yml has a script
injection vulnerability because cluster_name is being set with the GitHub
Actions template expansion syntax ${{ env.CLUSTER_NAME }} directly in the
JavaScript source. Remove this template expansion from the cluster_name
assignment and instead access the environment variable directly within the
script using process.env.CLUSTER_NAME to safely read the environment variable
value at runtime.

Source: Linters/SAST tools

.github/workflows/hot-cluster-e2e-run.yml-29-29 (1)

29-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include github.run_attempt in ephemeral resource names.

TEST_NS and CI_ENV_CM currently key only on github.run_id, so reruns can collide with stale resources from earlier attempts.

Suggested fix
-  TEST_NS: kubevirt-plugin-ci-test-${{ github.run_id }}
+  TEST_NS: kubevirt-plugin-ci-test-${{ github.run_id }}-${{ github.run_attempt }}
@@
-  CI_ENV_CM: ci-env-${{ github.run_id }}
+  CI_ENV_CM: ci-env-${{ github.run_id }}-${{ github.run_attempt }}

Also applies to: 35-35

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/hot-cluster-e2e-run.yml at line 29, The TEST_NS and
CI_ENV_CM ephemeral resource names only use github.run_id, which causes
collisions when a workflow is retried since multiple attempts share the same
run_id. Update both the TEST_NS variable definition (around line 29) and the
CI_ENV_CM variable definition (around line 35) to include github.run_attempt in
addition to github.run_id, ensuring each workflow attempt gets a unique resource
name by appending the attempt number to the existing identifier.
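
The same naming scheme can be sketched at the shell level, assuming the default `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` environment variables that GitHub Actions sets on runners. `attempt_scoped_name` is a hypothetical helper and the fallback values exist only so the snippet runs locally.

```shell
#!/usr/bin/env bash
# Fallbacks are only for running this outside GitHub Actions.
run_id="${GITHUB_RUN_ID:-1234567890}"
run_attempt="${GITHUB_RUN_ATTEMPT:-1}"

# Build a per-attempt resource name from a prefix.
attempt_scoped_name() {
  printf '%s-%s-%s\n' "$1" "${run_id}" "${run_attempt}"
}

TEST_NS="$(attempt_scoped_name kubevirt-plugin-ci-test)"
CI_ENV_CM="$(attempt_scoped_name ci-env)"
echo "TEST_NS=${TEST_NS}"
echo "CI_ENV_CM=${CI_ENV_CM}"
```

A rerun keeps the same run ID but increments the attempt number, so each attempt gets distinct names. With these prefixes the results stay well under the 63-character limit for namespace names.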
.github/workflows/ibmc-cluster-teardown.yml-34-34 (1)

34-34: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Job timeout is shorter than the deletion wait window.

The job times out at 60 minutes, but the script intentionally waits up to 120 minutes. This can terminate teardown mid-flight and produce noisy failures.

Also applies to: 79-96

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-teardown.yml at line 34, The job timeout is
set to 60 minutes but the deletion script waits up to 120 minutes, causing
premature job termination. Increase the timeout-minutes value at line 34 from 60
to at least 120 to accommodate the full deletion wait window. Also check and
apply the same fix to the timeout settings in lines 79-96 where this issue also
applies to ensure all teardown jobs have sufficient time to complete.
.github/actions/ci-env-request/action.yml-34-49 (1)

34-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make trigger ConfigMap creation idempotent.

Using oc create causes immediate failure on retries/reruns when the ConfigMap already exists, which blocks environment provisioning.

Suggested fix
-        cat <<EOF | oc create -f -
+        cat <<EOF | oc apply -f -
         apiVersion: v1
         kind: ConfigMap
         metadata:
           name: ${{ inputs.configmap-name }}
           namespace: ${{ inputs.ci-env-namespace }}
           labels:
             ci.kubevirt-plugin/type: test-environment
         data:
           desired-state: "present"
           plugin-image: "${{ inputs.plugin-image }}"
           test-namespace: "${{ inputs.test-namespace }}"
         EOF
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/ci-env-request/action.yml around lines 34 - 49, The "Create
trigger ConfigMap" step uses `oc create` which will fail if the ConfigMap
already exists, breaking idempotency on retries or reruns. Replace the `oc
create -f -` command with `oc apply -f -` to make the operation idempotent,
allowing the ConfigMap to be created if it doesn't exist or updated if it
already does.
.github/workflows/hot-cluster-e2e-run.yml-289-293 (1)

289-293: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not inject ${{ inputs.test_project }} directly into bash code.

This is template-injection prone. Bind test_project via env: and compare the environment variable in shell.

Suggested fix
       - name: Run Playwright gating tests
         env:
           BRIDGE_BASE_ADDRESS: ${{ steps.ci-env.outputs.bridge-base-address }}
+          TEST_PROJECT: ${{ inputs.test_project }}
         run: |
-          if [[ "${{ inputs.test_project }}" == "features" ]]; then
+          if [[ "${TEST_PROJECT}" == "features" ]]; then
             npm run test-playwright-headless -- --project=features
           else
             npm run test-playwright-headless -- --project=gating
           fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/hot-cluster-e2e-run.yml around lines 289 - 293, The direct
injection of ${{ inputs.test_project }} into the bash conditional is a
template-injection vulnerability. Instead, add an `env:` section at the step
level to bind the input value to an environment variable (e.g., TEST_PROJECT),
then replace the direct template injection ${{ inputs.test_project }} with the
environment variable reference $TEST_PROJECT in the conditional comparison.

Source: Linters/SAST tools
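
The difference between the two forms can be reproduced locally. The sketch below simulates inline interpolation by pasting a crafted value into script text, then passes the same value through an environment variable. The payload is a hypothetical example and is not taken from this PR.

```shell
# Crafted value standing in for an attacker-controlled workflow input (hypothetical).
payload='x" == "x" ]]; echo INJECTED; [[ "features'

# Unsafe: the value is pasted into the script text before bash parses it,
# which is what an inline ${{ ... }} expression does.
unsafe_output="$(bash -c "if [[ \"${payload}\" == \"features\" ]]; then echo features; else echo gating; fi")"

# Safe: the value arrives as an environment variable and is only compared as data.
safe_output="$(TEST_PROJECT="${payload}" bash -c 'if [[ "${TEST_PROJECT}" == "features" ]]; then echo features; else echo gating; fi')"

printf 'unsafe: %s\n' "${unsafe_output}"
printf 'safe:   %s\n' "${safe_output}"
```

In the unsafe variant the injected `echo` runs as part of the script; in the safe variant the same string is treated as an ordinary value and the comparison falls through to `gating`.
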

.github/workflows/ibmc-cluster-auto-teardown.yml-35-35 (1)

35-35: ⚠️ Potential issue | 🟠 Major

Pin third-party actions to immutable commit SHAs.

Replace version tags with commit SHAs:

  • Line 35: actions/github-script@v8
  • Line 86: IBM/actions-ibmcloud-cli@v1
  • Line 144: actions/github-script@v8

Version tags are mutable and can change unexpectedly, compromising build determinism and auditability. Pinning to specific commit SHAs ensures reproducible and secure CI/CD workflows.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-auto-teardown.yml at line 35, Replace the
mutable version tag references with immutable commit SHAs for all third-party
GitHub Actions in the workflow file. For the actions/github-script action at
line 35 and line 144, replace `@v8` with the corresponding commit SHA. For the
IBM/actions-ibmcloud-cli action at line 86, replace `@v1` with the corresponding
commit SHA. Each uses field should reference a specific commit hash instead of a
version tag to ensure reproducible and secure CI/CD workflows.

Source: Linters/SAST tools
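
One way to find the commit behind a tag is to read it from `git ls-remote --tags`. The sketch below is a hypothetical helper (`pick_tag_commit` is not part of this repository) that parses that output; it assumes the usual `<sha><TAB><ref>` line format.

```shell
# Hypothetical helper: read `git ls-remote --tags` output on stdin and print
# the commit SHA for one tag. Annotated tags carry a peeled "<ref>^{}" entry
# that points at the commit, so that entry wins when present.
pick_tag_commit() {
  awk -v ref="refs/tags/$1" '
    $2 == (ref "^{}") { peeled = $1 }
    $2 == ref         { direct = $1 }
    END { print (peeled != "" ? peeled : direct) }
  '
}

# Usage (needs network access):
#   git ls-remote --tags https://github.com/actions/github-script | pick_tag_commit v8
```

The printed SHA should still be checked against the action's release page before it goes into a `uses:` line.
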

.github/actions/ci-env-release/action.yml-49-58 (1)

49-58: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not delete the trigger ConfigMap when cleanup times out.

Current flow deletes the ConfigMap even after timeout, which can prevent the controller from finishing teardown and leak cluster resources.

Suggested fix
         echo "Waiting for controller to clean up..."
         INTERVAL=5
         ELAPSED=0
+        CLEANED=false
 
         while true; do
           STATUS="$(oc get configmap "${CM_NAME}" -n "${CM_NS}" \
             -o jsonpath='{.data.status}' 2>/dev/null || echo "")"
 
           if [[ "${STATUS}" == "cleaned" ]]; then
             echo "Cleanup complete."
+            CLEANED=true
             break
           fi
@@
           if (( ELAPSED >= TIMEOUT )); then
             echo "::warning::Timed out waiting for controller cleanup (status=${STATUS})"
             break
           fi
@@
         done
 
-        oc delete configmap "${CM_NAME}" -n "${CM_NS}" 2>/dev/null || true
+        if [[ "${CLEANED}" == "true" ]]; then
+          oc delete configmap "${CM_NAME}" -n "${CM_NS}" 2>/dev/null || true
+        else
+          echo "::warning::Skipping ConfigMap deletion because cleanup did not reach status=cleaned"
+        fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/ci-env-release/action.yml around lines 49 - 58, The
ConfigMap deletion using oc delete configmap "${CM_NAME}" -n "${CM_NS}" is
executed unconditionally after the cleanup loop, even when the loop breaks on
timeout (ELAPSED >= TIMEOUT). To fix this, delete the ConfigMap only when
cleanup reaches status=cleaned. Either move the deletion into the loop's success
branch, just before the break that follows "Cleanup complete.", or track success
in a flag and wrap the deletion in a condition on that flag. This prevents
resource leaks by letting the controller finish its teardown before the
ConfigMap is removed.
.github/workflows/ibmc-cluster-setup.yml-50-53 (1)

50-53: ⚠️ Potential issue | 🟠 Major

Pin external actions to commit SHAs.

Floating refs (@v5, @v1) are mutable and vulnerable to tag manipulation and malicious code injection, weakening workflow supply-chain integrity. Pin each action to its full-length commit SHA with the version tag as a comment for readability:

uses: actions/checkout@<full-sha> # v5
uses: IBM/actions-ibmcloud-cli@<full-sha> # v1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-setup.yml around lines 50 - 53, Pin the
GitHub actions to their full-length commit SHAs for security. Replace the
floating version refs in the uses clauses: change actions/checkout@v5 to
actions/checkout@<full-commit-sha> # v5 and IBM/actions-ibmcloud-cli@v1 to
IBM/actions-ibmcloud-cli@<full-commit-sha> # v1, where the full-commit-sha is
the complete 40-character commit hash for each action. Include the version tag
as a comment after each SHA for readability and maintainability.

Source: Linters/SAST tools

.github/workflows/hot-cluster-e2e.yml-53-56 (1)

53-56: ⚠️ Potential issue | 🟠 Major

Pin workflow actions to commit SHAs instead of floating tags.

actions/checkout@v6 and IBM/actions-ibmcloud-cli@v1 use mutable floating tags that can be retargeted by maintainers, introducing supply chain attack risk. Pin to full-length commit SHAs (with version tags in comments for readability):

uses: actions/checkout@<full-commit-sha> # v6
uses: IBM/actions-ibmcloud-cli@<full-commit-sha> # v1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/hot-cluster-e2e.yml around lines 53 - 56, Replace the
mutable floating version tags in both the `actions/checkout` and
`IBM/actions-ibmcloud-cli` action references with their full-length commit SHAs
to prevent supply chain attacks. For each action, change from the format
`uses: <action>@v<version>` to `uses: <action>@<commit-sha>` and add a comment
with the version tag for readability using the format `# v<version>`. Look up
the current commit SHAs for checkout@v6 and ibmcloud-cli@v1 from their
respective repositories and update both action references accordingly.

Source: Linters/SAST tools

.github/workflows/ibmc-cluster-setup.yml-123-138 (1)

123-138: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only pass VLAN flags when values are present.

The script says “new VLANs will be created” when none exist, but still passes empty --private-vlan/--public-vlan values, which can fail cluster creation.

Suggested fix
-          ibmcloud oc cluster create classic \
+          VLAN_ARGS=()
+          [[ -n "${PRIVATE_VLAN}" ]] && VLAN_ARGS+=(--private-vlan "${PRIVATE_VLAN}")
+          [[ -n "${PUBLIC_VLAN}"  ]] && VLAN_ARGS+=(--public-vlan "${PUBLIC_VLAN}")
+
+          ibmcloud oc cluster create classic \
             --name "${CLUSTER_NAME}" \
             --version "${{ inputs.openshift_version }}" \
             --flavor "${{ inputs.worker_flavor }}" \
             --workers "${{ inputs.worker_count }}" \
-            --zone "${ZONE}" \
-            --private-vlan "${PRIVATE_VLAN}" \
-            --public-vlan "${PUBLIC_VLAN}"
+            --zone "${ZONE}" \
+            "${VLAN_ARGS[@]}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-setup.yml around lines 123 - 138, The
ibmcloud oc cluster create classic command always passes the --private-vlan and
--public-vlan flags even when PRIVATE_VLAN and PUBLIC_VLAN are empty, which
prevents IBM Cloud from automatically creating new VLANs. Conditionally append
these flags to the ibmcloud command only when the PRIVATE_VLAN and PUBLIC_VLAN
variables are non-empty, similar to the conditional logic already present in the
if statement that checks for existing VLANs, so that new VLANs are created
automatically when neither variable is set.
.github/workflows/ibmc-cluster-teardown.yml-61-61 (1)

61-61: ⚠️ Potential issue | 🟠 Major

Replace curl | bash with official azure/setup-helm action.

The current implementation downloads and executes unverified remote code at runtime. Use the official azure/setup-helm action (e.g., azure/setup-helm@v5) instead, which provides secure, version-pinned installation with no script execution. This eliminates the remote code execution risk while improving reliability through built-in caching and proper environment configuration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-teardown.yml at line 61, Replace the inline
curl piped to bash pattern that downloads and executes the helm installation
script from the remote GitHub repository with the official azure/setup-helm
action version 5 or later. Remove the condition checking for the helm command
and the curl command piping to bash, and instead use the azure/setup-helm@v5
action as a dedicated workflow step which provides secure, version-pinned
installation without remote script execution.
.github/workflows/ibmc-cluster-teardown.yml-37-37 (1)

37-37: ⚠️ Potential issue | 🟠 Major

Pin IBM/actions-ibmcloud-cli to a commit SHA instead of @v1.

Version tags are mutable and allow the action to be updated without your knowledge, creating supply-chain risk. Use the full commit SHA with the version tag as a comment for readability, following the pattern already established in this repository (e.g., IBM/actions-ibmcloud-cli@<commit-sha> # v1).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ibmc-cluster-teardown.yml at line 37, The
IBM/actions-ibmcloud-cli action on line 37 uses a mutable version tag (`@v1`)
which creates supply-chain risk. Replace the `@v1` tag with the full commit SHA
for that version, and add a comment with the version tag for readability in the
format: IBM/actions-ibmcloud-cli@<full-commit-sha> # v1. Look for similar
patterns already established elsewhere in the repository workflows to determine
the correct commit SHA to use for v1.

Source: Linters/SAST tools

ci-scripts/helm/ci-env-controller/templates/clusterrole-console.yaml-2-2 (1)

2-2: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Current binding model gives the console SA cluster-wide write access.

Line 2 documents that this role is bound via ClusterRoleBinding. Combined with the mutating verbs on lines 61-73 (create/update/patch/delete), that grants write access across all namespaces, not only the CI test namespace.

A safer pattern is:

  1. keep cluster-scoped read permissions in this ClusterRole, and
  2. move mutating namespaced permissions to a namespaced Role bound with RoleBinding per test namespace.

Also applies to: 61-73

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-env-controller/templates/clusterrole-console.yaml` at line
2, The ClusterRole for the console ServiceAccount currently grants cluster-wide
write access through mutating verbs (create, update, patch, delete) on
namespaced resources. Refactor this by removing the mutating verb definitions
from lines 61-73 in the ClusterRole, keeping only the read-only permissions
(get, list, watch). Create a separate namespaced Role with the mutating verbs
(create, update, patch, delete) that will be bound per CI test namespace using a
RoleBinding instead of ClusterRoleBinding. This limits write access to only the
specific test namespace while maintaining cluster-wide read access where needed.
ci-scripts/helm/ci-env-controller/templates/clusterrole-controller.yaml-11-13 (1)

11-13: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove mutating permissions from nodes access.

Lines 11-13 combine nodes and namespaces under create/update/patch/delete verbs. For cluster discovery, nodes should be read-only; keeping write verbs here over-privileges the controller.

🔧 Suggested RBAC split
-  - apiGroups: ['']
-    resources: ['nodes', 'namespaces']
-    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
+  - apiGroups: ['']
+    resources: ['nodes']
+    verbs: ['get', 'list', 'watch']
+  - apiGroups: ['']
+    resources: ['namespaces']
+    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-env-controller/templates/clusterrole-controller.yaml`
around lines 11 - 13, The ClusterRole rule is granting unnecessary write
permissions (create, update, patch, delete) to the nodes resource. Split this
rule into two separate rules: one for nodes with only read-only verbs (get,
list, watch), and another for namespaces with the appropriate verbs. This
ensures the controller can only read node information for cluster discovery
without being able to modify nodes.
ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh-82-109 (1)

82-109: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail provisioning when proxy route is unavailable.

Route creation errors are currently logged and ignored, then provisioning continues with a potentially invalid PLUGIN_PROXY_ENDPOINT. This can mark environments as provisioned while core traffic routing is broken.

Suggested fix
 ensure_proxy_route() {
@@
   if oc get route "${route_name}" -n "${route_ns}" &>/dev/null; then
     log "Proxy route already exists in ${route_ns}"
   else
     log "Creating proxy route ${route_name} in ${route_ns}..."
-    cat <<EOF | oc create -f - 2>/dev/null || log "Proxy route create skipped (may already exist or namespace missing)"
+    cat <<EOF | oc create -f - 2>/dev/null || true
 apiVersion: route.openshift.io/v1
@@
 EOF
   fi
+
+  if ! oc get route "${route_name}" -n "${route_ns}" &>/dev/null; then
+    log "ERROR: proxy route ${route_name} is unavailable in ${route_ns}"
+    return 1
+  fi
 
   PLUGIN_PROXY_ENDPOINT="https://${proxy_host}"
 }
@@
-  ensure_proxy_route
+  ensure_proxy_route || {
+    patch_cm "${cm_name}" '{"data":{"status":"error","error-message":"proxy route unavailable"}}'
+    return 1
+  }

Also applies to: 145-146

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh` around lines
82 - 109, The proxy route creation failure is silently logged and ignored,
allowing the script to continue setting PLUGIN_PROXY_ENDPOINT with a potentially
invalid value. After the oc create command in the route creation block, add a
verification check using oc get route to confirm the route was actually created
successfully. If the verification fails, exit the script with an error instead
of continuing provisioning. Only set the PLUGIN_PROXY_ENDPOINT variable after
confirming the route exists and is accessible.
ci-scripts/helm/ci-env-controller/templates/role-trigger.yaml-11-14 (1)

11-14: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Scope runner ConfigMap permissions away from controller-owned objects.

This Role grants the runner SA full CRUD on every ConfigMap in the controller namespace. A compromised or buggy runner can modify/delete ci-env-controller-script and other non-trigger ConfigMaps, causing controller outage or tampering.

Suggested direction
  • Isolate trigger ConfigMaps into a dedicated namespace and bind this Role there.
  • Keep controller operational ConfigMaps (script/config) in a separate namespace not writable by runner SAs.
  • If namespace split is deferred, at minimum remove nonessential verbs (especially delete) to reduce immediate blast radius.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-env-controller/templates/role-trigger.yaml` around lines
11 - 14, The Role in role-trigger.yaml grants excessive permissions that allow
the runner SA to modify or delete all ConfigMaps in the namespace, including
critical controller ConfigMaps like ci-env-controller-script. Restrict the RBAC
permissions by either implementing namespace segregation (isolate trigger
ConfigMaps to a dedicated namespace and keep controller ConfigMaps in a separate
namespace) or at minimum reduce the permission scope by removing the delete verb
and using resourceNames or label selectors to limit access to only
trigger-related ConfigMaps that the runner actually needs to interact with.
ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh-174-176 (1)

174-176: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add timeout bounds to readiness polling curl.

The loop uses curl without --connect-timeout/--max-time; a single hung request can block reconciliation indefinitely.

Suggested fix
-    if curl -s -o /dev/null -w "%{http_code}" "${bridge_base}/" 2>/dev/null | grep -qE '200|301|302'; then
+    if curl -sS --connect-timeout 2 --max-time 5 \
+      -o /dev/null -w "%{http_code}" "${bridge_base}/" 2>/dev/null | grep -qE '200|301|302'; then
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-env-controller/scripts/ci-env-controller.sh` around lines
174 - 176, The curl command in the readiness polling loop that checks the
bridge_base URL lacks timeout bounds, which can cause the loop to hang
indefinitely if a request stalls. Add the --connect-timeout and --max-time flags
to the curl command that checks "${bridge_base}/" to ensure that individual
requests have reasonable time limits and cannot block reconciliation. These
timeout values should be reasonable for your use case, typically in the range of
5-10 seconds each.
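
Per-request timeouts bound a single call; the loop as a whole also needs a budget. A minimal sketch of such a bounded poll follows, assuming a hypothetical `poll_until` helper that is not part of the controller script.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds or the
# overall budget (in seconds) is spent. Usage: poll_until TIMEOUT INTERVAL CMD...
poll_until() {
  timeout_s="$1"; interval_s="$2"; shift 2
  elapsed_s=0
  until "$@"; do
    if [ "${elapsed_s}" -ge "${timeout_s}" ]; then
      return 1
    fi
    sleep "${interval_s}"
    elapsed_s=$((elapsed_s + interval_s))
  done
}

# Usage (bridge_base comes from the controller script; values are illustrative):
#   poll_until 300 5 curl -sS -f --connect-timeout 2 --max-time 5 -o /dev/null "${bridge_base}/"
```

The budget counts only the sleep intervals, so the worst-case wall time is somewhat longer than `TIMEOUT` once request time is added. With `-f`, curl exits non-zero for HTTP responses of 400 and above, which is slightly broader than the original `200|301|302` match.
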
ci-scripts/helm/ci-test-stack/templates/console-sa.yaml-9-20 (1)

9-20: ⚠️ Potential issue | 🟠 Major

Avoid creating a legacy long-lived service-account token Secret.

Pre-creating a kubernetes.io/service-account-token Secret introduces a persistent credential object and weakens token lifecycle security. The token is currently injected into the console Deployment as an environment variable (BRIDGE_K8S_AUTH_BEARER_TOKEN). Prefer projected bound tokens (using volumes with serviceAccountToken projection and volumeMounts) or explicit TokenRequest at runtime, though this would require the console application to read the token from a mounted file rather than an environment variable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-test-stack/templates/console-sa.yaml` around lines 9 - 20,
The legacy kubernetes.io/service-account-token Secret resource in the
console-sa.yaml template creates a persistent credential and weakens security.
Remove the entire Secret resource definition that uses type
kubernetes.io/service-account-token. Instead, update the console Deployment to
use projected bound tokens by adding a volumes section with a
serviceAccountToken projection and corresponding volumeMounts, removing the
BRIDGE_K8S_AUTH_BEARER_TOKEN environment variable injection. This requires the
console application to read the token from the mounted file path instead of from
an environment variable.
ci-scripts/helm/ci-test-stack/templates/_helpers.tpl-29-29 (1)

29-29: ⚠️ Potential issue | 🟠 Major

Remove hardcoded cluster.local from the plugin service URL.

Line 29 hardcodes cluster.local in the pluginUrl helper, which fails on clusters using a non-default service DNS domain. Use just .svc instead, allowing Kubernetes DNS search path resolution.

Suggested change
- http://{{ include "ci-test-stack.pluginName" . }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.plugin.port }}
+ http://{{ include "ci-test-stack.pluginName" . }}.{{ .Release.Namespace }}.svc:{{ .Values.plugin.port }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-test-stack/templates/_helpers.tpl` at line 29, The
pluginUrl helper template in the _helpers.tpl file hardcodes the Kubernetes DNS
domain as cluster.local in the plugin service URL, which breaks on clusters
using non-default service DNS domains. Modify the URL construction to remove the
hardcoded cluster.local domain and use only .svc, allowing Kubernetes DNS search
path resolution to automatically discover the correct domain based on the
cluster configuration.
ci-scripts/helm/ci-test-stack/templates/console-deployment.yaml-44-50 (1)

44-50: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add BRIDGE_K8S_AUTH=bearer-token to match the bearer-token flow.

Line 46 injects a bearer token, but the deployment never sets BRIDGE_K8S_AUTH. That creates a config mismatch versus the expected off-cluster auth contract and can cause console API auth failures.

Suggested fix
             - name: BRIDGE_USER_AUTH
               value: "disabled"
+            - name: BRIDGE_K8S_AUTH
+              value: "bearer-token"
             - name: BRIDGE_K8S_AUTH_BEARER_TOKEN
               valueFrom:
                 secretKeyRef:
                   name: {{ include "ci-test-stack.consoleName" . }}-token
                   key: token
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/helm/ci-test-stack/templates/console-deployment.yaml` around lines
44 - 50, The deployment injects a bearer token via BRIDGE_K8S_AUTH_BEARER_TOKEN
but does not set the BRIDGE_K8S_AUTH environment variable to specify the
authentication mechanism. Add a new environment variable entry for
BRIDGE_K8S_AUTH with value "bearer-token" in the env section of the
console-deployment template, positioned alongside the other auth-related
variables like BRIDGE_USER_AUTH and BRIDGE_K8S_AUTH_BEARER_TOKEN, to properly
configure the off-cluster authentication contract and prevent console API auth
failures.
start-console.sh-99-100 (1)

99-100: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid eval when consuming resolve-console-image.sh output.

Line 99 executes arbitrary shell text if the helper output is ever malformed or tainted. Parse CONSOLE_IMAGE= explicitly instead of evaluating.

Suggested fix
-eval "$(bash ./ci-scripts/resolve-console-image.sh)" || true
+resolved_console_image="$(bash ./ci-scripts/resolve-console-image.sh 2>/dev/null || true)"
+case "${resolved_console_image}" in
+  CONSOLE_IMAGE=*) CONSOLE_IMAGE="${resolved_console_image#CONSOLE_IMAGE=}" ;;
+esac
 CONSOLE_IMAGE="${CONSOLE_IMAGE:-quay.io/openshift/origin-console:latest}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@start-console.sh` around lines 99 - 100, The eval statement on line 99 poses
a security risk by executing arbitrary shell code if the output from
resolve-console-image.sh is malformed or tainted. Replace the eval approach by
capturing the output of the resolve-console-image.sh script into a variable and
then explicitly extracting the CONSOLE_IMAGE value from that output using grep
or parameter expansion, rather than evaluating it as shell code. This ensures
only the CONSOLE_IMAGE assignment is processed while safely handling any
potentially malicious or malformed output.
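
A sketch of the explicit-parse approach, assuming the resolver prints an unquoted `CONSOLE_IMAGE=<value>` line somewhere in its output. The `parse_console_image` helper and the sample output are hypothetical; the `:4.16` tag is illustrative only.

```shell
# Hypothetical helper: print the value of the first CONSOLE_IMAGE=<value> line
# on stdin and ignore everything else. Nothing from the input is executed.
parse_console_image() {
  sed -n 's/^CONSOLE_IMAGE=//p' | head -n 1
}

# Simulated resolver output with a stray log line and a hostile line.
resolver_output='resolving console image...
CONSOLE_IMAGE=quay.io/openshift/origin-console:4.16
$(echo should-not-run)'

CONSOLE_IMAGE="$(printf '%s\n' "${resolver_output}" | parse_console_image)"
# Same fallback as the suggested fix: use :latest when nothing was resolved.
CONSOLE_IMAGE="${CONSOLE_IMAGE:-quay.io/openshift/origin-console:latest}"
echo "CONSOLE_IMAGE=${CONSOLE_IMAGE}"
```

Unlike a `case` match on the whole output, this form tolerates extra log lines before or after the assignment, at the cost of spawning `sed` and `head`.
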
ci-scripts/install-hco.sh-141-145 (1)

141-145: ⚠️ Potential issue | 🟠 Major

Pin HPP manifest source to an immutable git ref (tag/SHA).

The configured default release-v0.21 is a mutable branch reference, making installs non-reproducible and vulnerable to upstream drift between runs. Use a specific tag or commit SHA instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/install-hco.sh` around lines 141 - 145, The HPP_VERSION variable
is currently set to a mutable branch reference (release-v0.21) which causes
non-reproducible installations and vulnerability to upstream drift. Change the
HPP_VERSION variable to use an immutable git reference instead, such as a
specific semantic version tag (like v0.21.0) or a commit SHA. This will ensure
that both the hostpathprovisioner_cr.yaml and storageclass-wffc-csi.yaml
manifests are fetched from a stable, unchanging source across multiple
installation runs.
ci-scripts/install-hco.sh-147-149 (1)

147-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Default StorageClass mutation can leave the cluster without any default.

Line 147 clears defaults globally before confirming hostpath-csi is present/writable. A failure on Line 148 can break subsequent namespace provisioning on the shared cluster.

Suggested fix
-  oc annotate storageclasses --all storageclass.kubernetes.io/is-default-class- || true
-  oc annotate storageclass hostpath-csi storageclass.kubernetes.io/is-default-class='true'
+  if ! oc get storageclass hostpath-csi >/dev/null 2>&1; then
+    echo "ERROR: hostpath-csi StorageClass not found; refusing to clear existing defaults"
+    exit 1
+  fi
+  oc annotate storageclass hostpath-csi storageclass.kubernetes.io/is-default-class='true' --overwrite
+  oc get storageclass -o name | grep -v '^storageclass.storage.k8s.io/hostpath-csi$' | \
+    xargs -r oc annotate storageclass storageclass.kubernetes.io/is-default-class- || true
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/install-hco.sh` around lines 147 - 149, The issue is that line 147
removes all default StorageClass annotations globally, and if the subsequent
command on line 148 (which sets hostpath-csi as the default) fails, the cluster
is left without any default StorageClass, breaking namespace provisioning.
Before removing all default annotations on line 147, first verify that the
hostpath-csi StorageClass exists and is accessible. Reorder the commands so that
hostpath-csi availability is confirmed and set as default before clearing the
global defaults, or add error handling to ensure a default StorageClass is
always maintained even if the hostpath-csi annotation command fails.
ci-scripts/start-console.sh-94-99 (1)

94-99: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CONSOLE_IMAGE fallback is missing after resolver failure.

Line 96 swallows resolver errors, but Line 98 immediately hard-fails if CONSOLE_IMAGE is unset. That contradicts the fallback behavior documented in the block comment.

Suggested fix
 eval "$(bash "${SCRIPT_DIR}/resolve-console-image.sh")" || true
+CONSOLE_IMAGE="${CONSOLE_IMAGE:-quay.io/openshift/origin-console:latest}"
 CONSOLE_PORT=${CONSOLE_PORT:-9000}
-: "${CONSOLE_IMAGE:?CONSOLE_IMAGE must be set or resolved before starting the console}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/start-console.sh` around lines 94 - 99, The
resolve-console-image.sh script call suppresses errors with || true, but there
is no actual fallback value set for CONSOLE_IMAGE when that script fails,
causing the parameter expansion check on line 98 to hard-fail. This contradicts
the documented fallback behavior in the block comment that mentions falling back
to :latest on failure. Add a fallback assignment for CONSOLE_IMAGE (such as
setting it to a base image name with the :latest tag) that will be used when the
resolver script does not successfully set the variable, ensuring the fallback
behavior matches the documented intent.
ci-scripts/check-cluster-health.sh-88-97 (1)

88-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ARC listener check can false-pass on unrelated running pods.

Lines 89-92 count all Running pods in ARC_RUNNERS_NS; any unrelated pod can satisfy this check even if the listener is unhealthy, weakening the E2E gate.

Suggested fix
 check "ARC listener pod in ${ARC_RUNNERS_NS}" bash -c "
-  running=\$(oc get pods -n '${ARC_RUNNERS_NS}' --no-headers 2>/dev/null | grep -c 'Running')
+  running=\$(oc get pods -n '${ARC_RUNNERS_NS}' --no-headers 2>/dev/null | awk '\$1 ~ /listener/ && \$3 == \"Running\" {c++} END {print c+0}')
   if [[ \"\${running}\" -ge 1 ]]; then
     echo \"  \${running} Running pod(s) (listener/controller)\"
     exit 0
   else
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/check-cluster-health.sh` around lines 88 - 97, The ARC listener
pod check in the check function is counting all running pods in the
ARC_RUNNERS_NS namespace, which can false-pass if any unrelated pod is running
even when the listener pod itself is down. Replace the generic grep pattern that
counts all "Running" pods with a more specific filter that only matches the
actual ARC listener or controller pod by name pattern (such as grepping for a
pod name prefix like "arc-runner-set" or the actual listener pod naming
convention). This ensures the check specifically validates the listener pod is
running, not just any pod in that namespace.
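
The suggested `awk` filter can be tried against sample `oc get pods --no-headers` output. The pod names below are hypothetical, and the sketch assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Count pods whose name contains "listener" and whose STATUS column is Running.
# Input: `oc get pods --no-headers` output on stdin.
count_running_listeners() {
  awk '$1 ~ /listener/ && $3 == "Running" { c++ } END { print c + 0 }'
}

# Hypothetical pod names, for illustration only.
sample_pods='arc-runner-set-listener   1/1   Running            0   5m
unrelated-build-pod       1/1   Running            0   2m
old-listener              0/1   CrashLoopBackOff   7   1h'

printf '%s\n' "${sample_pods}" | count_running_listeners
```

Only the first row counts: the second pod is Running but is not a listener, and the third is a listener that is not Running. If the real listener pods use a different naming convention, the `/listener/` pattern needs to change accordingly.
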
ci-scripts/resolve-console-image.sh-15-18 (1)

15-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve :latest fallback when ClusterVersion is unavailable.

Lines 15-18 hard-fail, but the controller-side resolver falls back to CONSOLE_IMAGE_REGISTRY:latest. This contract mismatch can fail console startup during transient cluster-version lookup errors.

Suggested fix
 if [[ -z "${VERSION}" ]]; then
-  echo "::error::Could not read .status.desired.version from ClusterVersion 'version'. Is this an OpenShift cluster and is oc authenticated?" >&2
-  exit 1
+  CONSOLE_IMAGE="${CONSOLE_IMAGE_REGISTRY}:latest"
+  echo "::warning::Could not read .status.desired.version from ClusterVersion 'version'. Falling back to ${CONSOLE_IMAGE}." >&2
+  echo "CONSOLE_IMAGE=${CONSOLE_IMAGE}"
+  exit 0
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/resolve-console-image.sh` around lines 15 - 18, The script
hard-fails when the VERSION variable cannot be read from ClusterVersion, but the
controller-side resolver has a fallback mechanism to use
CONSOLE_IMAGE_REGISTRY:latest. Instead of exiting with an error in the if block
that checks for empty VERSION, modify the script to allow graceful fallback
behavior. Either set VERSION to a default value (like latest) when the
ClusterVersion lookup fails, or remove the hard exit and allow the script to
continue with a fallback, ensuring the script's behavior is consistent with the
controller's contract of using :latest as a fallback during transient
cluster-version lookup errors.
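
The fallback contract described above could be sketched as a small function. This is an illustration under assumptions: `resolve_console_image` is a hypothetical name, the registry value is a placeholder, and the `registry:version` tag format for a resolved version is assumed and not confirmed by the script under review.

```shell
# Hypothetical helper: print the console image for a given version,
# falling back to :latest when the version could not be read.
#   $1 = version (may be empty), $2 = image registry/repository
resolve_console_image() {
  rci_version="$1"
  rci_registry="$2"
  if [ -z "$rci_version" ]; then
    # Warn on stderr so stdout carries only the image reference.
    echo "::warning::ClusterVersion unavailable, falling back to ${rci_registry}:latest" >&2
    echo "${rci_registry}:latest"
  else
    echo "${rci_registry}:${rci_version}"
  fi
}

# Placeholder registry; the real value comes from CONSOLE_IMAGE_REGISTRY.
resolve_console_image "" "registry.example/console"
resolve_console_image "4.20.8" "registry.example/console"
```

Keeping the warning on stderr means a caller can capture the image with command substitution without also capturing the log message.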

kubevirt-plugin-image: ${{ env.KUBEVIRT_PLUGIN_IMAGE }}
steps:
- name: Checkout
uses: actions/checkout@v6


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/hot-cluster-e2e-run.yml | sed -n '170,180p; 185,195p; 230,240p; 290,300p; 328,338p'

Repository: kubevirt-ui/kubevirt-plugin

Length of output: 2332


🌐 Web query:

GitHub Actions security best practice commit SHA pinning vs version tags supply chain

💡 Result:

Pinning GitHub Actions to a full-length commit SHA is a primary security best practice to prevent supply chain attacks, as it ensures the action references an immutable version of the code [1][2][3].

Comparison of Pinning Methods
- Commit SHA Pinning (Recommended): Provides an immutable reference. Even if an attacker gains access to the action's repository and updates an existing tag (a "tag retargeting" attack), a workflow pinned to a specific SHA will continue to execute the original, trusted code [3][4][5].
- Version Tags (e.g., @v1, @v1.2): These are mutable references. If an action's maintainer account is compromised, attackers can move or delete these tags to point to malicious code, which will then be automatically used by any workflow referencing those tags [1][4][6]. While more convenient, tags should only be used if you fully trust the creator and are aware of the inherent risks [1][7].

Supply Chain Security Nuance
While SHA pinning protects against tag manipulation, it is not a complete solution:
- Immutable but not invulnerable: Pinning only ensures the referenced commit remains unchanged. If the action code itself is malicious (e.g., an attacker submits a malicious PR that is merged and then pinned), pinning will not prevent the execution of that code [4].
- Runtime behaviors: Some actions, even when pinned to a SHA, may fetch additional dependencies or data at runtime (e.g., via curl, npm, or docker pull) that are unversioned or dynamic (like latest), effectively bypassing the protection of the initial SHA pin [5][8].
- Maintenance: Pinned SHAs require manual or automated updates to receive bug fixes and security patches. Tools such as Renovate, pinact, ghavm, or pinprick are commonly used to automate this maintenance by keeping pinned SHAs up to date with the latest releases [4][5][8][9][10][6].

Policy and Governance
GitHub supports enforcing SHA pinning at the repository and organization level [1][2][5]. When enabled, this policy causes any workflow that attempts to reference an action without a full-length commit SHA to fail, ensuring consistent adherence to security standards across an entire organization [2][5][6].

Best Practice Recommendation
To maximize security:
1. Pin all third-party actions to full-length commit SHAs [5][6].
2. Maintain human-readable context by including the original version tag as a comment in the workflow file (e.g., uses: actions/checkout@sha # v4.0.0) [11][6].
3. Automate the updates of these pins using tools like Renovate or Dependabot to ensure you continue to receive security updates [4][6].
4. Audit actions for "runtime fetch" patterns that might bypass static pinning [5][8].

Citations:


Pin all external actions to full-length commit SHAs to prevent supply chain attacks.

actions/checkout@v6, docker/metadata-action@v6, and actions/upload-artifact@v6 use mutable version tags instead of immutable commit SHAs. If an action maintainer's account is compromised, attackers can retarget version tags to point to malicious code, automatically executing the compromised version in all workflows using those tags. Replace with commit SHAs (e.g., actions/checkout@abc1234def567890), optionally with the version tag as a comment for clarity. Consider automating updates using Renovate or Dependabot.

Applies to lines: 176, 190, 236, 297, 333

🧰 Tools
🪛 zizmor (1.25.2)

[warning] 175-176: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 176-176: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/hot-cluster-e2e-run.yml at line 176, Replace all mutable
action version tags with their corresponding full-length commit SHAs to prevent
supply chain attacks. In the workflow file, locate all instances of
`actions/checkout@v6`, `docker/metadata-action@v6`, and
`actions/upload-artifact@v6` (found at lines 176, 190, 236, 297, and 333) and
replace the `@v6` or version tag portion with the full commit SHA for each
action. You can optionally include the version tag as a comment for clarity
(e.g., `actions/checkout@abc1234def567890 # v6`). Verify each commit SHA is
correct by checking the official repository for each action.

Source: Linters/SAST tools
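
As a rough local check before reaching for zizmor, unpinned references can be listed with two `grep` passes. This is a sketch, not a replacement for the linter: `find_unpinned_actions` is a hypothetical helper, the sample workflow text and the all-hex SHA in it are made up, and edge cases such as `docker://` references are not handled.

```shell
# Read workflow YAML on stdin and print `uses:` lines whose ref after "@"
# is not a 40-character lowercase hex commit SHA.
find_unpinned_actions() {
  grep -E '^[[:space:]]*(-[[:space:]]*)?uses:[[:space:]]*[^[:space:]]+@' \
    | grep -Ev '@[0-9a-f]{40}([[:space:]]|$)'
}

# Hypothetical workflow fragment; the SHA is a placeholder, not a real commit.
sample_workflow='steps:
  - name: Checkout
    uses: actions/checkout@v6
  - name: Upload
    uses: actions/upload-artifact@0123456789abcdef0123456789abcdef01234567 # v6
  - name: Local action
    uses: ./.github/actions/request-env'

printf '%s\n' "$sample_workflow" | find_unpinned_actions
```

Local actions (`./...`) have no `@ref` and are skipped by the first pattern, so only the tag-pinned checkout line is reported for this sample.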

Comment thread ci-scripts/README.md Outdated
#
# Start the "off cluster" console. Based on the `route-console.sh` and `start-console.sh` scripts.
#
set -euox pipefail


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

set -x exposes the bearer token before masking.

Line 5 enables xtrace, so the token assignment at Line 81 is logged before Line 86 applies masking. This leaks cluster credentials to CI logs.

Suggested fix
-set -euox pipefail
+set -euo pipefail
+# Optional debug:
+# [[ "${DEBUG:-false}" == "true" ]] && set -x

Also applies to: 81-87, 134-149

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-scripts/start-console.sh` at line 5, The xtrace flag enabled by the `-x`
option in the set command on line 5 causes all commands and their arguments to
be printed to output before execution. This exposes the bearer token assignment
at line 81 to CI logs before the masking logic at line 86 can hide it. Remove
the `-x` flag from the set command so xtrace is disabled and the token
assignment will not be logged to the output before masking is applied.
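
The leak is easy to reproduce in isolation. The sketch below runs the same one-line script with and without xtrace and captures what would reach the CI log; `dummy-token` is a placeholder value and the one-line script is a stand-in for the real token assignment.

```shell
# Stand-in for the token assignment in the script under review.
script='TOKEN=dummy-token; :'

# With xtrace, the shell prints the assignment (value included) to stderr.
traced_log=$(sh -c "set -x; ${script}" 2>&1)
# Without xtrace, nothing is printed.
quiet_log=$(sh -c "${script}" 2>&1)

echo "with -x:    ${traced_log}"
echo "without -x: ${quiet_log}"
```

Because the trace line is emitted when the assignment runs, masking applied on a later line comes too late for that output, which is why the suggested fix drops `-x` and makes tracing opt-in.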

Comment thread cypress/tests/poc-gating.cy.ts Outdated
galkremer1 added a commit to galkremer1/kubevirt-plugin that referenced this pull request Jun 17, 2026
Manual-only setup/teardown workflows plus ci-scripts so cluster
provisioning can run from main before the full hot-cluster PR (kubevirt-ui#4099).
Does not change existing CI triggers or behavior.
galkremer1 added a commit to galkremer1/kubevirt-plugin that referenced this pull request Jun 18, 2026
Manual-only setup/teardown workflows plus ci-scripts so cluster
provisioning can run from main before the full hot-cluster PR (kubevirt-ui#4099).
Does not change existing CI triggers or behavior.
sjd78 and others added 13 commits June 23, 2026 16:24
Establish a basic GitHub action based CI workflow
for running e2e tests on a hot-cluster. POC workflows
include setting up IBM Cloud ROKS clusters.

Full documentation is in the `POC_HOT_CLUSTER_CI.md`
and referenced files.

The hot-cluster can be any OpenShift cluster (FIPS and non-FIPS) and
is based on the GitHub Actions runner controller (ARC). It is installed
via helm, and only requires network access to pull from GitHub within
the cluster itself. The POC can even be run on a local CRC / OpenShift
local development cluster without any special networking configuration.

The self-hosted runners are ephemeral and are deleted
after the CI job is complete. Full setup documentation is in the
`ci-scripts/README.md` file.

Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
`ci-env` controller:
  - manage CI test environments using ConfigMaps
  - GitHub actions to request and release the CI test environment
    via ConfigMaps

`ci-test-stack`:
  - helm chart to configure and deploy the console and plugin
  - console and plugin are deployed into the test environment namespace
  - access from the workflow is via the service endpoints
  - external route is available for debugging

test2 workflow updates to use the GitHub actions

Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
…e controller sh

Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Correct "POC Hot ClusterE2E CI Test" to "POC Hot Cluster E2E CI Test"
(missing space before E2E) in both the workflow-to-name table and the
Variant A usage steps. Matches the actual name: field in
.github/workflows/poc-e2e-ci-test.yml.

Made-with: Cursor
Remove the sequential dependency between check-runner and
build-kubevirt-plugin-image. The diagnostics job produces no outputs
consumed by the build, so there is no functional coupling. Gate
run-gating-tests on both jobs so it still waits for runner validation
before provisioning the test environment.

This saves the full duration of check-runner (~1-2 min) from the
critical path on every run.

Made-with: Cursor
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
Signed-off-by: Scott J Dickerson <sdickers@redhat.com>
    - Remove duplicate poc-e2e-ci-test.yml; keep hot-cluster-e2e.yml as entry
    - Rename poc-e2e-ci-test2.yml to hot-cluster-e2e-run.yml
    - Delete POC_HOT_CLUSTER_CI.md; move backlog to docs/HOT_CLUSTER_FUTURE_WORK.md
    - Add docs/HOT_CLUSTER_CI_CONTINUATION.md and docs/CLUSTER_LIFECYCLE.md
    - Use IC_KEY secret, secure oc install, Playwright E2E updates
    - Update ci-scripts/README and auto-teardown workflow references
galkremer1 and others added 10 commits June 24, 2026 12:48
Instead of failing when cos_instance_crn is empty, look for an existing
COS instance named <cluster_name>-cos, or create one automatically.
This makes the VPC path work out-of-the-box without pre-setup.

Co-authored-by: Cursor <cursoragent@cursor.com>
ibmcloud resource service-instance-create does not support --output json.
Create first, then query the CRN with service-instances --output json.

Co-authored-by: Cursor <cursoragent@cursor.com>
The 'standard' plan no longer works; use 'premium-global-deployment'
which is the first option shown by the CLI.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use the plan ID (744bfc56-d12c-4866-88d5-dac9139e0e5d) instead of the
plan name to avoid interactive prompts or name resolution issues.

Co-authored-by: Cursor <cursoragent@cursor.com>
The standard plan ID resolves to two deployments (premium-global-deployment,
premium-global-deployment-iam). Use -d premium-global-deployment to select
non-interactively.

Co-authored-by: Cursor <cursoragent@cursor.com>
Probes VPC, COS, DNS Services, CIS, IAM Identity, resource groups,
and authorization policies to confirm readiness for OpenShift IPI
on IBM Cloud VPC. Run with INFRASTRUCTURE_TYPE=ipi.

Co-authored-by: Cursor <cursoragent@cursor.com>
Select infrastructure_type=ipi to run IPI prerequisite checks
(VPC, COS, DNS, IAM Identity, resource groups, auth policies)
without creating any resources. All common steps are skipped.

Co-authored-by: Cursor <cursoragent@cursor.com>
When infrastructure_type=ipi, downloads openshift-install 4.20.8,
generates install-config.yaml for IBM Cloud VPC, and attempts
cluster creation. Includes automatic cleanup on failure and artifact
upload for debugging. Expects OPENSHIFT_PULL_SECRET secret.

Co-authored-by: Cursor <cursoragent@cursor.com>
openshift-install uses hyphen-separated VPC instance profiles (bx2-4x16)
while ibmcloud oc uses dot-separated (bx2.4x16). Convert automatically.

Co-authored-by: Cursor <cursoragent@cursor.com>
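
The profile-name conversion described in the commit above can be sketched as a one-line filter. `to_installer_profile` is a hypothetical name, and the sketch assumes the only difference between the two formats is the separator, as in the `bx2.4x16` / `bx2-4x16` example.

```shell
# Convert an `ibmcloud oc` style profile (dot-separated) to the
# hyphen-separated form that openshift-install expects.
to_installer_profile() {
  printf '%s\n' "$1" | tr '.' '-'
}

to_installer_profile "bx2.4x16"
```

A name that is already hyphen-separated passes through unchanged, so the conversion is safe to apply unconditionally.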
Summarizes the Jun 24 testing session: three provisioning paths
(classic, VPC ROKS, IPI), what works, what's blocked, and exact
next steps for tomorrow after DNS/COS authorization is configured.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

Contributor

⚠️ CI Failure Escalation

Automated regex triage could not classify this failure.

@coderabbitai Please analyze the build logs linked below.
If you determine the root cause is an infrastructure issue, network timeout,
or environment setup flake, please reply with exactly /retest on a new line.
Otherwise, explain the code failure.

Build Logs

@coderabbitai

coderabbitai Bot commented Jun 24, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the build log - first try with curl to get a portion of it
curl -s --max-time 30 "https://storage.googleapis.com/test-platform-results/pr-logs/pull/kubevirt-ui_kubevirt-plugin/4099/pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws/2069836692671434752/build-log.txt" | tail -300

Length of output: 15132



The failure is an infrastructure/environment flake, not a code issue introduced by this PR.

Root cause: After successfully provisioning a fresh AWS cluster (~57 min IPI install) and creating the HCO catalog source + subscription, all nine HCO-managed operator deployments timed out simultaneously:

timed out waiting for the condition on deployments/aaq-operator
timed out waiting for the condition on deployments/cdi-operator
timed out waiting for the condition on deployments/cluster-network-addons-operator
timed out waiting for the condition on deployments/hco-operator
timed out waiting for the condition on deployments/hco-webhook
timed out waiting for the condition on deployments/hostpath-provisioner-operator
timed out waiting for the condition on deployments/hyperconverged-cluster-cli-download
timed out waiting for the condition on deployments/ssp-operator
timed out waiting for the condition on deployments/virt-operator

This is a well-known OLM/HCO rollout flake in Prow's AWS CI — the operators failed to become available within the wait window on a fresh cluster. The PR's changes are entirely GitHub Actions workflows, Helm charts, and CI scripts; none touch HCO operator deployment logic.

/retest

galkremer1 and others added 6 commits June 24, 2026 15:34
Domain is registered with IBM Cloud CIS nameservers.
IPI will create api.kubevirt-plugin-ci.cnv-ui.com and
*.apps.kubevirt-plugin-ci.cnv-ui.com DNS records.

Co-authored-by: Cursor <cursoragent@cursor.com>
The IC_KEY identity cannot create resource groups. Tell the installer
to use the existing cnv-ui group instead of creating a new one.

Co-authored-by: Cursor <cursoragent@cursor.com>
Setup workflow:
- IPI now leaves cluster running on success (cleanup only on failure)
- All bootstrap steps (HCO, ARC, ci-env) run for all infra types
- IPI uses installer kubeconfig instead of ibmcloud oc
- Fixed install-config.yaml heredoc indentation
- Added credentialsMode: Manual + ccoctl manifest generation
- OpenShift version configurable for IPI (stable-4.XX channel)
- Default infrastructure_type changed to ipi

Teardown workflow:
- Added infrastructure_type input (ipi/vpc/classic)
- IPI teardown downloads install state and runs openshift-install destroy

Auto-teardown:
- Detects IPI clusters via DNS probe
- Added business_hours_only option (Sun-Thu 8-18 Israel)
- Idle threshold shown in summary

Cleanup:
- Deleted POC files and status doc
- Consolidated README with IPI docs

Co-authored-by: Cursor <cursoragent@cursor.com>
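
The `business_hours_only` window (Sun-Thu 8-18 Israel) could be checked with a helper along these lines. This is a sketch under assumptions: `is_business_hours` is a hypothetical name, the window is treated as 08:00 inclusive to 18:00 exclusive, and `Asia/Jerusalem` is assumed to be the intended time zone.

```shell
# Return 0 when the given ISO weekday (1=Mon .. 7=Sun) and hour (0-23)
# fall inside Sunday-Thursday, 08:00-17:59.
is_business_hours() {
  bh_dow="$1"
  bh_hour="${2#0}"   # strip a leading zero such as "08"
  case "$bh_dow" in
    7|1|2|3|4) ;;    # Sunday through Thursday
    *) return 1 ;;
  esac
  [ "$bh_hour" -ge 8 ] && [ "$bh_hour" -lt 18 ]
}

# Example call with the current time in the assumed time zone.
if is_business_hours "$(TZ=Asia/Jerusalem date +%u)" "$(TZ=Asia/Jerusalem date +%H)"; then
  echo "inside business hours"
else
  echo "outside business hours"
fi
```

Taking the weekday and hour as arguments keeps the decision testable without depending on the wall clock.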
Co-authored-by: Cursor <cursoragent@cursor.com>
Move install-config.yaml content to a standalone template file
(ci-scripts/ipi-install-config.yaml.tpl) and use envsubst to
substitute variables. Fixes the workflow YAML validation error
that prevented workflow_dispatch from working.

Co-authored-by: Cursor <cursoragent@cursor.com>
oc adm release extract needs the oc binary which was not available
during the CCO manifest step. Download oc, kubectl, and ccoctl
alongside openshift-install so credentials can be properly generated.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions

Contributor

⚠️ CI Failure Escalation

Automated regex triage could not classify this failure.

@coderabbitai Please analyze the build logs linked below.
If you determine the root cause is an infrastructure issue, network timeout,
or environment setup flake, please reply with exactly /retest on a new line.
Otherwise, explain the code failure.

Build Logs

@coderabbitai

coderabbitai Bot commented Jun 24, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

url="https://storage.googleapis.com/test-platform-results/pr-logs/pull/kubevirt-ui_kubevirt-plugin/4099/pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws/2069892807979438080/build-log.txt"

echo "=== LAST 250 LINES ==="
curl -fsSL --max-time 60 "$url" | tail -250

echo
echo "=== FAILURE KEYWORDS ==="
curl -fsSL --max-time 60 "$url" \
  | rg -n -C3 'level=fatal|ERROR|Error:|FAIL|failed|timed out|context deadline exceeded|connection reset|EOF|No such host|Temporary failure|i/o timeout|dial tcp|TLS handshake timeout|panic:' || true

Length of output: 23624



@github-actions[bot]

This does not look like an infrastructure or environment flake.

The job failed because a Playwright assertion never became true in the test flow for creating an example template from YAML:

  • Failing test: playwright/tests/gating/check-tab-yaml.spec.ts:200
  • Failing assertion: playwright/pages/ResourceListPage.ts:68
  • Failure:
    • expect(locator).toBeVisible() failed
    • Locator: getByText('name: example')
    • Error: element(s) not found

The failure reproduced on the retry as well, which points to a real code/test issue rather than a transient cluster problem.

There was an earlier temporary timeout while waiting for kubevirt-console-plugin pods to roll out, but the rollout later recovered and the suite continued. The terminal failure is the missing YAML content/UI state during TemplatesPage.createFromYAML(), not cluster provisioning or networking.
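
The distinction drawn in this analysis (a terminal assertion failure outranks an earlier, recovered timeout) could be encoded in a triage rule like the one below. This is a hypothetical sketch, not the repository's actual regex triage: `classify_log` and its patterns are assumptions chosen to match the log lines quoted in this thread.

```shell
# Read a build log on stdin and print "code", "infra", or "unknown".
# Assertion failures are checked first so that a recovered timeout earlier
# in the log does not mask a real test failure.
classify_log() {
  cl_log=$(cat)
  if printf '%s\n' "$cl_log" | grep -Eq 'expect\(.*\)\..* failed|element\(s\) not found'; then
    echo "code"
  elif printf '%s\n' "$cl_log" | grep -Eq 'timed out waiting for the condition|i/o timeout|TLS handshake timeout'; then
    echo "infra"
  else
    echo "unknown"
  fi
}

printf '%s\n' \
  'timed out waiting for the condition on deployments/kubevirt-console-plugin' \
  'expect(locator).toBeVisible() failed' \
  | classify_log
```

For the two-line sample the rule reports `code`, mirroring the conclusion above that the missing `name: example` element, not the rollout timeout, is the terminal failure.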

galkremer1 and others added 5 commits June 24, 2026 19:51
Co-authored-by: Cursor <cursoragent@cursor.com>
Add a pre-step that finds and deletes VMs, load balancers, and DNS
records from previous IPI installs with the same cluster name prefix.
Also adds a one-off IPI destroy workflow for manual cleanup.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ter name

The previous cleanup deleted all VMs/LBs matching the cluster name prefix,
which would prevent parallel clusters. Now only cleans stale DNS records
(the actual blocker for re-installs) and scopes to the exact cluster name.
VMs and LBs are managed by openshift-install destroy, not brute-force.

Co-authored-by: Cursor <cursoragent@cursor.com>
…le it

The IC_KEY doesn't have iam.policy.create permission so ccoctl fails.
Remove credentialsMode: Manual and the CCO manifests step entirely.
Let openshift-install create cluster handle credentials using the
API key directly. Also fixes the auth/ dir missing issue (create
manifests was consuming install-config before create cluster).

Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Contributor

@galkremer1: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/kubevirt-e2e-aws | 9d29c5f | link | true | /test kubevirt-e2e-aws |


@github-actions

Contributor

⚠️ CI Failure Escalation

Automated regex triage could not classify this failure.

@coderabbitai Please analyze the build logs linked below.
If you determine the root cause is an infrastructure issue, network timeout,
or environment setup flake, please reply with exactly /retest on a new line.
Otherwise, explain the code failure.

Build Logs

@coderabbitai

coderabbitai Bot commented Jun 25, 2026


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

url="https://storage.googleapis.com/test-platform-results/pr-logs/pull/kubevirt-ui_kubevirt-plugin/4099/pull-ci-kubevirt-ui-kubevirt-plugin-main-kubevirt-e2e-aws/2069933323295133696/build-log.txt"

echo "=== LAST 300 LINES ==="
curl -fsSL --max-time 90 "$url" | tail -300

echo
echo "=== FAILURE KEYWORDS (with context) ==="
curl -fsSL --max-time 90 "$url" \
  | rg -n -C4 'level=fatal|ERROR|Error:|FAIL|failed|timed out|context deadline exceeded|connection reset|EOF|No such host|Temporary failure|i/o timeout|dial tcp|TLS handshake timeout|panic:|assertion|expect\(' || true

Length of output: 45429



@github-actions[bot]

/retest

galkremer1 and others added 2 commits June 24, 2026 21:49
IBM Cloud IPI requires credentialsMode: Manual but ccoctl needs
iam.policy.create which we don't have. Instead, generate the
ibm-cloud-credentials secret manifests directly using the API key
for all required namespaces (CCM, machine-api, image-registry,
ingress, CSI). This fixes the CCM FailedMount error that prevented
node initialization and caused the entire cluster to be broken.

Co-authored-by: Cursor <cursoragent@cursor.com>
The cleanup step was positioned before the common bootstrap steps,
so it only ran when IPI-specific steps failed. Now it runs at the
end of the job and destroys the cluster if ANY step failed (HCO,
ARC, health check, etc.). No more orphaned clusters from partial
setup failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
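
The same "clean up if any step failed" idea can be shown in a single script with an exit trap. This is a shell analogue for illustration only, not the workflow's actual implementation: `run_setup` is a hypothetical wrapper, the steps are stand-in commands, and the cleanup message is a placeholder for the real `openshift-install destroy` call.

```shell
# Run each step in order inside a subshell; if any step fails, the EXIT
# trap sees a non-zero status and performs the cleanup.
run_setup() {
  (
    on_exit() {
      rs_status=$?
      if [ "$rs_status" -ne 0 ]; then
        echo "cleanup: destroying cluster (placeholder for openshift-install destroy)"
      fi
    }
    trap on_exit EXIT
    for rs_step in "$@"; do
      "$rs_step" || exit 1
    done
  )
}

# Stand-in steps: `true` succeeds, `false` simulates a failing bootstrap step.
run_setup true false true || echo "setup failed"
run_setup true true && echo "setup succeeded, cluster left running"
```

Registering the trap once, before any step runs, is what makes the cleanup independent of which step fails.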

Labels

approved This issue is something we want to fix jira/valid-reference

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants