test: add E2E uninstall test for MaaS infrastructure teardown [RHOAIENG-62678] #911

Open
jira-autofix[bot] wants to merge 9 commits into main from jira-autofix/RHOAIENG-62678

jira-autofix Bot commented May 15, 2026

Add automated E2E test that verifies deleting the MaaS Config CR and
parent operator top-level CRs (DataScienceCluster, DSCInitialization)
fully removes all MaaS-owned infrastructure. This prevents orphaned
controllers, workloads, routes, and namespaced CRs from surviving
uninstall, which complicates upgrades, reuse, and compliance reviews.

The test executes an ordered delete sequence (Config → DSC → DSCI),
then asserts within a bounded timeout that no MaaS CRD instances,
workloads, or HTTPRoutes remain. On failure it dumps all remaining
resources for debugging. The test runs as Phase 4 in the Prow smoke
pipeline, after all functional E2E tests and deployment validation.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>

Summary by CodeRabbit

  • Tests
    • Added an end-to-end uninstall test that performs a destructive teardown and verifies no remaining MaaS CRs, controller/api workloads, routes, or subscriptions; emits a diagnostic dump if residual resources remain.
  • Chores
    • CI smoke flow now includes a final destructive "Uninstall" phase after deployment validation (skippable); prepares the test environment, runs the uninstall test, and writes test artifacts to the CI artifact directory.

openshift-ci-robot commented May 15, 2026

@jira-autofix[bot]: This pull request references RHOAIENG-62678 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "rhoai-3.5" instead.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci Bot requested review from SB159 and somya-bhatnagar May 15, 2026 14:42

openshift-ci Bot commented May 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jira-autofix[bot]
Once this PR has been reviewed and has the lgtm label, please assign jland-redhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot commented May 15, 2026

Hi @jira-autofix[bot]. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai Bot commented May 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a destructive end-to-end uninstall pytest module that deletes finalizer-bearing MaaS CRs, MaaS Config, DataScienceCluster, and DSCInitialization, then polls for garbage collection. Implements oc subprocess helpers, resource collection/formatting utilities, an autouse fixture orchestrating uninstall, multiple assertions that no MaaS CRs/workloads/HTTPRoutes/subscription CRs remain, and a diagnostic dump on residuals. Adds run_uninstall_test() to the Prow smoke script to prepare the e2e venv, run pytest for test_uninstall.py, produce HTML/JUnit artifacts, and a Phase 4 CI control flow that conditionally skips or runs the destructive test.
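The "delete, then poll for garbage collection" step in the walkthrough can be sketched roughly as below. This is a minimal illustration, not the code in test_uninstall.py: `wait_until_gone` and its `list_remaining` callable are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without a cluster.

```python
import time
from typing import Callable, List


def wait_until_gone(
    list_remaining: Callable[[], List[str]],
    timeout: float,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until list_remaining() returns nothing or the timeout elapses.

    Returns the last observed residuals (an empty list means success).
    """
    deadline = clock() + timeout
    remaining = list_remaining()
    while remaining and clock() < deadline:
        sleep(interval)
        remaining = list_remaining()
    return remaining
```

Returning the residual list rather than a boolean lets the caller put the leftover names straight into the assertion message or a diagnostic dump.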

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security & Operational Notes

  • CWE-200: Diagnostic dump may expose cluster resource contents in CI artifacts. Restrict artifact access and sanitize outputs before publishing.
  • CWE-78: _oc uses subprocess; ensure no user-controlled input is concatenated into command arguments. Use fixed argument lists and validate any env-derived identifiers.
  • CWE-703: _delete_resource currently logs non-zero deletes instead of failing; convert to explicit retries/failures or escalate to avoid silent teardown gaps.
  • CWE-755 / Operational fragility: Tests delete finalizer-bearing CRs while controller is live; confirm safe ordering and consider using graceful controller quiesce or explicit finalizer removal with documented justification.
  • Configuration: Expose UNINSTALL_TIMEOUT and POLL_INTERVAL via environment variables to accommodate slow clusters and reduce flakiness.
  • Heuristic detection: HTTPRoute identification uses name/parentRefs heuristics—prefer ownerReferences or explicit labels to reduce false positives/negatives.
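The last note, preferring ownerReferences or explicit labels over name heuristics, might look like the sketch below. The owner kinds and the label value are assumptions made for illustration; the PR does not state which owners or labels actually mark a MaaS-managed HTTPRoute. Only `metadata.ownerReferences` and `metadata.labels` are standard Kubernetes object fields.

```python
from typing import Any, Dict, Iterable, Tuple

# Assumed values for illustration only (not taken from the PR).
ASSUMED_OWNER_KINDS = ("Tenant", "Config")
ASSUMED_LABEL: Tuple[str, str] = ("app.kubernetes.io/part-of", "maas")


def is_maas_owned(
    resource: Dict[str, Any],
    owner_kinds: Iterable[str] = ASSUMED_OWNER_KINDS,
    label: Tuple[str, str] = ASSUMED_LABEL,
) -> bool:
    """Decide ownership from ownerReferences first, then from a label."""
    meta = resource.get("metadata", {})
    kinds = set(owner_kinds)
    owners = meta.get("ownerReferences") or []
    if any(ref.get("kind") in kinds for ref in owners):
        return True
    key, value = label
    return (meta.get("labels") or {}).get(key) == value
```

A check like this ignores the resource name entirely, so an unrelated route that happens to contain "maas" in its name would no longer be counted as a residual.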
🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the primary change: adding an E2E uninstall test for MaaS infrastructure teardown. It is concise, specific, and clearly conveys the main objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 906-913: The script always invokes run_uninstall_test (the
destructive teardown) regardless of SKIP_DEPLOYMENT, which is unsafe for shared
clusters; update the logic around the Phase 4 block (print_header "Running
Uninstall E2E Test" and run_uninstall_test) to check SKIP_DEPLOYMENT (or an
equivalent gate variable) and skip calling run_uninstall_test when
SKIP_DEPLOYMENT=true, emitting a clear informational message instead; ensure the
check uses the same variable used earlier in the script and preserves existing
behavior when SKIP_DEPLOYMENT is unset/false so only non-deployment runs avoid
the destructive teardown.

In `@test/e2e/tests/test_uninstall.py`:
- Around line 67-73: The helper _oc currently calls subprocess.run with a
relative "oc" which trusts PATH (CWE-426); change it to resolve and use an
absolute oc binary path before invoking subprocess.run (e.g. use
shutil.which("oc") to find the full path or read a configured OC_BINARY env var,
fail fast if not found) and pass that absolute path in place of "oc" in the argv
list inside _oc so the test teardown cannot be hijacked by a poisoned PATH.
- Around line 98-108: The _list_resources() function silently turns unexpected
oc get failures and JSON parse errors into empty lists; change it to only return
[] for the known, harmless cases ("the server doesn't have a resource type" and
"not found") and otherwise raise an exception (e.g., RuntimeError or re-raise
subprocess.CalledProcessError) including result.stderr and result.returncode so
test failures are visible; likewise, on JSONDecodeError do not return [] but
raise an error with the stdout content and the failed command context. Ensure
you update the paths using _list_resources(), so callers expect an exception on
real failures.
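The `_list_resources()` behaviour requested above could be sketched as follows. This is an illustration of the suggested error handling, not the PR's implementation: `list_resources` and the injected `run` callable are hypothetical, and the two harmless stderr fragments are the ones quoted in the review comment.

```python
import json
import subprocess
from typing import Any, Callable, Dict, List

# stderr fragments that mean "nothing to list" (quoted from the review comment).
HARMLESS_ERRORS = ("the server doesn't have a resource type", "not found")


def list_resources(
    kind: str,
    run: Callable[[List[str]], subprocess.CompletedProcess],
) -> List[Dict[str, Any]]:
    """Return the items for `kind`, raising on any unexpected failure."""
    args = ["get", kind, "--all-namespaces", "-o", "json"]
    result = run(args)
    if result.returncode != 0:
        stderr = result.stderr or ""
        if any(fragment in stderr for fragment in HARMLESS_ERRORS):
            return []
        raise RuntimeError(
            f"oc {' '.join(args)} failed (rc={result.returncode}): {stderr.strip()}"
        )
    try:
        return json.loads(result.stdout).get("items", [])
    except json.JSONDecodeError as exc:
        raise RuntimeError(
            f"oc {' '.join(args)} returned invalid JSON: {result.stdout!r}"
        ) from exc
```

In a real test, `run` would wrap `subprocess.run` with the resolved oc path; injecting it here keeps the error-handling branches checkable without a cluster.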
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9bf053b1-df35-4a96-a29c-428b02dc08b0

📥 Commits

Reviewing files that changed from the base of the PR and between 7f759eb and 60e6d52.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_uninstall.py

Comment thread test/e2e/scripts/prow_run_smoke_test.sh Outdated
Comment thread test/e2e/tests/test_uninstall.py
Comment thread test/e2e/tests/test_uninstall.py Outdated
Gate the Phase 4 uninstall test when SKIP_DEPLOYMENT=true to prevent
destructive teardown on shared/existing clusters. Resolve the oc binary
to an absolute path at import time (CWE-426) to avoid PATH injection.
Make _list_resources() raise on unexpected oc failures and JSON parse
errors instead of silently returning empty lists that mask real issues.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>
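Resolving the oc binary to an absolute path, as this commit describes, could look roughly like the sketch below. The function name `resolve_oc_binary` is hypothetical, and the `OC_BINARY` override is the reviewer's suggestion rather than something the commit message confirms was adopted.

```python
import os
import shutil
from typing import Mapping


def resolve_oc_binary(env: Mapping[str, str] = os.environ) -> str:
    """Return an absolute path to an executable oc, or fail fast."""
    # OC_BINARY is an assumed override; otherwise search the given PATH once.
    candidate = env.get("OC_BINARY") or shutil.which("oc", path=env.get("PATH"))
    if not candidate:
        raise FileNotFoundError("oc binary not found; set OC_BINARY or fix PATH")
    path = os.path.abspath(candidate)
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        raise FileNotFoundError(f"oc is not an executable file: {path}")
    return path
```

Calling this once at import time and reusing the result means later changes to PATH during the test run cannot redirect the teardown commands to a different binary.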

coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Line 918: The final success echo always claims uninstall completed even when
the script took the skip path at lines 911-913; modify the script to track
whether uninstall was skipped (e.g., set a flag like UNINSTALL_SKIPPED when
taking the skip branch in the block around lines 911-913) and replace the
unconditional echo "🎉 Deployment and uninstall tests completed successfully!"
with a conditional that prints "🎉 Deployment tests completed; uninstall
skipped." when UNINSTALL_SKIPPED is true and the original success message only
when uninstall actually ran.
- Around line 595-600: The pytest invocation currently uses the --maxfail=1 flag
which makes pytest stop early and can prevent running
test_diagnostic_dump_on_residual after an earlier uninstall failure; remove the
--maxfail=1 flag (or set it to 0/no flag) in the pytest command in the script so
pytest will run the diagnostic dump test even if previous tests fail, keeping
the rest of the options and the target test file tests/test_uninstall.py
unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c343a36d-4747-4aa1-81c6-9ce21ccf4e50

📥 Commits

Reviewing files that changed from the base of the PR and between 60e6d52 and aa698d1.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_uninstall.py

Comment thread test/e2e/scripts/prow_run_smoke_test.sh
Comment thread test/e2e/scripts/prow_run_smoke_test.sh Outdated
…HOAIENG-62678]

Remove --maxfail=1 from uninstall pytest invocation so the diagnostic
dump test runs even when earlier assertions fail, ensuring complete
failure diagnostics. Also fix the final success message to say "Smoke
test phases completed" instead of "Deployment and uninstall tests
completed" so it remains accurate when the uninstall phase is skipped.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>
jira-autofix Bot changed the title from "RHOAIENG-62678: E2E test: uninstall removes all MaaS infra when Config and parent top-level CRs are deleted" to "test: add E2E uninstall test for MaaS infrastructure teardown [RHOAIENG-62678]" May 15, 2026
openshift-ci-robot commented

@jira-autofix[bot]: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

Details

In response to this:

Summary by CodeRabbit

  • Tests
    • Added an end-to-end uninstall test that verifies complete teardown: no remaining MaaS CRs, workloads, routes, or subscriptions; collects diagnostics on failure.
  • Chores
    • CI smoke flow now includes a final, destructive "Uninstall" phase run after deployment validation (skipped only when explicitly opted out); prepares the test environment, executes the uninstall test, produces test artifacts, and updates the overall success message.

jira-autofix-bot and others added 2 commits May 17, 2026 02:18
…HOAIENG-62678]

The final success echo now conditionally reports whether the uninstall
test ran or was skipped, so CI output accurately reflects what was
executed.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>
… [RHOAIENG-62678]

The uninstall test was failing because it deleted the Config CR first,
which garbage-collected the maas-controller. Without the controller
running, MaaSModelRefs, MaaSAuthPolicies, and MaaSSubscriptions could
not have their finalizers (maas.opendatahub.io/*-cleanup) processed,
leaving them stuck in Terminating state indefinitely.

Fix the delete sequence to remove user-created CRs with finalizers
first (while the controller is still alive), then delete Config, DSC,
and DSCI.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>
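The "stuck in Terminating" condition described in this commit can be detected from object metadata alone. The helper below is a hedged sketch with a hypothetical name; it relies only on the standard `metadata.deletionTimestamp` and `metadata.finalizers` fields, and any concrete finalizer string used with it is an example matching the `maas.opendatahub.io/*-cleanup` pattern, not a confirmed value.

```python
from typing import Any, Dict, List


def pending_finalizers(resource: Dict[str, Any]) -> List[str]:
    """Return the finalizers still blocking deletion.

    An object is only blocked if deletion was requested (deletionTimestamp
    is set) and at least one finalizer has not yet been removed.
    """
    meta = resource.get("metadata", {})
    if not meta.get("deletionTimestamp"):
        return []
    return list(meta.get("finalizers") or [])
```

Including this list in the diagnostic dump would show at a glance whether a residual resource is waiting on a controller that has already been removed.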

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/tests/test_uninstall.py`:
- Line 57: UNINSTALL_TIMEOUT is parsed from env but can be zero or negative
which causes wait loops to skip; after converting to int for UNINSTALL_TIMEOUT
validate that it is a positive integer (>0) and raise a clear ValueError (or
assert) with an explanatory message if the value is <= 0 so misconfiguration
like E2E_UNINSTALL_TIMEOUT=0 fails fast and clearly; update the assignment site
of UNINSTALL_TIMEOUT and any initialization logic that uses it to perform this
check.
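The fail-fast validation requested above might be written along these lines. The helper name and the default of 600 seconds are assumptions for illustration; only the `E2E_UNINSTALL_TIMEOUT` variable name comes from the review comment.

```python
import os
from typing import Mapping


def read_positive_int(
    name: str, default: int, env: Mapping[str, str] = os.environ
) -> int:
    """Read an integer setting and reject zero, negatives, and non-numbers."""
    raw = env.get(name, str(default))
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc
    if value <= 0:
        raise ValueError(f"{name} must be > 0, got {value}")
    return value


# Example use at module import time (600 is an assumed default):
# UNINSTALL_TIMEOUT = read_positive_int("E2E_UNINSTALL_TIMEOUT", 600)
```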
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0df7dfe2-3f03-4695-a511-02d36b1e472b

📥 Commits

Reviewing files that changed from the base of the PR and between 9487147 and 481b8bd.

📒 Files selected for processing (1)
  • test/e2e/tests/test_uninstall.py

Comment thread test/e2e/tests/test_uninstall.py
jira-autofix-bot and others added 3 commits May 17, 2026 07:16
…AIENG-62678]

The LifecycleReconciler recreates Config/default whenever it is deleted
while the maas-controller Deployment is still running. The previous
delete order (Config before DSC/DSCI) caused Config to be immediately
recreated, leaving Config, Tenant, controller Deployment, and HTTPRoutes
as residuals. Reorder the sequence to delete DSC/DSCI first, wait for
the controller Deployment to be removed, then delete Config. Also
validate that E2E_UNINSTALL_TIMEOUT is positive per review feedback.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>
…RHOAIENG-62678]

In CI the controller is deployed via kustomize, not managed by DSC.
Deleting DSC/DSCI does not remove the controller, so the
LifecycleReconciler keeps recreating Config and all MaaS resources
survive uninstall. The fix waits briefly for operator-driven removal
(for operator-managed clusters), then falls back to explicitly
deleting the maas-controller Deployment before deleting Config.

Closes RHOAIENG-62678

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Jamie Land <jland@redhat.com>
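The wait-then-fall-back logic this commit describes can be sketched as below. All names are hypothetical and the two callables stand in for the real oc helpers; the sketch shows only the control flow (brief wait for operator-driven removal, then explicit deletion), not the PR's actual code.

```python
from typing import Callable


def remove_controller(
    wait_for_removal: Callable[[float], bool],
    delete_deployment: Callable[[], None],
    grace_seconds: float,
    timeout_seconds: float,
) -> str:
    """Ensure the controller Deployment is gone before Config is deleted.

    Returns "operator" if the operator removed it during the grace period,
    or "explicit" if the fallback delete was needed.
    """
    if wait_for_removal(grace_seconds):
        return "operator"
    # Kustomize-deployed controllers are not owned by DSC, so delete directly.
    delete_deployment()
    if not wait_for_removal(timeout_seconds):
        raise TimeoutError("maas-controller Deployment still present after delete")
    return "explicit"
```

Only after this returns is it safe to delete Config, since a running controller would otherwise recreate it.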
jland-redhat requested a review from a team as a code owner May 21, 2026 14:38

rhods-ci-bot commented

@jira-autofix[bot]: The following test has Failed:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
       mkdir -p oras-artifacts
       cd oras-artifacts
       oras pull quay.io/opendatahub/odh-ci-artifacts:maas-group-test-gc7k7

openshift-ci Bot commented May 31, 2026

PR needs rebase.
