test: add E2E uninstall test for MaaS infrastructure teardown [RHOAIENG-62678]#911
test: add E2E uninstall test for MaaS infrastructure teardown [RHOAIENG-62678]#911jira-autofix[bot] wants to merge 9 commits into
Conversation
…NG-62678] Add automated E2E test that verifies deleting the MaaS Config CR and parent operator top-level CRs (DataScienceCluster, DSCInitialization) fully removes all MaaS-owned infrastructure. This prevents orphaned controllers, workloads, routes, and namespaced CRs from surviving uninstall, which complicates upgrades, reuse, and compliance reviews. The test executes an ordered delete sequence (Config → DSC → DSCI), then asserts within a bounded timeout that no MaaS CRD instances, workloads, or HTTPRoutes remain. On failure it dumps all remaining resources for debugging. The test runs as Phase 4 in the Prow smoke pipeline, after all functional E2E tests and deployment validation. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
|
@jira-autofix[bot]: This pull request references RHOAIENG-62678 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "rhoai-3.5" instead. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jira-autofix[bot] The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @jira-autofix[bot]. Thanks for your PR. I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a destructive end-to-end uninstall pytest module that deletes finalizer-bearing MaaS CRs, MaaS Config, DataScienceCluster, and DSCInitialization, then polls for garbage collection. Implements oc subprocess helpers, resource collection/formatting utilities, an autouse fixture orchestrating uninstall, multiple assertions that no MaaS CRs/workloads/HTTPRoutes/subscription CRs remain, and a diagnostic dump on residuals. Adds run_uninstall_test() to the Prow smoke script to prepare the e2e venv, run pytest for test_uninstall.py, produce HTML/JUnit artifacts, and a Phase 4 CI control flow that conditionally skips or runs the destructive test. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Security & Operational Notes
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 906-913: The script always invokes run_uninstall_test (the
destructive teardown) regardless of SKIP_DEPLOYMENT, which is unsafe for shared
clusters; update the logic around the Phase 4 block (print_header "Running
Uninstall E2E Test" and run_uninstall_test) to check SKIP_DEPLOYMENT (or an
equivalent gate variable) and skip calling run_uninstall_test when
SKIP_DEPLOYMENT=true, emitting a clear informational message instead; ensure the
check uses the same variable used earlier in the script and preserves existing
behavior when SKIP_DEPLOYMENT is unset/false so only non-deployment runs avoid
the destructive teardown.
In `@test/e2e/tests/test_uninstall.py`:
- Around line 67-73: The helper _oc currently calls subprocess.run with a
relative "oc" which trusts PATH (CWE-426); change it to resolve and use an
absolute oc binary path before invoking subprocess.run (e.g. use
shutil.which("oc") to find the full path or read a configured OC_BINARY env var,
fail fast if not found) and pass that absolute path in place of "oc" in the argv
list inside _oc so the test teardown cannot be hijacked by a poisoned PATH.
- Around line 98-108: The _list_resources() function silently turns unexpected
oc get failures and JSON parse errors into empty lists; change it to only return
[] for the known, harmless cases ("the server doesn't have a resource type" and
"not found") and otherwise raise an exception (e.g., RuntimeError or re-raise
subprocess.CalledProcessError) including result.stderr and result.returncode so
test failures are visible; likewise, on JSONDecodeError do not return [] but
raise an error with the stdout content and the failed command context. Ensure
you update the paths using _list_resources(), so callers expect an exception on
real failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9bf053b1-df35-4a96-a29c-428b02dc08b0
📒 Files selected for processing (2)
test/e2e/scripts/prow_run_smoke_test.shtest/e2e/tests/test_uninstall.py
Gate the Phase 4 uninstall test when SKIP_DEPLOYMENT=true to prevent destructive teardown on shared/existing clusters. Resolve the oc binary to an absolute path at import time (CWE-426) to avoid PATH injection. Make _list_resources() raise on unexpected oc failures and JSON parse errors instead of silently returning empty lists that mask real issues. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Line 918: The final success echo always claims uninstall completed even when
the script took the skip path at lines 911-913; modify the script to track
whether uninstall was skipped (e.g., set a flag like UNINSTALL_SKIPPED when
taking the skip branch in the block around lines 911-913) and replace the
unconditional echo "🎉 Deployment and uninstall tests completed successfully!"
with a conditional that prints "🎉 Deployment tests completed; uninstall
skipped." when UNINSTALL_SKIPPED is true and the original success message only
when uninstall actually ran.
- Around line 595-600: The pytest invocation currently uses the --maxfail=1 flag
which makes pytest stop early and can prevent running
test_diagnostic_dump_on_residual after an earlier uninstall failure; remove the
--maxfail=1 flag (or set it to 0/no flag) in the pytest command in the script so
pytest will run the diagnostic dump test even if previous tests fail, keeping
the rest of the options and the target test file tests/test_uninstall.py
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c343a36d-4747-4aa1-81c6-9ce21ccf4e50
📒 Files selected for processing (2)
test/e2e/scripts/prow_run_smoke_test.shtest/e2e/tests/test_uninstall.py
…HOAIENG-62678] Remove --maxfail=1 from uninstall pytest invocation so the diagnostic dump test runs even when earlier assertions fail, ensuring complete failure diagnostics. Also fix the final success message to say "Smoke test phases completed" instead of "Deployment and uninstall tests completed" so it remains accurate when the uninstall phase is skipped. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
|
@jira-autofix[bot]: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
…HOAIENG-62678] The final success echo now conditionally reports whether the uninstall test ran or was skipped, so CI output accurately reflects what was executed. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
… [RHOAIENG-62678] The uninstall test was failing because it deleted the Config CR first, which garbage-collected the maas-controller. Without the controller running, MaaSModelRefs, MaaSAuthPolicies, and MaaSSubscriptions could not have their finalizers (maas.opendatahub.io/*-cleanup) processed, leaving them stuck in Terminating state indefinitely. Fix the delete sequence to remove user-created CRs with finalizers first (while the controller is still alive), then delete Config, DSC, and DSCI. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/tests/test_uninstall.py`:
- Line 57: UNINSTALL_TIMEOUT is parsed from env but can be zero or negative
which causes wait loops to skip; after converting to int for UNINSTALL_TIMEOUT
validate that it is a positive integer (>0) and raise a clear ValueError (or
assert) with an explanatory message if the value is <= 0 so misconfiguration
like E2E_UNINSTALL_TIMEOUT=0 fails fast and clearly; update the assignment site
of UNINSTALL_TIMEOUT and any initialization logic that uses it to perform this
check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0df7dfe2-3f03-4695-a511-02d36b1e472b
📒 Files selected for processing (1)
test/e2e/tests/test_uninstall.py
…AIENG-62678] The LifecycleReconciler recreates Config/default whenever it is deleted while the maas-controller Deployment is still running. The previous delete order (Config before DSC/DSCI) caused Config to be immediately recreated, leaving Config, Tenant, controller Deployment, and HTTPRoutes as residuals. Reorder the sequence to delete DSC/DSCI first, wait for the controller Deployment to be removed, then delete Config. Also validate that E2E_UNINSTALL_TIMEOUT is positive per review feedback. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
…RHOAIENG-62678] In CI the controller is deployed via kustomize, not managed by DSC. Deleting DSC/DSCI does not remove the controller, so the LifecycleReconciler keeps recreating Config and all MaaS resources survive uninstall. The fix waits briefly for operator-driven removal (for operator-managed clusters), then falls back to explicitly deleting the maas-controller Deployment before deleting Config. Closes RHOAIENG-62678 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Jamie Land <jland@redhat.com>
|
@jira-autofix[bot]: The following test has Failed: OCI Artifact Browser URLInspecting Test Artifacts ManuallyTo inspect your test artifacts manually, follow these steps:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:maas-group-test-gc7k7 |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Add automated E2E test that verifies deleting the MaaS Config CR and
parent operator top-level CRs (DataScienceCluster, DSCInitialization)
fully removes all MaaS-owned infrastructure. This prevents orphaned
controllers, workloads, routes, and namespaced CRs from surviving
uninstall, which complicates upgrades, reuse, and compliance reviews.
The test executes an ordered delete sequence (Config → DSC → DSCI),
then asserts within a bounded timeout that no MaaS CRD instances,
workloads, or HTTPRoutes remain. On failure it dumps all remaining
resources for debugging. The test runs as Phase 4 in the Prow smoke
pipeline, after all functional E2E tests and deployment validation.
Closes RHOAIENG-62678
Co-Authored-By: Claude noreply@anthropic.com
Signed-off-by: Jamie Land jland@redhat.com
Summary by CodeRabbit