Skip to content

test: validate Tenant reconcile health in e2e tests#925

Open
EgorLu wants to merge 1 commit into
opendatahub-io:mainfrom
EgorLu:test/RHOAIENG-62458-tenant-reconcile-health
Open

test: validate Tenant reconcile health in e2e tests#925
EgorLu wants to merge 1 commit into
opendatahub-io:mainfrom
EgorLu:test/RHOAIENG-62458-tenant-reconcile-health

Conversation

@EgorLu
Copy link
Copy Markdown
Contributor

@EgorLu EgorLu commented May 20, 2026

Summary

  • Adds test/e2e/tests/test_tenant_health.py with two test classes:
    • TestTenantReconcileHealth — asserts the Tenant CR reaches Phase=Active with healthy conditions (Ready=True, DependenciesAvailable, MaaSPrerequisitesAvailable, DeploymentsAvailable, Degraded=False). Fails fast on Phase=Failed with the reconciler's error message (e.g., RBAC forbidden).
    • TestTenantManagedResources — audits that expected platform resources (Deployments, Services, HTTPRoutes, AuthPolicies, PodMonitors, ClusterRoles, etc.) exist on the cluster by querying the maas.opendatahub.io/tenant-name tracking label.
  • Adds test_tenant_health.py as the first test file in prow_run_smoke_test.sh so reconcile failures fail the suite early with clear diagnostics.

Motivation

PR #813 added a PodMonitor without a matching RBAC marker. The Tenant reconciler failed on every loop with Forbidden, but e2e tests passed because they only check maas-api deployment readiness and API endpoints. This was caught by downstream ODH operator tests after merge and required hotfix PR #859.

These tests close that gap by validating the Tenant CR's own health status and auditing resource existence.

Jira: RHOAIENG-62458

Risk analysis

  • Risk rating: 1
  • Why: Test-only change. Adds new assertions to the e2e suite without modifying any production code, controller logic, or deployment manifests. Worst case is a false-positive test failure in CI, which is easily diagnosed and fixed.

Test plan

  • Python syntax verified (py_compile)
  • pytest collects all 16 tests with correct parametrization (--collect-only)
  • Prow CI runs prow_run_smoke_test.sh which now includes test_tenant_health.py first
  • Verify all 16 tests pass on a cluster with the Tenant CRD deployed

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added end-to-end tests validating Tenant reconciliation health and that expected cluster resources are created and managed (core, networking, RBAC, monitoring, optional dashboards/datasources).
    • Smoke test suite now runs the new Tenant health tests as part of regular verification.

@EgorLu
Copy link
Copy Markdown
Contributor Author

EgorLu commented May 20, 2026

@coderabbitai review

@openshift-ci openshift-ci Bot requested review from SB159 and liangwen12year May 20, 2026 12:57
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: EgorLu
Once this PR has been reviewed and has the lgtm label, please assign jland-redhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

Warning

Rate limit exceeded

@EgorLu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 57 minutes and 32 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 964d8e01-406a-4f81-a7a7-a3691aada4c2

📥 Commits

Reviewing files that changed from the base of the PR and between c3f54f1 and f591f24.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_tenant_health.py
📝 Walkthrough

Walkthrough

Adds a new end-to-end pytest module that checks Tenant reconciliation health and validates resources the Tenant should create. Introduces resilient oc execution with retries/timeouts, helpers to fetch and poll Tenant conditions/phase, label-based resource listing with CRD presence checks, and two test classes (reconcile health and managed-resources assertions). The smoke-test script is updated to run this new module first.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Security and code quality findings

  • CWE-78 / CWE-77 (OS Command Injection): _run_oc constructs and executes shell commands. Ensure all oc arguments are controlled, avoid shell=True, and validate/escape inputs to prevent injection.
  • CWE-73 (External Control of File Name or Path): _run_oc retry/timeout logic logs command output; verify no untrusted input flows into command construction or logged paths.
  • CWE-20 (Improper Input Validation): JSON from oc is parsed and indexed (e.g., .status.conditions, .items) without schema checks. Add defensive checks for expected keys/types before access to avoid KeyError/TypeError.
  • CWE-539 / CWE-532 (Information Exposure Through Log Files): stderr/stdout from oc are logged on failures; redact or avoid logging potentially sensitive cluster data (tokens, URLs, internal identifiers).
  • Robustness / Flakiness risk: _wait_for_tenant_ready uses a hard timeout/poll cadence; make timeout configurable via environment to accommodate slow clusters and reduce flakes.
  • Unsupported-API handling: _list_resources_by_label may return empty or None for unsupported kinds. Tests assert mandatory resource counts without verifying CRD/API presence; add explicit CRD/API checks (like _crd_exists) before asserting required kinds or fail with a clear message.
  • Test configuration flexibility: Several minimum-count expectations are hardcoded; expose thresholds via environment or pytest parameters to support different cluster profiles.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test: validate Tenant reconcile health in e2e tests' directly describes the main change: adding e2e tests to validate Tenant health. It aligns with the changeset, which adds test_tenant_health.py and modifies the smoke test runner.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@EgorLu
Copy link
Copy Markdown
Contributor Author

EgorLu commented May 20, 2026

CI failures are unrelated to this PR:

  • lint / test: maas-api/internal/handlers/models_test.go:1294subscription.NewSelector called with wrong argument count. This is a pre-existing issue on main (Go code, not touched by this PR).
  • Prow image jobs: CI infrastructure failures, not related to test changes.

This PR only adds a Python test file and a one-line shell script change. No Go code was modified.

@EgorLu
Copy link
Copy Markdown
Contributor Author

EgorLu commented May 20, 2026

/retest

@EgorLu EgorLu force-pushed the test/RHOAIENG-62458-tenant-reconcile-health branch from 1f12a43 to 5e07b25 Compare May 20, 2026 13:42
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/tests/test_tenant_health.py`:
- Around line 45-47: The subprocess.run call in _run_oc should include a hard
timeout to prevent a stuck oc process from hanging tests; update the call inside
_run_oc to pass timeout=TENANT_HEALTH_TIMEOUT (or a suitably-named timeout
constant) and add handling for subprocess.TimeoutExpired—catch the exception in
_run_oc, log or return a failed result indicating a timeout, and ensure the
child process is treated as failed so TENANT_HEALTH_TIMEOUT is honored.
- Around line 139-142: The _crd_exists function currently checks CRD presence
with a substring match against the full `oc api-resources` table; change it to
call _run_oc(["api-resources", "--api-group", api_group, "-o", "name"]) so the
output contains only resource names, split the stdout into lines and build a set
of names, then check for exact membership of `resource` (or resource.lower() if
normalizing) in that set; update the call site in _crd_exists and keep using
_run_oc for execution and rc handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0bae1e72-2d60-41fd-8627-b6304a9fc825

📥 Commits

Reviewing files that changed from the base of the PR and between 43bd2ba and 5e07b25.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_tenant_health.py

Comment thread test/e2e/tests/test_tenant_health.py Outdated
Comment thread test/e2e/tests/test_tenant_health.py Outdated
@EgorLu
Copy link
Copy Markdown
Contributor Author

EgorLu commented May 20, 2026

/retest

@EgorLu EgorLu requested a review from a team as a code owner May 21, 2026 10:15
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/tests/test_tenant_health.py`:
- Around line 46-47: The test currently invokes subprocess.run(["oc", *args],
...) which resolves "oc" via PATH and can be hijacked; resolve and pin the
absolute path once (e.g., call shutil.which("oc") into a variable like oc_path
at module init or test setup, fail the test if not found) and replace
subprocess.run(["oc", *args], ...) with subprocess.run([oc_path, *args], ...) so
the binary is unambiguous and you explicitly error if the binary cannot be
located.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b70357dc-581f-452b-bc87-3c48a7949a68

📥 Commits

Reviewing files that changed from the base of the PR and between 5e07b25 and c3f54f1.

📒 Files selected for processing (1)
  • test/e2e/tests/test_tenant_health.py

Comment thread test/e2e/tests/test_tenant_health.py
@ishitasequeira
Copy link
Copy Markdown
Contributor

@EgorLu , can you check the PR #869 which is trying to add tests for Tenant status checks as well? Are the cases introduced by your tests not covered in that PR?

@EgorLu
Copy link
Copy Markdown
Contributor Author

EgorLu commented May 21, 2026

@ishitasequeira Good catch — there is overlap with #869. Here's the breakdown:

Duplicated by #869 (can be removed from this PR):

  • Tenant existence check
  • Ready condition + phase validation (Active/Degraded)
  • Conditions presence check

Unique to this PR (not covered by #869):

  • TestTenantManagedResources — label-based resource audit that queries the cluster for all resources the Tenant reconciler applied via the maas.opendatahub.io/tenant-name tracking label. Asserts Deployments, Services, HTTPRoutes, AuthPolicies, PodMonitors, ClusterRoles, etc. exist. This is the part that would have caught the PodMonitor RBAC incident from PR feat(metrics): add Prometheus metrics instrumentation to maas-api #813.
  • Asserts specific condition values (DependenciesAvailable=True, MaaSPrerequisitesAvailable=True, DeploymentsAvailable=True), not just that conditions exist with the right shape.

I'll rebase on main (which now has #869) and strip the duplicated TestTenantReconcileHealth class, keeping only the resource audit and the condition-value assertions that #869 doesn't cover.

Add test_tenant_health.py with label-based resource audit that queries
the cluster for all resources the Tenant reconciler applied via the
maas.opendatahub.io/tenant-name tracking label. Catches RBAC gaps and
manifest-apply failures (like the PodMonitor incident from PR opendatahub-io#813)
at PR time instead of after merge.

Complements test_tenant.py (PR opendatahub-io#869) which covers Tenant lifecycle,
phase, and status shape. This PR adds:
- TestTenantConditionValues: asserts specific condition values
  (DependenciesAvailable, MaaSPrerequisitesAvailable,
  DeploymentsAvailable) are True
- TestTenantManagedResources: audits core, networking, RBAC,
  monitoring, and optional observability resources exist

RHOAIENG-62458

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EgorLu EgorLu force-pushed the test/RHOAIENG-62458-tenant-reconcile-health branch from c3f54f1 to f591f24 Compare May 21, 2026 14:27
@EgorLu
Copy link
Copy Markdown
Contributor Author

EgorLu commented May 21, 2026

Shortened the tests quite a lot :)

@rhods-ci-bot
Copy link
Copy Markdown

@EgorLu: The following test has Failed:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:maas-group-test-knp7v

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants