
feat: add E2E tests for external models (egress) #632

Open
noyitz wants to merge 3 commits into opendatahub-io:main from noyitz:feat/external-model-e2e-tests

Conversation


@noyitz noyitz commented Mar 27, 2026

Summary

E2E tests for the ExternalModel feature, focused on MaaS capabilities:

  • Discovery: ExternalModel reconciler creates MaaSModelRef, HTTPRoute, backend Service
  • Auth: Invalid/missing API key returns 401/403
  • Egress: Request with valid key passes auth and reaches external endpoint
  • Cleanup: Deleting MaaSModelRef removes HTTPRoute via finalizer

Uses httpbin.org as the external endpoint (configurable via
E2E_EXTERNAL_ENDPOINT). No BBR/plugin dependency — tests validate
MaaS egress routing and auth, not payload transformation.

Changes

  • test/e2e/tests/test_external_models.py: 7 tests covering discovery,
    auth, egress connectivity, and cleanup
  • test/e2e/scripts/prow_run_smoke_test.sh: External model tests section
    (commented out until CI includes ExternalModel reconciler)

Test plan

  • All 7 tests passing against RHOAI cluster with httpbin.org
  • No BBR or simulator dependency

Summary by CodeRabbit

  • Tests

    • Added E2E tests for external-model discovery, auth (invalid/missing API keys), egress/forwarding to external endpoints, and cleanup to ensure routes are removed.
    • New module-scoped setup provisions credentials, model/subscription resources, creates an API key, and tears down resources after tests.
  • Chores

    • CI smoke runner now executes the external-model E2E suite (replacing the previous external test), producing separate test artifacts.


openshift-ci bot commented Mar 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: noyitz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new end-to-end pytest module test/e2e/tests/test_external_models.py that provisions Kubernetes CRs and a credential Secret via oc, polls CR status, creates an API key, and exercises the gateway over HTTP to validate discovery, auth, egress, and cleanup. The module includes subprocess-based OC helpers, a module-scoped fixture external_models_setup, and four test classes covering model discovery, auth rejection, egress forwarding, and cleanup. The CI script test/e2e/scripts/prow_run_smoke_test.sh is adjusted to run test_external_models.py instead of test_external_oidc.py in the e2e pytest invocation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security findings

  • CWE-78 (OS Command Injection): Helpers invoke oc via shell subprocesses using environment-driven values. Actionable: stop using shell=True; build subprocess calls with argument arrays or use a Kubernetes client SDK; strictly validate/sanitize any input interpolated into commands.

  • CWE-295 (Improper Certificate Validation): TLS verification is optionally disabled when contacting the external endpoint. Actionable: limit TLS skip to ephemeral test-only contexts, fail the test if TLS must be skipped in production-like environments, and avoid persisting any global TLS configuration changes.

  • CWE-522 (Insufficiently Protected Credentials): Tests create and use API keys and Kubernetes Secrets. Actionable: create short-lived test keys, scope keys to minimal permissions, avoid logging secrets, and redact them from CI artifacts.

  • CWE-269 / CWE-285 (Excessive Privileges / Improper Authorization): Tests create/patch/delete CRs and remove finalizers. Actionable: run tests with a narrowly scoped service account, require explicit approval for finalizer removal, and ensure recorded justification and audit logging for finalizer edits.

  • CWE-200 (Exposure of Sensitive Information): Test logs and JUnit/HTML artifacts may contain API keys or tokens. Actionable: scrub/redact secrets from test output before storing artifacts and ensure CI does not persist raw secret values.
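Several of these findings (CWE-522, CWE-200) can be mitigated with a logging filter that masks key-like values before they reach CI artifacts. A minimal sketch, assuming keys follow an `sk-...` or `Bearer ...` shape (the pattern and class name are illustrative, not from the PR):

```python
import logging
import re

# Illustrative pattern: values that look like API keys or bearer tokens.
SECRET_RE = re.compile(r"(sk-[A-Za-z0-9]{8,}|Bearer\s+\S+)")

class RedactSecrets(logging.Filter):
    """Mask key-like substrings in every log record before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so %-style args are merged, then redact.
        message = record.getMessage()
        record.msg = SECRET_RE.sub("***REDACTED***", message)
        record.args = None
        return True

log = logging.getLogger("e2e")
log.addFilter(RedactSecrets())
```

Attaching the filter to the test logger (or to the handler pytest's log capture uses) keeps raw key material out of JUnit/HTML artifacts without changing any test logic.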

🚥 Pre-merge checks: ✅ 2 passed

  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat: add E2E tests for external models (egress)' directly and clearly summarizes the main change: introducing end-to-end tests for external model functionality with a focus on egress routing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
test/e2e/tests/test_external_models.py (1)

228-231: Forcibly clearing finalizers bypasses controller cleanup.

Patching finalizers: [] before deletion prevents the reconciler from running its cleanup logic. This could leave orphaned HTTPRoutes or Services if the test environment is reused. Consider deleting without finalizer removal first, with a short timeout, and only force-remove finalizers if deletion hangs.
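The delete-first flow suggested here could be factored as a small helper. A sketch with the oc helpers passed in as callables (the function name and timeout values are illustrative; `_delete_cr`, `_get_cr`, and `_patch_cr` refer to the module's existing helpers):

```python
import time
from typing import Callable

def delete_with_fallback(delete_cr: Callable, get_cr: Callable, patch_cr: Callable,
                         kind: str, name: str, namespace: str,
                         timeout: float = 30.0, poll: float = 2.0) -> bool:
    """Delete a CR and wait for the controller's finalizer to clean it up.
    Only strip finalizers if deletion hangs past the timeout. Returns True
    when cleanup was graceful, False when finalizers had to be forced."""
    delete_cr(kind, name, namespace)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_cr(kind, name, namespace) is None:
            return True  # reconciler ran its own cleanup logic
        time.sleep(poll)
    # Deletion is stuck: force-remove finalizers, then delete again.
    patch_cr(name, namespace, {"metadata": {"finalizers": []}})
    delete_cr(kind, name, namespace)
    return False
```

In the test this would replace the unconditional finalizer patch, e.g. `delete_with_fallback(_delete_cr, _get_cr, _patch_cr, "maasmodelref", p["name"], MODEL_NAMESPACE)`.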

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 228 - 231, The test
currently force-clears finalizers on the maasmodelref via _patch_cr(p["name"],
MODEL_NAMESPACE, {"metadata": {"finalizers": []}}) before calling _delete_cr,
which bypasses controller cleanup; change the flow to call
_delete_cr("maasmodelref", p["name"], MODEL_NAMESPACE) first and wait briefly
for the resource to be removed (use an existing wait/poll helper or short
timeout), and only if the resource still exists after the timeout call _patch_cr
to clear finalizers and then _delete_cr again; keep references to the same
identifiers (p["name"], MODEL_NAMESPACE, _delete_cr, _patch_cr) so the logic is
easy to locate and review.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 344-368: The test test_rate_limit_returns_429 is flaky because it
assumes ~10 tokens per response; update the test to be deterministic by either
configuring the external_models_setup fixture to return a fixed token usage for
the chosen provider (PROVIDERS[1]) or, if fixture changes are not possible,
increase the number of requests and add simple retry/backoff logic in the
for-loop that calls the endpoint (use a larger max iterations and sleep between
batches) until a 429 is observed or a higher threshold is reached; modify the
loop that posts to url and the final assertion to reflect the new max attempts
and retry behavior so the test reliably observes a 429 without relying on
assumed token counts.
- Line 38: RECONCILE_WAIT is set using int(os.environ.get(...)) which will raise
a ValueError if E2E_RECONCILE_WAIT contains non-numeric text; update the
initialization in test_external_models.py to safely parse the env var (e.g.,
read the string, try to convert with int() inside a try/except, or use
str.isdigit() validation) and fall back to the default 12 (or log a warning)
when conversion fails so test collection doesn’t crash.
- Around line 214-215: The code logs a slice of api_key without checking for
None, which will raise TypeError if the response lacks "key"; update the test
around the api_key assignment (variable api_key and the log.info call) to first
assert or check that api_key is not None (or truthy), and if it is missing fail
the test with a clear message or log the full response before slicing; only
perform api_key[:15] after confirming api_key exists so the log.info call is
safe.
- Around line 376-424: The test_delete_removes_httproute test can leak the
temporary ExternalModel if an assertion fails because the cleanup call
_delete_cr("externalmodel", temp_name, MODEL_NAMESPACE) is after assertions;
wrap the main test logic (creation, waits, assertions, deletion of MaaSModelRef)
in a try/finally and move the ExternalModel deletion into the finally block (or
use a fixture that yields and always deletes the created ExternalModel) so that
_delete_cr("externalmodel", temp_name, MODEL_NAMESPACE) always runs even when
assertions in test_delete_removes_httproute fail.
- Around line 80-88: The _get_cr function currently calls
json.loads(result.stdout) without guarding against malformed JSON; wrap the JSON
parsing in a try/except catching json.JSONDecodeError (and optionally ValueError
for older Python) and return None (or log the parsing error) when decoding fails
so a non-JSON or partial output from oc doesn't raise; update the function
around the json.loads call to handle the exception and reference _get_cr and
json.loads/JSONDecodeError in your fix.

---

Nitpick comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 228-231: The test currently force-clears finalizers on the
maasmodelref via _patch_cr(p["name"], MODEL_NAMESPACE, {"metadata":
{"finalizers": []}}) before calling _delete_cr, which bypasses controller
cleanup; change the flow to call _delete_cr("maasmodelref", p["name"],
MODEL_NAMESPACE) first and wait briefly for the resource to be removed (use an
existing wait/poll helper or short timeout), and only if the resource still
exists after the timeout call _patch_cr to clear finalizers and then _delete_cr
again; keep references to the same identifiers (p["name"], MODEL_NAMESPACE,
_delete_cr, _patch_cr) so the logic is easy to locate and review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 797db861-a123-432e-bc0c-32fe0a5ea895

📥 Commits

Reviewing files that changed from the base of the PR and between d64c0fe and 5020b24.

📒 Files selected for processing (1)
  • test/e2e/tests/test_external_models.py

SUBSCRIPTION_NAMESPACE = os.environ.get("E2E_SUBSCRIPTION_NAMESPACE", "models-as-a-service")
EXTERNAL_SUBSCRIPTION = os.environ.get("E2E_EXTERNAL_SUBSCRIPTION", "e2e-external-subscription")
EXTERNAL_AUTH_POLICY = os.environ.get("E2E_EXTERNAL_AUTH_POLICY", "e2e-external-access")
RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))

@coderabbitai coderabbitai bot Mar 27, 2026


⚠️ Potential issue | 🟡 Minor

Uncaught ValueError if E2E_RECONCILE_WAIT is non-numeric.

If the environment variable contains a non-integer string, int() will raise ValueError, crashing test collection.

Proposed fix
-RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+try:
+    RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+except ValueError:
+    RECONCILE_WAIT = 12
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` at line 38, RECONCILE_WAIT is set
using int(os.environ.get(...)) which will raise a ValueError if
E2E_RECONCILE_WAIT contains non-numeric text; update the initialization in
test_external_models.py to safely parse the env var (e.g., read the string, try
to convert with int() inside a try/except, or use str.isdigit() validation) and
fall back to the default 12 (or log a warning) when conversion fails so test
collection doesn’t crash.

Collaborator Author


Still applicable — but this is an E2E test script, not production code. If the env var is wrong, crashing early with a clear ValueError is the right behavior.

Contributor


@noyitz — fair point. For E2E test infrastructure, fail-fast is the right call: silently falling back to 12 could mask a misconfigured env var and produce misleading results. A hard ValueError surfaces the problem immediately at collection time. Withdrawing the suggestion.

Comment on lines +80 to +88
def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
    """Get a Kubernetes resource as dict, or None if not found."""
    result = subprocess.run(
        ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)

@coderabbitai coderabbitai bot Mar 27, 2026


⚠️ Potential issue | 🟡 Minor

json.loads can raise JSONDecodeError on malformed output.

If oc get returns non-JSON output (e.g., partial output on timeout, error message despite exit code 0), json.loads will raise an unhandled exception.

Proposed fix
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
     """Get a Kubernetes resource as dict, or None if not found."""
     result = subprocess.run(
         ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
         capture_output=True, text=True,
     )
     if result.returncode != 0:
         return None
-    return json.loads(result.stdout)
+    try:
+        return json.loads(result.stdout)
+    except json.JSONDecodeError:
+        log.warning(f"Failed to parse JSON for {kind}/{name}: {result.stdout[:100]}")
+        return None
🧰 Tools
🪛 Ruff (0.15.7)

[error] 82-82: subprocess call: check for execution of untrusted input

(S603)


[error] 83-83: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 80 - 88, The _get_cr
function currently calls json.loads(result.stdout) without guarding against
malformed JSON; wrap the JSON parsing in a try/except catching
json.JSONDecodeError (and optionally ValueError for older Python) and return
None (or log the parsing error) when decoding fails so a non-JSON or partial
output from oc doesn't raise; update the function around the json.loads call to
handle the exception and reference _get_cr and json.loads/JSONDecodeError in
your fix.

Collaborator Author


Fair point, but _get_cr already checks returncode != 0 and returns None. If oc get exits 0 with non-JSON output, that's a cluster bug. Keeping it simple for E2E tests.

Contributor


@noyitz, fair enough — oc get -o json exiting 0 with non-JSON output is indeed outside the contract of the tool, and hardening against cluster-level defects in E2E test helpers adds noise without real benefit. Acknowledged.

Comment on lines +214 to +215
api_key = r.json().get("key")
log.info(f"API key created: {api_key[:15]}...")

@coderabbitai coderabbitai bot Mar 27, 2026


⚠️ Potential issue | 🟡 Minor

Missing null check on api_key before slicing.

If the response JSON lacks a key field, api_key will be None, and api_key[:15] will raise TypeError. The test will fail with an unclear error.

Proposed fix
     api_key = r.json().get("key")
+    if not api_key:
+        pytest.fail(f"API key response missing 'key' field: {r.json()}")
-    log.info(f"API key created: {api_key[:15]}...")
+    log.info(f"API key created: {api_key[:15]}...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 214 - 215, The code logs
a slice of api_key without checking for None, which will raise TypeError if the
response lacks "key"; update the test around the api_key assignment (variable
api_key and the log.info call) to first assert or check that api_key is not None
(or truthy), and if it is missing fail the test with a clear message or log the
full response before slicing; only perform api_key[:15] after confirming api_key
exists so the log.info call is safe.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — but the line above already calls pytest.fail() if the status code isn't 200/201, so we'd never reach this line with a bad response. `.get("key")` returning None would be a server bug worth crashing on.

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noyitz, fair point — the status-code guard does significantly reduce the probability of hitting this path with a malformed response. The residual risk is a 200/201 reply that is well-formed HTTP but has an unexpected JSON body (e.g., a schema change where the key moves to "apiKey" instead of "key"). In that case the TypeError stack trace is harder to diagnose than an explicit message, but that's a minor ergonomic concern rather than a correctness bug.

Acknowledged — happy to leave it as-is if you prefer a loud crash to signal a server contract violation.

noyitz added 2 commits March 29, 2026 19:04
Adds test_external_models.py with 29 tests covering:
- Discovery: MaaSModelRef, HTTPRoute, backend service creation (5 providers)
- Inference: chat completions 200 + valid OpenAI format (5 providers)
- Auth: invalid key 401, no key 401
- Rate limiting: token rate limit 429
- Cleanup: deletion removes resources via finalizer

All tests parametrized across OpenAI, Anthropic, Azure OpenAI, Vertex AI,
and Bedrock providers using a configurable external simulator endpoint.

Environment: E2E_SIMULATOR_ENDPOINT (default: 3.150.113.9)
Commented out until BBR plugins are deployed as part of CI.
Prerequisites: BBR image, EnvoyFilter, simulator endpoint accessible.

srampal commented Mar 30, 2026

How many of these E2E tests were also verified with the corresponding real cloud services vs only with the simulator ?

@noyitz noyitz force-pushed the feat/external-model-e2e-tests branch from 6baea2b to 4cdd1b2 on March 30, 2026 at 22:59
@noyitz noyitz changed the title from "feat: add E2E tests for external models (all 5 providers)" to "feat: add E2E tests for external models (egress)" on Mar 30, 2026
@noyitz noyitz force-pushed the feat/external-model-e2e-tests branch from 4cdd1b2 to faba622 on March 30, 2026 at 23:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (4)
test/e2e/tests/test_external_models.py (4)

290-341: ⚠️ Potential issue | 🟡 Minor

Temporary cleanup is not guaranteed when assertions fail.

Wrap create/assert/delete logic in try/finally so temporary MaaSModelRef and ExternalModel resources are always deleted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 290 - 341, The
test_delete_removes_httproute creates temporary ExternalModel and MaaSModelRef
resources but doesn't guarantee cleanup on assertion failure; refactor
test_delete_removes_httproute to wrap the creation, verification, and deletion
logic in a try/finally block so that _delete_cr is always called for both the
MaaSModelRef and the ExternalModel (use temp_name and MODEL_NAMESPACE), keeping
the existing _apply_cr, _get_cr, and asserts inside try and moving the two
_delete_cr calls into finally to ensure cleanup even on failures.

187-188: ⚠️ Potential issue | 🟡 Minor

API key is sliced without validating presence.

If the response omits "key", this throws TypeError before a clear failure message.

Proposed fix
 api_key = r.json().get("key")
+if not api_key:
+    pytest.fail(f"API key response missing 'key': {r.text}")
 log.info(f"API key created: {api_key[:15]}...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 187 - 188, The test
slices api_key without checking it exists (api_key = r.json().get("key");
log.info(f"API key created: {api_key[:15]}...")), which will raise TypeError if
"key" is missing; update the test to validate presence first by asserting or
raising a clear error when api_key is falsy (e.g., assert api_key is not None or
raise RuntimeError with response content), then perform the safe slice for
logging (or log the full value only after validation) so failures report the
missing key and response details clearly.

72-80: ⚠️ Potential issue | 🟡 Minor

_get_cr does not handle malformed JSON output from oc get.

If stdout is partial/non-JSON, json.loads raises and aborts the suite.

Proposed fix
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
@@
-    return json.loads(result.stdout)
+    try:
+        return json.loads(result.stdout)
+    except json.JSONDecodeError:
+        log.warning(f"Failed to parse JSON for {kind}/{name}: {result.stdout[:200]}")
+        return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 72 - 80, The _get_cr
function currently calls json.loads(result.stdout) unguarded, which crashes the
test suite if oc returns partial or malformed JSON; update _get_cr to catch JSON
decoding errors (catch json.JSONDecodeError) and return None (or optionally log
the stdout/stderr for debugging) when decoding fails, and ensure you still check
result.returncode != 0 before parsing; reference the _get_cr function and the
json.loads call when making the change.

36-36: ⚠️ Potential issue | 🟡 Minor

E2E_RECONCILE_WAIT can crash test collection on non-numeric input.

A non-integer env value raises ValueError at import time.

Proposed fix
-RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+try:
+    RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+except ValueError:
+    RECONCILE_WAIT = 12
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` at line 36, The constant
E2E_RECONCILE_WAIT is parsed with int(...) directly from os.environ which will
raise ValueError during import for non-numeric env values; change the
initialization of E2E_RECONCILE_WAIT to parse the environment value safely by
reading os.environ.get("E2E_RECONCILE_WAIT") into a variable and converting
inside a try/except (or using a safe numeric check), falling back to the default
12 when conversion fails or the value is missing so imports won't crash; update
the usage site where E2E_RECONCILE_WAIT is referenced accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 505-525: Replace the commented external-model block with a
feature-gated runtime check using an env var (e.g. RUN_EXTERNAL_MODEL_E2E); if
RUN_EXTERNAL_MODEL_E2E is true, export defaults for E2E_SIMULATOR_ENDPOINT and
E2E_SKIP_TLS_VERIFY (if not already set), echo a start message, run pytest on
test_external_models.py with the same flags, and exit non-zero on failure so CI
fails explicitly; keep the original pytest invocation and artifact variables
(ARTIFACTS_DIR, test_dir) and ensure the script prints a clear error when tests
fail.

In `@test/e2e/tests/test_external_models.py`:
- Around line 39-40: The test currently only uses a single provider (openai) via
EXTERNAL_MODEL_NAME; update the test fixtures and test functions in
test_external_models.py to be parameterized over providers and corresponding
model names (e.g., providers =
["openai","anthropic","azure_openai","vertex","bedrock"]) so each provider path
is exercised; modify or add a parametrized fixture (e.g.,
external_model_name/provider_name fixtures) and update the test(s) that call
discovery/auth/egress checks to run per-provider, implementing provider-specific
expectations inside the test (use conditional branches or a small mapping of
provider->expected discovery/auth/egress behavior) so discovery, authentication,
and egress assertions are executed for Anthropic, Azure OpenAI, Vertex, and
Bedrock as well as OpenAI. Ensure tests that previously referenced
EXTERNAL_MODEL_NAME now accept the provider param and use the mapped model
identifier for each provider.
- Around line 105-170: The fixture currently calls _apply_cr(...) repeatedly
without checking results, which lets CR creation failures go unnoticed; update
the fixture in test_external_models.py so each _apply_cr call (for the Secret,
ExternalModel, MaaSModelRef, MaaSAuthPolicy, and MaaSSubscription that reference
EXTERNAL_MODEL_NAME, EXTERNAL_AUTH_POLICY, EXTERNAL_SUBSCRIPTION,
MODEL_NAMESPACE, and SUBSCRIPTION_NAMESPACE) immediately detects failure and
aborts the setup by calling pytest.fail(...) (or raising) with an explanatory
message; you can do this by checking the return value or catching exceptions
from _apply_cr and invoking pytest.fail with the resource name and error
details.
- Line 173: Replace the fixed sleep call time.sleep(RECONCILE_WAIT * 2) with a
polling-based wait that uses the existing helper _wait_for_phase(...) and
explicit existence/non-existence assertions so tests wait for a concrete
reconcile condition with a timeout; locate the occurrences of
time.sleep(RECONCILE_WAIT * 2) (also the similar uses at the other two spots)
and change them to call _wait_for_phase(resource_identifier, expected_phase,
timeout=...) and/or loop-check the resource existence until the condition is
met, failing the test if the timeout elapses.
- Line 31: The code sets EXTERNAL_ENDPOINT from E2E_EXTERNAL_ENDPOINT but
ignores the simulator variable used by CI/docs; update EXTERNAL_ENDPOINT so it
first checks E2E_EXTERNAL_ENDPOINT and then falls back to E2E_SIMULATOR_ENDPOINT
before defaulting to "httpbin.org" (i.e., use nested/os.environ.get calls or
equivalent to read both env vars), referencing the EXTERNAL_ENDPOINT symbol and
the env names E2E_EXTERNAL_ENDPOINT and E2E_SIMULATOR_ENDPOINT so cluster runs
don’t silently fall back to the wrong endpoint.
- Around line 275-282: The current egress check only ensures r.status_code not
in (401, 403) which can be a false positive; change the assertion to require a
successful HTTP 200 (e.g., assert r.status_code == 200) and then parse/validate
the response body to confirm it matches the chat-completions format expected by
the test: verify Content-Type is JSON, that the parsed JSON contains top-level
keys like "model" and "choices", that "choices" is a non-empty list, and each
choice contains a "message" with "role" and "content" (or equivalent fields your
code expects). Locate the assertion around the variable r and the log.info call
in the test_external_models test and replace the weak status check with these
stronger status and response structure validations.
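The polling pattern asked for in the `time.sleep(RECONCILE_WAIT * 2)` comment above can be sketched as a generic wait helper (the name and default values are illustrative):

```python
import time
from typing import Callable

def wait_for(condition: Callable[[], bool], timeout: float = 60.0,
             poll: float = 2.0, desc: str = "condition") -> None:
    """Poll `condition` until it returns True or the timeout elapses.
    Raises TimeoutError with a descriptive message on failure, so tests
    fail on a concrete reconcile condition rather than a fixed sleep."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll)
    raise TimeoutError(f"timed out after {timeout}s waiting for {desc}")

# Usage sketch (helper names such as _get_cr come from the test module):
# wait_for(lambda: _get_cr("maasmodelref", name, ns) is not None,
#          timeout=RECONCILE_WAIT * 2, desc="MaaSModelRef to appear")
```

Each fixed sleep then becomes an explicit wait on existence or phase, with a clear failure message when reconciliation never completes.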

---

Duplicate comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 290-341: The test_delete_removes_httproute creates temporary
ExternalModel and MaaSModelRef resources but doesn't guarantee cleanup on
assertion failure; refactor test_delete_removes_httproute to wrap the creation,
verification, and deletion logic in a try/finally block so that _delete_cr is
always called for both the MaaSModelRef and the ExternalModel (use temp_name and
MODEL_NAMESPACE), keeping the existing _apply_cr, _get_cr, and asserts inside
try and moving the two _delete_cr calls into finally to ensure cleanup even on
failures.
- Around line 187-188: The test slices api_key without checking it exists
(api_key = r.json().get("key"); log.info(f"API key created:
{api_key[:15]}...")), which will raise TypeError if "key" is missing; update the
test to validate presence first by asserting or raising a clear error when
api_key is falsy (e.g., assert api_key is not None or raise RuntimeError with
response content), then perform the safe slice for logging (or log the full
value only after validation) so failures report the missing key and response
details clearly.
- Around line 72-80: The _get_cr function currently calls
json.loads(result.stdout) unguarded, which crashes the test suite if oc returns
partial or malformed JSON; update _get_cr to catch JSON decoding errors (catch
json.JSONDecodeError) and return None (or optionally log the stdout/stderr for
debugging) when decoding fails, and ensure you still check result.returncode !=
0 before parsing; reference the _get_cr function and the json.loads call when
making the change.
- Line 36: The constant E2E_RECONCILE_WAIT is parsed with int(...) directly from
os.environ which will raise ValueError during import for non-numeric env values;
change the initialization of E2E_RECONCILE_WAIT to parse the environment value
safely by reading os.environ.get("E2E_RECONCILE_WAIT") into a variable and
converting inside a try/except (or using a safe numeric check), falling back to
the default 12 when conversion fails or the value is missing so imports won't
crash; update the usage site where E2E_RECONCILE_WAIT is referenced accordingly.
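
Taken together, the three parsing fixes above boil down to one defensive pattern: validate before you parse or slice, and fall back (or fail with context) instead of crashing at import or collection time. A standalone sketch of that pattern — the helper names are illustrative, not the module's actual code:

```python
import json
import os

def int_env(name, default):
    """Safe int parse for env vars: bad or missing values fall back."""
    try:
        return int(os.environ.get(name))
    except (TypeError, ValueError):
        return default

def parse_oc_json(returncode, stdout):
    """Guarded decode of `oc ... -o json` output: failures return None."""
    if returncode != 0:
        return None
    try:
        return json.loads(stdout)
    except json.JSONDecodeError:
        return None

def extract_key(payload):
    """Validate the key exists before slicing it for logging."""
    api_key = payload.get("key")
    if not api_key:
        raise RuntimeError(f"API key missing from response: {payload!r}")
    return api_key

os.environ["E2E_RECONCILE_WAIT"] = "not-a-number"
assert int_env("E2E_RECONCILE_WAIT", 12) == 12          # bad value -> default
assert parse_oc_json(0, "{partial") is None             # malformed JSON is swallowed
assert parse_oc_json(0, '{"kind": "HTTPRoute"}')["kind"] == "HTTPRoute"
assert extract_key({"key": "sk-0123456789abcdef"})[:15] == "sk-0123456789ab"
```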

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 2f167364-764d-48a6-8668-107e63e31d9f

📥 Commits

Reviewing files that changed from the base of the PR and between 5020b24 and 4cdd1b2.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_external_models.py

Comment on lines +505 to +525
# --- External Model E2E Tests ---
# Uncomment when BBR plugins are deployed as part of CI and the external
# simulator endpoint is accessible from the CI cluster.
# Prerequisites:
# 1. BBR image built and deployed (EnvoyFilter, NetworkPolicy, DestinationRule)
# 2. E2E_SIMULATOR_ENDPOINT set (default: 3.150.113.9)
# 3. ServiceEntry for the simulator registered in the mesh
#
# export E2E_SIMULATOR_ENDPOINT="${E2E_SIMULATOR_ENDPOINT:-3.150.113.9}"
# export E2E_SKIP_TLS_VERIFY="true"
# echo "Running external model e2e tests..."
# if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
# -v --maxfail=5 --disable-warnings \
# --junitxml="${ARTIFACTS_DIR}/e2e-external-models.xml" \
# --html="${ARTIFACTS_DIR}/e2e-external-models.html" --self-contained-html \
# --capture=tee-sys --show-capture=all --log-level=INFO \
# "$test_dir/tests/test_external_models.py" ; then
# echo "❌ ERROR: External model E2E tests failed"
# exit 1
# fi

Contributor

@coderabbitai coderabbitai bot Mar 30, 2026

⚠️ Potential issue | 🟠 Major

External-model smoke coverage is disabled by commented execution path.

This block never runs, so CI cannot catch regressions in test_external_models.py. Replace the commented section with a real feature gate (for example, RUN_EXTERNAL_MODEL_E2E=true) and execute/fail explicitly when enabled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 505 - 525, Replace the
commented external-model block with a feature-gated runtime check using an env
var (e.g. RUN_EXTERNAL_MODEL_E2E); if RUN_EXTERNAL_MODEL_E2E is true, export
defaults for E2E_SIMULATOR_ENDPOINT and E2E_SKIP_TLS_VERIFY (if not already
set), echo a start message, run pytest on test_external_models.py with the same
flags, and exit non-zero on failure so CI fails explicitly; keep the original
pytest invocation and artifact variables (ARTIFACTS_DIR, test_dir) and ensure
the script prints a clear error when tests fail.

Collaborator

I agree — it may have just been missed and left commented out :)

Contributor

@chaitanya1731, agreed — easy fix! Here's the replacement for the commented block, following the same feature-gate pattern already used by EXTERNAL_OIDC, SKIP_DEPLOYMENT, etc. in the script:

    # --- External Model E2E Tests ---
    # Set RUN_EXTERNAL_MODEL_E2E=true when BBR plugins are deployed as part of CI
    # and the external simulator endpoint is accessible from the CI cluster.
    # Prerequisites:
    #   1. BBR image built and deployed (EnvoyFilter, NetworkPolicy, DestinationRule)
    #   2. E2E_SIMULATOR_ENDPOINT set (default: 3.150.113.9)
    #   3. ServiceEntry for the simulator registered in the mesh
    if [[ "${RUN_EXTERNAL_MODEL_E2E:-false}" == "true" ]]; then
        export E2E_SIMULATOR_ENDPOINT="${E2E_SIMULATOR_ENDPOINT:-3.150.113.9}"
        export E2E_SKIP_TLS_VERIFY="true"
        echo "Running external model e2e tests (E2E_SIMULATOR_ENDPOINT=${E2E_SIMULATOR_ENDPOINT})..."
        if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
            -v --maxfail=5 --disable-warnings \
            --junitxml="${ARTIFACTS_DIR}/e2e-external-models.xml" \
            --html="${ARTIFACTS_DIR}/e2e-external-models.html" --self-contained-html \
            --capture=tee-sys --show-capture=all --log-level=INFO \
            "$test_dir/tests/test_external_models.py" ; then
            echo "❌ ERROR: External model E2E tests failed"
            exit 1
        fi
        echo "✅ External model E2E tests completed"
    else
        echo "⏭️  Skipping external model E2E tests (RUN_EXTERNAL_MODEL_E2E not set to true)"
    fi

Also add RUN_EXTERNAL_MODEL_E2E to the ENVIRONMENT VARIABLES header block at the top of the script:

#   RUN_EXTERNAL_MODEL_E2E - Run external model E2E tests (default: false)
#                            Requires BBR deployed and E2E_SIMULATOR_ENDPOINT accessible
#   E2E_SIMULATOR_ENDPOINT - External model simulator endpoint (default: 3.150.113.9)
#                            Used only when RUN_EXTERNAL_MODEL_E2E=true

This way the gate is explicit, CI fails loudly when enabled, and the default keeps current CI behaviour unchanged.

Collaborator Author

Good catch — the tests no longer depend on BBR or the simulator (they use httpbin.org now). Will uncomment so they run in CI.

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator Author

Already fixed — the external model tests are uncommented in the latest revision.

Comment on lines +39 to +40
EXTERNAL_MODEL_NAME = "e2e-external-model"

Contributor

@coderabbitai coderabbitai bot Mar 30, 2026

⚠️ Potential issue | 🟠 Major

Suite is single-provider only (openai) and does not exercise provider-specific paths.

Parameterize provider/model fixtures and run discovery/auth/egress assertions per provider to cover Anthropic, Azure OpenAI, Vertex, and Bedrock flows.

Also applies to: 122-123, 208-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 39 - 40, The test
currently only uses a single provider (openai) via EXTERNAL_MODEL_NAME; update
the test fixtures and test functions in test_external_models.py to be
parameterized over providers and corresponding model names (e.g., providers =
["openai","anthropic","azure_openai","vertex","bedrock"]) so each provider path
is exercised; modify or add a parametrized fixture (e.g.,
external_model_name/provider_name fixtures) and update the test(s) that call
discovery/auth/egress checks to run per-provider, implementing provider-specific
expectations inside the test (use conditional branches or a small mapping of
provider->expected discovery/auth/egress behavior) so discovery, authentication,
and egress assertions are executed for Anthropic, Azure OpenAI, Vertex, and
Bedrock as well as OpenAI. Ensure tests that previously referenced
EXTERNAL_MODEL_NAME now accept the provider param and use the mapped model
identifier for each provider.

Collaborator Author

By design — these tests validate MaaS egress capabilities (discovery, auth, routing), not provider-specific plugin behavior. Provider-specific testing is in the ai-gateway-payload-processing repo.

Contributor

@noyitz, understood — the architectural boundary makes sense. MaaS egress routing, auth enforcement, and discovery are provider-agnostic concerns, and provider-specific payload/plugin coverage belongs in ai-gateway-payload-processing. I'll drop this suggestion.

·.¸¸.·´¯·.¸¸.·´¯·.¸¸.·´¯·.¸¯·.¸¸.·´¯·.¸¸.·´¯`·.¸¸.·´¯)
      (\ /)
      ( . .)
      c(")(")  hops along


📝 Learnings were identified and not saved (knowledge base disabled).

Comment on lines +105 to +170
_apply_cr({
"apiVersion": "v1",
"kind": "Secret",
"metadata": {
"name": f"{EXTERNAL_MODEL_NAME}-api-key",
"namespace": MODEL_NAMESPACE,
},
"type": "Opaque",
"stringData": {"api-key": "e2e-test-key"},
})

# Create ExternalModel CR
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "ExternalModel",
"metadata": {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
"spec": {
"provider": "openai",
"endpoint": EXTERNAL_ENDPOINT,
"credentialRef": {
"name": f"{EXTERNAL_MODEL_NAME}-api-key",
"namespace": MODEL_NAMESPACE,
},
},
})

# Create MaaSModelRef
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "MaaSModelRef",
"metadata": {
"name": EXTERNAL_MODEL_NAME,
"namespace": MODEL_NAMESPACE,
"annotations": {
"maas.opendatahub.io/endpoint": EXTERNAL_ENDPOINT,
"maas.opendatahub.io/provider": "openai",
},
},
"spec": {
"modelRef": {"kind": "ExternalModel", "name": EXTERNAL_MODEL_NAME},
},
})

# Create MaaSAuthPolicy
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "MaaSAuthPolicy",
"metadata": {"name": EXTERNAL_AUTH_POLICY, "namespace": SUBSCRIPTION_NAMESPACE},
"spec": {
"modelRefs": [{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE}],
"subjects": {"groups": [{"name": "system:authenticated"}]},
},
})

# Create MaaSSubscription
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "MaaSSubscription",
"metadata": {"name": EXTERNAL_SUBSCRIPTION, "namespace": SUBSCRIPTION_NAMESPACE},
"spec": {
"owner": {"groups": [{"name": "system:authenticated"}]},
"modelRefs": [
{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
],
},
})
Contributor

@coderabbitai coderabbitai bot Mar 30, 2026

⚠️ Potential issue | 🟠 Major

Fixture ignores CR creation failures and proceeds with invalid setup state.

Every _apply_cr(...) here should be checked, and fixture setup should pytest.fail(...) immediately on failure to avoid cascading false failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 105 - 170, The fixture
currently calls _apply_cr(...) repeatedly without checking results, which lets
CR creation failures go unnoticed; update the fixture in test_external_models.py
so each _apply_cr call (for the Secret, ExternalModel, MaaSModelRef,
MaaSAuthPolicy, and MaaSSubscription that reference EXTERNAL_MODEL_NAME,
EXTERNAL_AUTH_POLICY, EXTERNAL_SUBSCRIPTION, MODEL_NAMESPACE, and
SUBSCRIPTION_NAMESPACE) immediately detects failure and aborts the setup by
calling pytest.fail(...) (or raising) with an explanatory message; you can do
this by checking the return value or catching exceptions from _apply_cr and
invoking pytest.fail with the resource name and error details.

Collaborator Author

The _apply_cr helper logs warnings on failure. If a CR fails to create, the dependent tests will fail with clear assertions. Adding pytest.fail on every apply would make the fixture brittle on retries.

Contributor

@noyitz, fair point on idempotency — oc apply re-runs are common and pytest.fail on each call would break that pattern.

One lightweight middle ground worth considering: collect failures across all _apply_cr calls and emit a single pytest.fail (or pytest.skip) at the end of setup if any failed. This preserves retry-friendliness while surfacing the root cause before tests run with a half-provisioned state:

failures = []

if not _apply_cr({...secret...}):
    failures.append(f"Secret/{EXTERNAL_MODEL_NAME}-api-key")
if not _apply_cr({...ExternalModel...}):
    failures.append(f"ExternalModel/{EXTERNAL_MODEL_NAME}")
# ... etc.

if failures:
    pytest.fail(f"CR creation failed for: {', '.join(failures)} — check oc apply output above")

This way a retry on a pre-existing cluster still works (all _apply_cr calls return True), but a genuine provisioning failure in CI gives a single, unambiguous root-cause message instead of 7 cascading assertion errors pointing at missing resources. Up to you whether that tradeoff is worth the extra lines.

( ˙▿˙)

})

# Wait for reconciler + auth propagation
time.sleep(RECONCILE_WAIT * 2)
Contributor

@coderabbitai coderabbitai bot Mar 30, 2026

⚠️ Potential issue | 🟠 Major

Fixed sleeps make reconciliation checks flaky.

Use polling with explicit conditions/timeouts (for example _wait_for_phase(...) plus existence/non-existence checks) instead of static sleeps.

Also applies to: 326-326, 334-334

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` at line 173, Replace the fixed sleep
call time.sleep(RECONCILE_WAIT * 2) with a polling-based wait that uses the
existing helper _wait_for_phase(...) and explicit existence/non-existence
assertions so tests wait for a concrete reconcile condition with a timeout;
locate the occurrences of time.sleep(RECONCILE_WAIT * 2) (also the similar uses
at the other two spots) and change them to call
_wait_for_phase(resource_identifier, expected_phase, timeout=...) and/or
loop-check the resource existence until the condition is met, failing the test
if the timeout elapses.
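
A condition-based wait of the kind this prompt describes can be small. A minimal sketch — the `wait_for_condition` name and the counter-based predicate are illustrative; the real predicate would call `_get_cr` for the HTTPRoute or check `MaaSModelRef.status.phase`:

```python
import time

def wait_for_condition(predicate, timeout=60.0, interval=0.5):
    """Poll predicate until it is truthy; raise TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Stand-in predicate: "reconciled" on the third poll.
calls = {"n": 0}

def fake_reconciled():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for_condition(fake_reconciled, timeout=5.0, interval=0.01) is True
assert calls["n"] == 3
```

With a helper like this, `time.sleep(RECONCILE_WAIT * 2)` could become a poll with `timeout=RECONCILE_WAIT * 2`: the wait returns as soon as the resource appears instead of always burning the full window, and fails loudly when it never does.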

Collaborator Author

The sleep is waiting for the reconciler to create Istio resources (HTTPRoute, Service, etc.) from the ExternalModel CR. Polling for these specific resources would add complexity without much benefit — the reconcile time is predictable in CI.

Contributor

@noyitz, understood — if reconcile latency is stable in your CI environment, the fixed sleep is reasonable. The residual risk is that under cluster load spikes or API server slowdowns the hardcoded window could still be exceeded, causing intermittent failures unrelated to the feature under test. If you ever see flakiness here, the quickest mitigation would be polling for the HTTPRoute existence (which _get_cr already supports) rather than extending the sleep value. No change needed from my side.


📝 Learnings were identified and not saved (knowledge base disabled).

Comment on lines +275 to +282
assert r.status_code not in (401, 403), (
f"Request was blocked by auth (HTTP {r.status_code}). "
f"Expected the request to reach the external endpoint."
)
# Any non-auth response confirms egress connectivity.
# httpbin.org may return 404 for unknown paths — that's fine,
# it means the request left the cluster and reached the endpoint.
log.info(f"Egress test: HTTP {r.status_code} from external endpoint")
Contributor

@coderabbitai coderabbitai bot Mar 30, 2026

⚠️ Potential issue | 🟠 Major

Egress assertion is too weak and can pass on backend failures.

status_code not in (401, 403) allows false positives (for example 500/404). Assert expected success (HTTP 200) and validate required response fields for chat-completions format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 275 - 282, The current
egress check only ensures r.status_code not in (401, 403) which can be a false
positive; change the assertion to require a successful HTTP 200 (e.g., assert
r.status_code == 200) and then parse/validate the response body to confirm it
matches the chat-completions format expected by the test: verify Content-Type is
JSON, that the parsed JSON contains top-level keys like "model" and "choices",
that "choices" is a non-empty list, and each choice contains a "message" with
"role" and "content" (or equivalent fields your code expects). Locate the
assertion around the variable r and the log.info call in the
test_external_models test and replace the weak status check with these stronger
status and response structure validations.
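
A sketch of the stricter validation this prompt asks for, assuming an OpenAI-style chat-completions body — the payload below is faked; the real one would come from `r.json()`:

```python
def validate_chat_completion(status_code, body):
    # Require an actual success, not merely "not blocked by auth".
    assert status_code == 200, f"expected 200, got {status_code}"
    # Minimal chat-completions shape: model + non-empty choices with messages.
    assert "model" in body, "missing 'model' field"
    choices = body.get("choices")
    assert isinstance(choices, list) and choices, "missing or empty 'choices'"
    for choice in choices:
        message = choice.get("message", {})
        assert "role" in message and "content" in message, "malformed message"

fake_body = {
    "model": "e2e-external-model",
    "choices": [{"message": {"role": "assistant", "content": "ok"}}],
}
validate_chat_completion(200, fake_body)  # passes silently on a valid body
```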

Collaborator Author

The test confirms the request passed auth and reached the external endpoint. httpbin.org may return various codes depending on the path. The key assertion is that auth didn't block it (not 401/403). A 404 from httpbin still proves egress works.

Contributor

@noyitz, that reasoning is sound for the httpbin case — a 404 from httpbin definitively proves the request left the cluster. The one residual ambiguity is gateway-originated 5xx errors (502/503/504): those can satisfy not in (401, 403) even when the request never reached the external endpoint (e.g., the egress rule is misconfigured and the gateway itself is timing out). A simple guard like assert r.status_code != 502, "Gateway bad-gateway — request may not have reached external endpoint" could differentiate those. That said, if the intent is strictly "auth didn't block it" and you're comfortable treating 5xx as an acceptable egress signal for now, I understand the trade-off.

(╯°□°)╯

Comment on lines +505 to +525
# --- External Model E2E Tests ---
# Uncomment when BBR plugins are deployed as part of CI and the external
# simulator endpoint is accessible from the CI cluster.
# Prerequisites:
# 1. BBR image built and deployed (EnvoyFilter, NetworkPolicy, DestinationRule)
# 2. E2E_SIMULATOR_ENDPOINT set (default: 3.150.113.9)
# 3. ServiceEntry for the simulator registered in the mesh
#
# export E2E_SIMULATOR_ENDPOINT="${E2E_SIMULATOR_ENDPOINT:-3.150.113.9}"
# export E2E_SKIP_TLS_VERIFY="true"
# echo "Running external model e2e tests..."
# if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
# -v --maxfail=5 --disable-warnings \
# --junitxml="${ARTIFACTS_DIR}/e2e-external-models.xml" \
# --html="${ARTIFACTS_DIR}/e2e-external-models.html" --self-contained-html \
# --capture=tee-sys --show-capture=all --log-level=INFO \
# "$test_dir/tests/test_external_models.py" ; then
# echo "❌ ERROR: External model E2E tests failed"
# exit 1
# fi

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. may be just missed to uncomment :)

"$test_dir/tests/test_namespace_scoping.py" \
"$test_dir/tests/test_subscription.py" \
"$test_dir/tests/test_models_endpoint.py" \
"$test_dir/tests/test_external_oidc.py" ; then
Collaborator

I think we still need these two tests.

Collaborator Author

The test_external_oidc.py removal came from upstream main during rebase (PR #598 restructured OIDC handling). It's not a change from this PR — our diff only adds the external model tests.

Contributor

Are you sure? I just viewed the file in your repo and I don't see this change. Can we make sure to merge main so this test doesn't get lost?

[screenshot of the file view]

@noyitz noyitz force-pushed the feat/external-model-e2e-tests branch from faba622 to 0027a79 Compare March 31, 2026 00:10
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
test/e2e/tests/test_external_models.py (2)

261-282: ⚠️ Potential issue | 🟠 Major

This does not prove the request reached the external service.

The test name/docstring say “returns 200”, but the assertion accepts any non-401/403 response. 404, 500, or 503 from the gateway/backend all pass here, so forwarding can be broken while the test stays green.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 261 - 282, The test
test_request_forwarded_returns_200 currently only checks that the response is
not 401/403 which doesn't prove the request reached the external service; change
the assertion to require an actual successful egress response (e.g. assert
r.status_code == 200) or explicitly allow only known valid upstream statuses
(e.g. 200 or an expected 404 from the external host) and update the test
docstring/name if you broaden the accepted statuses; modify the assertion around
the requests.post call that uses external_models_setup and EXTERNAL_MODEL_NAME
to enforce the specific expected status codes.

172-173: ⚠️ Potential issue | 🟠 Major

Replace these sleeps with condition-based waits.

Line 173 and Lines 327-335 are still guessing reconciliation timing. On a slow cluster the assertions fire too early; on a fast cluster the suite just burns time. Poll MaaSModelRef.status.phase plus HTTPRoute existence/non-existence instead of sleeping on a stopwatch.

Also applies to: 325-335

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 172 - 173, Replace blind
time.sleep calls (e.g., the RECONCILE_WAIT * 2 at the top and the block around
lines 327-335) with condition-based polling: implement a short-poll loop with a
global timeout that repeatedly fetches the MaaSModelRef resource and asserts on
MaaSModelRef.status.phase reaching the expected value and simultaneously checks
Kubernetes for the presence or absence of the corresponding HTTPRoute resource;
fail if the timeout elapses. Use or add a reusable helper (e.g.,
wait_for_condition or poll_until) called from the test to encapsulate retries,
sleeps, and timeout, and reference the resource by the same names used in the
test (MaaSModelRef and HTTPRoute) so the assertions fire only when the
reconciler/auth propagation actually completed.
test/e2e/scripts/prow_run_smoke_test.sh (1)

492-500: ⚠️ Potential issue | 🟠 Major

test_external_oidc.py is no longer executed.

This script still wires EXTERNAL_OIDC/OIDC_*, and test_external_oidc.py already self-skips unless EXTERNAL_OIDC=true. Dropping it from the pytest list silently removes that coverage instead of gating it.

Suggested fix
     if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
         -v --maxfail=5 --disable-warnings \
         --junitxml="$xml" \
         --html="$html" --self-contained-html \
         --capture=tee-sys --show-capture=all --log-level=INFO \
         "$test_dir/tests/test_api_keys.py" \
         "$test_dir/tests/test_namespace_scoping.py" \
         "$test_dir/tests/test_subscription.py" \
-        "$test_dir/tests/test_models_endpoint.py" ; then
+        "$test_dir/tests/test_models_endpoint.py" \
+        "$test_dir/tests/test_external_oidc.py" ; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 492 - 500, The pytest
invocation in prow_run_smoke_test.sh was changed to remove
test_external_oidc.py, which means EXTERNAL_OIDC/OIDC_*-gated coverage is no
longer run; restore "$test_dir/tests/test_external_oidc.py" to the list of tests
passed to pytest (the same block that currently lists test_api_keys.py,
test_namespace_scoping.py, test_subscription.py, test_models_endpoint.py) so
that test_external_oidc.py is executed (it will self-skip when EXTERNAL_OIDC is
not true); update the pytest command in the shell script accordingly to include
that filename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/tests/test_external_models.py`:
- Line 33: The test is reading the wrong env var: update SUBSCRIPTION_NAMESPACE
in test/e2e/tests/test_external_models.py to use
os.environ.get("MAAS_SUBSCRIPTION_NAMESPACE", "models-as-a-service") (instead of
"E2E_SUBSCRIPTION_NAMESPACE") so MaaSAuthPolicy and MaaSSubscription are created
in the same namespace the smoke script exports; keep the same default fallback
value and adjust any references to SUBSCRIPTION_NAMESPACE accordingly.
- Around line 97-203: The external_models_setup fixture can leak resources if an
exception occurs before yield (e.g., when pytest.fail() is called after API key
creation fails); wrap the setup sequence in a try/finally (or register a
finalizer with pytest.fixture.addfinalizer) so that cleanup calls (_delete_cr,
_patch_cr for "maasmodelref", and deletion of the Secret, ExternalModel,
MaaSAuthPolicy, MaaSSubscription) always run even if external_models_setup
raises before yield; locate and protect the block that calls _apply_cr, the
requests.post that may call pytest.fail, and the yield within the
external_models_setup function to ensure the finalizer/finally executes
unconditionally.
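
The `addfinalizer` variant registers cleanup as each resource is created, so teardown runs even when setup raises before `yield`. A self-contained sketch — `FakeRequest` stands in for pytest's real `request` fixture object, and the resource names are illustrative:

```python
deleted = []

class FakeRequest:
    """Stand-in for pytest's `request`: collects finalizers, runs them LIFO."""
    def __init__(self):
        self._finalizers = []

    def addfinalizer(self, fn):
        self._finalizers.append(fn)

    def teardown(self):
        for fn in reversed(self._finalizers):
            fn()

def external_models_setup(request):
    # Register cleanup immediately after each (simulated) creation...
    request.addfinalizer(lambda: deleted.append("secret"))
    request.addfinalizer(lambda: deleted.append("externalmodel"))
    # ...so a failure mid-setup still leaves the finalizers registered.
    raise RuntimeError("setup failed after creating two resources")

req = FakeRequest()
try:
    external_models_setup(req)
except RuntimeError:
    pass
req.teardown()  # pytest would run these finalizers despite the setup failure
assert deleted == ["externalmodel", "secret"]
```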

---

Duplicate comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 492-500: The pytest invocation in prow_run_smoke_test.sh was
changed to remove test_external_oidc.py, which means EXTERNAL_OIDC/OIDC_*-gated
coverage is no longer run; restore "$test_dir/tests/test_external_oidc.py" to
the list of tests passed to pytest (the same block that currently lists
test_api_keys.py, test_namespace_scoping.py, test_subscription.py,
test_models_endpoint.py) so that test_external_oidc.py is executed (it will
self-skip when EXTERNAL_OIDC is not true); update the pytest command in the
shell script accordingly to include that filename.

In `@test/e2e/tests/test_external_models.py`:
- Around line 261-282: The test test_request_forwarded_returns_200 currently
only checks that the response is not 401/403 which doesn't prove the request
reached the external service; change the assertion to require an actual
successful egress response (e.g. assert r.status_code == 200) or explicitly
allow only known valid upstream statuses (e.g. 200 or an expected 404 from the
external host) and update the test docstring/name if you broaden the accepted
statuses; modify the assertion around the requests.post call that uses
external_models_setup and EXTERNAL_MODEL_NAME to enforce the specific expected
status codes.
- Around line 172-173: Replace blind time.sleep calls (e.g., the RECONCILE_WAIT
* 2 at the top and the block around lines 327-335) with condition-based polling:
implement a short-poll loop with a global timeout that repeatedly fetches the
MaaSModelRef resource and asserts on MaaSModelRef.status.phase reaching the
expected value and simultaneously checks Kubernetes for the presence or absence
of the corresponding HTTPRoute resource; fail if the timeout elapses. Use or add
a reusable helper (e.g., wait_for_condition or poll_until) called from the test
to encapsulate retries, sleeps, and timeout, and reference the resource by the
same names used in the test (MaaSModelRef and HTTPRoute) so the assertions fire
only when the reconciler/auth propagation actually completed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 65a7046b-4beb-47f9-bc8e-571e696e4d2a

📥 Commits

Reviewing files that changed from the base of the PR and between 4cdd1b2 and 0027a79.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_external_models.py

Comment on lines +97 to +203
def external_models_setup(gateway_url, headers, api_keys_base_url):
"""
Create a single ExternalModel CR, MaaSModelRef, AuthPolicy, and
Subscription pointing to an external endpoint. Cleanup after tests.
"""
log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")

# Create a dummy secret (ExternalModel requires credentialRef)
_apply_cr({
"apiVersion": "v1",
"kind": "Secret",
"metadata": {
"name": f"{EXTERNAL_MODEL_NAME}-api-key",
"namespace": MODEL_NAMESPACE,
},
"type": "Opaque",
"stringData": {"api-key": "e2e-test-key"},
})

# Create ExternalModel CR
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "ExternalModel",
"metadata": {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
"spec": {
"provider": "openai",
"endpoint": EXTERNAL_ENDPOINT,
"credentialRef": {
"name": f"{EXTERNAL_MODEL_NAME}-api-key",
"namespace": MODEL_NAMESPACE,
},
},
})

# Create MaaSModelRef
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "MaaSModelRef",
"metadata": {
"name": EXTERNAL_MODEL_NAME,
"namespace": MODEL_NAMESPACE,
"annotations": {
"maas.opendatahub.io/endpoint": EXTERNAL_ENDPOINT,
"maas.opendatahub.io/provider": "openai",
},
},
"spec": {
"modelRef": {"kind": "ExternalModel", "name": EXTERNAL_MODEL_NAME},
},
})

# Create MaaSAuthPolicy
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "MaaSAuthPolicy",
"metadata": {"name": EXTERNAL_AUTH_POLICY, "namespace": SUBSCRIPTION_NAMESPACE},
"spec": {
"modelRefs": [{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE}],
"subjects": {"groups": [{"name": "system:authenticated"}]},
},
})

# Create MaaSSubscription
_apply_cr({
"apiVersion": "maas.opendatahub.io/v1alpha1",
"kind": "MaaSSubscription",
"metadata": {"name": EXTERNAL_SUBSCRIPTION, "namespace": SUBSCRIPTION_NAMESPACE},
"spec": {
"owner": {"groups": [{"name": "system:authenticated"}]},
"modelRefs": [
{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
],
},
})

# Wait for reconciler + auth propagation
time.sleep(RECONCILE_WAIT * 2)

# Create API key for tests
log.info("Creating API key for external model tests...")
r = requests.post(
api_keys_base_url,
headers=headers,
json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
timeout=30,
verify=TLS_VERIFY,
)
if r.status_code not in (200, 201):
pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")

api_key = r.json().get("key")
log.info(f"API key created: {api_key[:15]}...")

yield {
"api_key": api_key,
"gateway_url": gateway_url,
}

# ── Cleanup ──
log.info("Cleaning up external model test fixtures...")
_delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
_delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
_patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
{"metadata": {"finalizers": []}})
_delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
_delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
_delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
Contributor

@coderabbitai coderabbitai bot Mar 31, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For pytest 8.0.0, if a fixture implemented with yield raises before reaching the yield, does pytest execute the teardown code after yield?

💡 Result:

No, if a fixture implemented with yield raises an exception before reaching the yield statement, pytest does not execute the teardown code after the yield.


Fixture cleanup is skipped when setup fails before yield—resources leak into cluster.

At line 186, pytest.fail() raises before yield is reached. Since pytest does not execute teardown code after yield if an exception occurs before reaching it, the Secret, ExternalModel, MaaSModelRef, auth policy, and subscription are never cleaned up. This poisons retries and pollutes the cluster state.

Use pytest.fixture with addfinalizer() or wrap setup in a try/finally to ensure cleanup runs regardless of early failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 97 - 203, The
external_models_setup fixture can leak resources if an exception occurs before
yield (e.g., when pytest.fail() is called after API key creation fails); wrap
the setup sequence in a try/finally (or register a finalizer with
pytest.fixture.addfinalizer) so that cleanup calls (_delete_cr, _patch_cr for
"maasmodelref", and deletion of the Secret, ExternalModel, MaaSAuthPolicy,
MaaSSubscription) always run even if external_models_setup raises before yield;
locate and protect the block that calls _apply_cr, the requests.post that may
call pytest.fail, and the yield within the external_models_setup function to
ensure the finalizer/finally executes unconditionally.

Collaborator Author


The cleanup test already uses try/finally to prevent resource leaks. The main fixture's cleanup section runs via yield teardown, which pytest guarantees even on test failures.

Contributor


@noyitz, there's an important distinction here that the response conflates:

  1. Test failures (after yield) — pytest does guarantee the teardown block runs. ✅
  2. Fixture setup exceptions (before yield) — pytest does not execute the teardown block after yield. ✗

The concern is specifically about scenario 2. In this fixture, pytest.fail() is raised at Line 186, which is before the yield. If this path is hit, the Secret, ExternalModel, MaaSModelRef, MaaSAuthPolicy, and MaaSSubscription already applied by _apply_cr are never cleaned up.

This is confirmed by pytest's own documentation and the earlier web search result. The cleanup section after yield only executes if yield is actually reached.

A minimal fix wraps the setup in try/finally:

@pytest.fixture(scope="module")
def external_models_setup(gateway_url, headers, api_keys_base_url):
    api_key = None
    try:
        # ... all _apply_cr calls ...
        time.sleep(RECONCILE_WAIT * 2)

        r = requests.post(...)
        if r.status_code not in (200, 201):
            pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")

        api_key = r.json().get("key")
        yield {"api_key": api_key, "gateway_url": gateway_url}
    finally:
        log.info("Cleaning up external model test fixtures...")
        _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
        _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
        _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
                  {"metadata": {"finalizers": []}})
        _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
        _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
        _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)

The try/finally block ensures cleanup runs unconditionally — whether the pytest.fail() path is hit during setup, a test fails after yield, or everything succeeds normally.

@noyitz noyitz force-pushed the feat/external-model-e2e-tests branch from 0027a79 to e3ae602 Compare March 31, 2026 16:48
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (4)
test/e2e/tests/test_external_models.py (4)

104-203: ⚠️ Potential issue | 🟠 Major

Make the fixture cleanup unconditional.

If any setup step fails before yield—including an _apply_cr(...) error, a requests.post(...) exception, or pytest.fail() on Lines 184-185—the teardown below is skipped and the Secret, ExternalModel, MaaSModelRef, AuthPolicy, and Subscription are left behind. Wrap the whole setup/yield in try/finally or register a finalizer before creating the first resource.

Suggested fix
 @pytest.fixture(scope="module")
 def external_models_setup(gateway_url, headers, api_keys_base_url):
     """
     Create a single ExternalModel CR, MaaSModelRef, AuthPolicy, and
     Subscription pointing to an external endpoint. Cleanup after tests.
     """
-    log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")
-
-    # Create a dummy secret (ExternalModel requires credentialRef)
-    _apply_cr({
-        ...
-    })
-
-    # Create ExternalModel CR
-    _apply_cr({
-        ...
-    })
-
-    # Create MaaSModelRef
-    _apply_cr({
-        ...
-    })
-
-    # Create MaaSAuthPolicy
-    _apply_cr({
-        ...
-    })
-
-    # Create MaaSSubscription
-    _apply_cr({
-        ...
-    })
-
-    # Wait for reconciler + auth propagation
-    time.sleep(RECONCILE_WAIT * 2)
-
-    # Create API key for tests
-    log.info("Creating API key for external model tests...")
-    r = requests.post(
-        api_keys_base_url,
-        headers=headers,
-        json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
-        timeout=30,
-        verify=TLS_VERIFY,
-    )
-    if r.status_code not in (200, 201):
-        pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
-
-    api_key = r.json().get("key")
-    log.info(f"API key created: {api_key[:15]}...")
-
-    yield {
-        "api_key": api_key,
-        "gateway_url": gateway_url,
-    }
-
-    # ── Cleanup ──
-    log.info("Cleaning up external model test fixtures...")
-    _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
-    _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
-    _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
-              {"metadata": {"finalizers": []}})
-    _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
-    _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
-    _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
+    try:
+        log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")
+        _apply_cr({...})
+        _apply_cr({...})
+        _apply_cr({...})
+        _apply_cr({...})
+        _apply_cr({...})
+        time.sleep(RECONCILE_WAIT * 2)
+
+        log.info("Creating API key for external model tests...")
+        r = requests.post(
+            api_keys_base_url,
+            headers=headers,
+            json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
+            timeout=30,
+            verify=TLS_VERIFY,
+        )
+        if r.status_code not in (200, 201):
+            pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
+
+        api_key = r.json().get("key")
+        yield {"api_key": api_key, "gateway_url": gateway_url}
+    finally:
+        log.info("Cleaning up external model test fixtures...")
+        _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
+        _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
+        _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
+                  {"metadata": {"finalizers": []}})
+        _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+        _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+        _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)

As per coding guidelines: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 104 - 203, The fixture
currently performs multiple _apply_cr calls, requests.post and a yield without
guaranteeing cleanup if setup fails; wrap the entire setup/yield sequence in a
try/finally (or register a pytest finalizer) so the cleanup calls (_delete_cr,
_patch_cr) always run even on exceptions or pytest.fail(), and ensure api_key
creation (requests.post) errors are allowed to bubble to the finally so
resources are removed; locate the setup block with _apply_cr, the API key
creation using requests.post, the yield returning {"api_key": api_key,
"gateway_url": gateway_url}, and the cleanup calls (_delete_cr/_patch_cr) and
move them into the finally (or into the registered finalizer) to guarantee
unconditional teardown.

274-282: ⚠️ Potential issue | 🟠 Major

A non-401/403 response does not prove egress happened.

404, 500, or 503 can still come from the gateway or an intermediate hop, so this test can pass without ever validating external routing. Assert on a response characteristic unique to the configured endpoint, or switch the test endpoint to one with a stable success contract and validate that instead.

As per coding guidelines: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 274 - 282, The current
check that r.status_code not in (401, 403) can be fooled by gateway-generated
404/500/503; update the requests.post block to target a stable httpbin endpoint
(e.g. use the /get path) and assert a response characteristic unique to that
endpoint: require r.status_code == 200 and validate the returned payload (e.g.
call r.json() and assert a field like "url" or "origin" matches expectations) so
the test proves the request reached the external service; locate the
requests.post call and variables (r, url, headers, body, TLS_VERIFY, log) and
replace the loose auth-only assertion with the precise status and payload
checks.
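
One way to tighten the assertion (a sketch assuming httpbin.org's `/get` echo contract; `gateway_url` and `auth_headers` in the trailing comment are placeholders for the existing fixture values):

```python
def looks_like_httpbin_get(status_code: int, body: dict) -> bool:
    """True only for a response matching httpbin's /get echo shape.

    A gateway-generated 404/500/503 has neither the 200 status nor the
    echoed "url"/"headers" fields, so it cannot pass this check.
    """
    return status_code == 200 and "url" in body and "headers" in body

# In the test, replace the loose `not in (401, 403)` assertion with e.g.:
# r = requests.get(f"{gateway_url}/get", headers=auth_headers,
#                  timeout=30, verify=TLS_VERIFY)
# assert looks_like_httpbin_get(r.status_code, r.json())
```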

44-53: ⚠️ Potential issue | 🟠 Major

Stop returning a soft failure from _apply_cr.

Every caller assumes the resource exists after this helper returns. Logging and returning False lets setup continue in a partial cluster state and turns the first real failure into a cascade of unrelated assertions.

Suggested fix
 def _apply_cr(cr_dict: dict):
     """Apply a Kubernetes CR from a dict."""
     result = subprocess.run(
         ["oc", "apply", "-f", "-"],
         input=json.dumps(cr_dict),
         capture_output=True, text=True,
     )
     if result.returncode != 0:
-        log.warning(f"oc apply failed: {result.stderr}")
-    return result.returncode == 0
+        meta = cr_dict.get("metadata", {})
+        ident = f"{cr_dict.get('kind')}/{meta.get('namespace', '<unknown>')}/{meta.get('name', '<unknown>')}"
+        raise RuntimeError(f"oc apply failed for {ident}: {result.stderr.strip()}")
+    return True

As per coding guidelines: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 44 - 53, The helper
_apply_cr currently logs failures and returns False which lets callers assume
the resource exists; change it to fail hard by raising an exception when oc
apply returns non-zero: after running subprocess.run in _apply_cr, if
result.returncode != 0 raise a RuntimeError or subprocess.CalledProcessError
that includes the failing command, result.returncode and result.stderr (and/or
stdout) so callers cannot proceed on partial state; keep the function name
_apply_cr and preserve the subprocess.run invocation but replace the
soft-logging path with a thrown error that surfaces the oc output.

173-173: ⚠️ Potential issue | 🟠 Major

Replace guessed sleeps with condition-based waits.

These sleeps make the suite timing-dependent: slow reconciles fail spuriously, fast reconciles just waste CI time. Poll for the actual conditions you assert later (route/service created, route removed) instead of sleeping RECONCILE_WAIT * N.

As per coding guidelines: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

Also applies to: 327-335

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` at line 173, Replace the fixed sleep
calls (time.sleep(RECONCILE_WAIT * 2)) with condition-based polling using the
existing reconcile timeout constant (RECONCILE_WAIT) and the actual checks used
later in the test (e.g., existence/absence of the route or service); implement a
small wait loop that polls the cluster/client for the expected condition
(route/service created or removed) with a reasonable poll interval and a total
timeout (like a multiple of RECONCILE_WAIT), and fail the test if the timeout
elapses; apply the same replacement for the other occurrences referenced (around
lines 327-335) so tests wait for the real condition instead of sleeping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 187-188: The test currently logs secret material via the api_key
variable (api_key = r.json().get("key") and log.info(f"API key created:
{api_key[:15]}...")); stop emitting any part of the secret to logs — replace
that log call with a non-secret placeholder or a safe identifier from the
response (e.g., use r.json().get("id") or log "API key created: <REDACTED>") so
no credential bytes are written to CI artifacts, and ensure only non-secret
fields are logged.
- Around line 72-80: The helper _get_cr currently treats any non-zero oc exit as
“not found”; change it so that when subprocess.run returns non-zero you inspect
result.stderr (and/or result.stdout) for a Kubernetes NotFound indicator (e.g.,
contains "NotFound" or "not found") and only return None in that case; for any
other error (transport/auth/CLI failure) raise an exception (RuntimeError or
subprocess.CalledProcessError) including the command, returncode and stderr so
callers can distinguish API/CLI failures from true NotFound. Ensure the change
is made inside the _get_cr function and preserves the existing successful
json.loads(result.stdout) path.
- Around line 46-50: The subprocess.run calls invoking "oc" in
test/e2e/tests/test_external_models.py currently lack a subprocess-level timeout
causing potential hangs; update each oc helper subprocess.run invocation (the
calls shown around the variables result = subprocess.run(...), the ones used to
apply CRs and any subsequent oc commands in the same test such as the calls at
lines roughly 46–50, 58–61, 66–69, 74–77) to include a sensible timeout=
parameter (e.g., timeout=30) so the Python process will raise a TimeoutExpired
instead of hanging; ensure you add the timeout argument to each subprocess.run
call that runs "oc" and handle or let the exception propagate consistently with
existing test/fixture cleanup behavior.

---

Duplicate comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 104-203: The fixture currently performs multiple _apply_cr calls,
requests.post and a yield without guaranteeing cleanup if setup fails; wrap the
entire setup/yield sequence in a try/finally (or register a pytest finalizer) so
the cleanup calls (_delete_cr, _patch_cr) always run even on exceptions or
pytest.fail(), and ensure api_key creation (requests.post) errors are allowed to
bubble to the finally so resources are removed; locate the setup block with
_apply_cr, the API key creation using requests.post, the yield returning
{"api_key": api_key, "gateway_url": gateway_url}, and the cleanup calls
(_delete_cr/_patch_cr) and move them into the finally (or into the registered
finalizer) to guarantee unconditional teardown.
- Around line 274-282: The current check that r.status_code not in (401, 403)
can be fooled by gateway-generated 404/500/503; update the requests.post block
to target a stable httpbin endpoint (e.g. use the /get path) and assert a
response characteristic unique to that endpoint: require r.status_code == 200
and validate the returned payload (e.g. call r.json() and assert a field like
"url" or "origin" matches expectations) so the test proves the request reached
the external service; locate the requests.post call and variables (r, url,
headers, body, TLS_VERIFY, log) and replace the loose auth-only assertion with
the precise status and payload checks.
- Around line 44-53: The helper _apply_cr currently logs failures and returns
False which lets callers assume the resource exists; change it to fail hard by
raising an exception when oc apply returns non-zero: after running
subprocess.run in _apply_cr, if result.returncode != 0 raise a RuntimeError or
subprocess.CalledProcessError that includes the failing command,
result.returncode and result.stderr (and/or stdout) so callers cannot proceed on
partial state; keep the function name _apply_cr and preserve the subprocess.run
invocation but replace the soft-logging path with a thrown error that surfaces
the oc output.
- Line 173: Replace the fixed sleep calls (time.sleep(RECONCILE_WAIT * 2)) with
condition-based polling using the existing reconcile timeout constant
(RECONCILE_WAIT) and the actual checks used later in the test (e.g.,
existence/absence of the route or service); implement a small wait loop that
polls the cluster/client for the expected condition (route/service created or
removed) with a reasonable poll interval and a total timeout (like a multiple of
RECONCILE_WAIT), and fail the test if the timeout elapses; apply the same
replacement for the other occurrences referenced (around lines 327-335) so tests
wait for the real condition instead of sleeping.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: d4d0a86d-2ab7-4e4e-8e71-572891bef056

📥 Commits

Reviewing files that changed from the base of the PR and between 0027a79 and e3ae602.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_external_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/scripts/prow_run_smoke_test.sh

Comment on lines +46 to +50
result = subprocess.run(
["oc", "apply", "-f", "-"],
input=json.dumps(cr_dict),
capture_output=True, text=True,
)
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l test/e2e/tests/test_external_models.py

Repository: opendatahub-io/models-as-a-service

Length of output: 120


🏁 Script executed:

cat -n test/e2e/tests/test_external_models.py | sed -n '40,80p'

Repository: opendatahub-io/models-as-a-service

Length of output: 1695


Add subprocess timeouts to all oc helper calls.

These external process calls lack timeout= parameters on subprocess.run(). If oc or the Kubernetes API server stalls, the process hangs at the OS level and the fixture never reaches cleanup, deadlocking the entire E2E job. Note that the --timeout=30s flag in _delete_cr is a command-level argument to oc delete, not a subprocess timeout.

Apply to: lines 46–50, 58–61, 66–69, 74–77.
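
A minimal sketch of the fix (a generic wrapper; the `oc` invocation in the trailing comment is illustrative):

```python
import subprocess

def run_with_timeout(cmd, input_text=None, timeout=30.0):
    """Run an external command with a Python-level bound, so a stalled
    process raises subprocess.TimeoutExpired instead of hanging the job."""
    return subprocess.run(
        cmd, input=input_text,
        capture_output=True, text=True,
        timeout=timeout,
    )

# Each oc helper would route through this, e.g.:
# run_with_timeout(["oc", "apply", "-f", "-"], input_text=json.dumps(cr_dict))
```

Unlike `oc delete --timeout=30s`, which only bounds the server-side deletion wait, `timeout=` here bounds the subprocess itself.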

🧰 Tools
🪛 Ruff (0.15.7)

[error] 47-47: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 46 - 50, The
subprocess.run calls invoking "oc" in test/e2e/tests/test_external_models.py
currently lack a subprocess-level timeout causing potential hangs; update each
oc helper subprocess.run invocation (the calls shown around the variables result
= subprocess.run(...), the ones used to apply CRs and any subsequent oc commands
in the same test such as the calls at lines roughly 46–50, 58–61, 66–69, 74–77)
to include a sensible timeout= parameter (e.g., timeout=30) so the Python
process will raise a TimeoutExpired instead of hanging; ensure you add the
timeout argument to each subprocess.run call that runs "oc" and handle or let
the exception propagate consistently with existing test/fixture cleanup
behavior.

Comment on lines +72 to +80
def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
"""Get a Kubernetes resource as dict, or None if not found."""
result = subprocess.run(
["oc", "get", kind, name, "-n", namespace, "-o", "json"],
capture_output=True, text=True,
)
if result.returncode != 0:
return None
return json.loads(result.stdout)
Contributor


⚠️ Potential issue | 🟠 Major

Do not collapse every oc get failure into “not found.”

A transport/auth/CLI error returns None here exactly like a real NotFound. That can make discovery tests report missing resources and cleanup tests falsely pass when the API server is unreachable. Return None only for NotFound; raise on everything else.

Suggested fix
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
     """Get a Kubernetes resource as dict, or None if not found."""
     result = subprocess.run(
         ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
         capture_output=True, text=True,
     )
     if result.returncode != 0:
-        return None
+        stderr = (result.stderr or "").lower()
+        if "notfound" in stderr or "not found" in stderr:
+            return None
+        raise RuntimeError(
+            f"oc get failed for {kind}/{namespace}/{name}: {result.stderr.strip() or result.stdout.strip()}"
+        )
     return json.loads(result.stdout)

As per coding guidelines: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🧰 Tools
🪛 Ruff (0.15.7)

[error] 74-74: subprocess call: check for execution of untrusted input

(S603)


[error] 75-75: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 72 - 80, The helper
_get_cr currently treats any non-zero oc exit as “not found”; change it so that
when subprocess.run returns non-zero you inspect result.stderr (and/or
result.stdout) for a Kubernetes NotFound indicator (e.g., contains "NotFound" or
"not found") and only return None in that case; for any other error
(transport/auth/CLI failure) raise an exception (RuntimeError or
subprocess.CalledProcessError) including the command, returncode and stderr so
callers can distinguish API/CLI failures from true NotFound. Ensure the change
is made inside the _get_cr function and preserves the existing successful
json.loads(result.stdout) path.

Contributor

@jland-redhat jland-redhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we revisit the possibly deleted test?

Could we also add some logic that validates we have access to httpbin.org before we run the E2E tests, and just print a warning if the connection does not work?

We may run these in a disconnected environment, and I would like to eventually add a backup plan here in case that endpoint is unavailable.

"$test_dir/tests/test_namespace_scoping.py" \
"$test_dir/tests/test_subscription.py" \
"$test_dir/tests/test_models_endpoint.py" \
"$test_dir/tests/test_external_oidc.py" ; then
Contributor


Are you sure? I just viewed the file on your repo and I don't see this change. Can we make sure to merge to main so we can be sure this test does not get lost?


The reconciler overwrites the DestinationRule on reconciliation, removing
the insecureSkipVerify patch needed for the simulator's self-signed cert.
Added an autouse fixture that re-patches before each test class.
@noyitz noyitz force-pushed the feat/external-model-e2e-tests branch from e3ae602 to 16508d6 Compare March 31, 2026 18:28
@noyitz
Collaborator Author

noyitz commented Mar 31, 2026

@jland-redhat — thanks for the review! Addressed both points:

1. Connectivity check for httpbin.org:
Added a pytestmark skip condition that checks if the external endpoint is reachable before running the tests. In disconnected environments, the tests will be skipped with a clear message instead of failing.

2. test_models_endpoint.py:
It's still in the test list — the diff looks like a removal because test_external_oidc.py (which was after it) was removed by upstream PR #598. In the updated version, test_models_endpoint.py is still there, and test_external_models.py is added right after it in the same pytest invocation. No separate test block needed.

Also merged the external model tests into the main pytest invocation (no longer a separate block with separate artifacts), so they run alongside all other tests and can't get lost.

@noyitz
Collaborator Author

noyitz commented Mar 31, 2026

@srampal — these tests validate MaaS egress capabilities (discovery, auth, routing to an external endpoint) using httpbin.org. They don't test provider-specific inference or real cloud services — that's covered by the ai-gateway-payload-processing repo's E2E tests, which test the full plugin chain (request/response translation) against a simulator for all providers (OpenAI, Anthropic, Azure, Vertex).

These MaaS tests focus on: does the ExternalModel reconciler create the right resources? Does auth block unauthenticated requests? Does a request with a valid key reach the external endpoint? Does cleanup work?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
test/e2e/tests/test_external_models.py (2)

213-214: ⚠️ Potential issue | 🟠 Major

Stop logging API key material (CWE-532).

api_key[:15] writes 15 characters of live credential to CI logs and archived artifacts. Anyone with log access gains partial secret material. Log only a constant placeholder or the key name.

Proposed fix
     api_key = r.json().get("key")
-    log.info(f"API key created: {api_key[:15]}...")
+    log.info("API key created for e2e-external-model-key")

As per coding guidelines: REVIEW PRIORITIES: 1. Security vulnerabilities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 213 - 214, The test
currently logs secret material via the api_key variable (see api_key =
r.json().get("key") and the log.info call); remove any substring of the real key
from logs and replace the log message with a constant placeholder or the key
identifier (e.g., "API key created: <REDACTED>" or "API key created for
test-user") so no live credential characters are written to CI/artifacts; update
the log.info invocation that references api_key to use the placeholder instead.

72-80: ⚠️ Potential issue | 🟠 Major

Distinguish NotFound from transport/auth errors in _get_cr.

Returning None for any non-zero exit conflates "resource doesn't exist" with "cluster unreachable" or "auth failed". Discovery tests will report missing resources and cleanup tests will falsely pass when the API server is down.

Proposed fix
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
     """Get a Kubernetes resource as dict, or None if not found."""
     result = subprocess.run(
         ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
-        capture_output=True, text=True,
+        capture_output=True, text=True, timeout=30,
     )
     if result.returncode != 0:
+        stderr = (result.stderr or "").lower()
+        if "notfound" in stderr or "not found" in stderr:
+            return None
+        raise RuntimeError(
+            f"oc get {kind}/{name} in {namespace} failed: {result.stderr.strip()}"
+        )
-        return None
     return json.loads(result.stdout)

As per coding guidelines: REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 72 - 80, The helper
_get_cr currently returns None for any non-zero exit from the subprocess.run
call, which hides transport/auth errors; change its handling so that after
running the oc command (the subprocess.run in _get_cr) you inspect
result.returncode and result.stderr: if the stderr clearly indicates the
resource is missing (e.g., contains "NotFound" or "not found"), return None, but
for any other non-zero exit treat it as an operational error (raise a
RuntimeError or re-raise a CalledProcessError with the command, returncode and
stderr) so callers can distinguish cluster/auth problems from absent resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 44-53: Subprocess calls to `oc` in helper functions can hang
because they lack a Python-level timeout; update `_apply_cr` (and similarly
`_delete_cr`, `_patch_cr`, `_get_cr`) to pass a `timeout=` argument to
`subprocess.run()` (e.g., 30s) and handle `subprocess.TimeoutExpired` by logging
the timeout error (using existing `log.warning` or similar) and returning
failure (False or non-zero) so fixtures/cleanup don’t deadlock; ensure the
behavior on normal non-zero return codes remains unchanged.
- Around line 122-229: The fixture external_models_setup can leak applied
resources if setup fails before yield; wrap the entire setup and API key
creation in a try/finally (or register a pytest request.addfinalizer) so the
cleanup block that uses _delete_cr and _patch_cr always runs even on exceptions;
specifically, perform all _apply_cr calls and the requests.post inside a try,
assign the api_key/gateway_url to a local result dict, yield that dict inside
the try, and move the existing cleanup calls (_delete_cr("maasauthpolicy", ...),
_delete_cr("maassubscription", ...), _patch_cr("maasmodelref", ...),
_delete_cr("maasmodelref", ...), _delete_cr("externalmodel", ...),
_delete_cr("secret", ...)) into the finally block so they execute whether
pytest.fail()/exceptions occur before or after yield.

---

Duplicate comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 213-214: The test currently logs secret material via the api_key
variable (see api_key = r.json().get("key") and the log.info call); remove any
substring of the real key from logs and replace the log message with a constant
placeholder or the key identifier (e.g., "API key created: <REDACTED>" or "API
key created for test-user") so no live credential characters are written to
CI/artifacts; update the log.info invocation that references api_key to use the
placeholder instead.
- Around line 72-80: The helper _get_cr currently returns None for any non-zero
exit from the subprocess.run call, which hides transport/auth errors; change its
handling so that after running the oc command (the subprocess.run in _get_cr)
you inspect result.returncode and result.stderr: if the stderr clearly indicates
the resource is missing (e.g., contains "NotFound" or "not found"), return None,
but for any other non-zero exit treat it as an operational error (raise a
RuntimeError or re-raise a CalledProcessError with the command, returncode and
stderr) so callers can distinguish cluster/auth problems from absent resources.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 74438079-7a7a-49bc-848c-81d41c353a1e

📥 Commits

Reviewing files that changed from the base of the PR and between e3ae602 and 16508d6.

📒 Files selected for processing (2)
  • test/e2e/scripts/prow_run_smoke_test.sh
  • test/e2e/tests/test_external_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/scripts/prow_run_smoke_test.sh

Comment on lines +44 to +53
def _apply_cr(cr_dict: dict):
    """Apply a Kubernetes CR from a dict."""
    result = subprocess.run(
        ["oc", "apply", "-f", "-"],
        input=json.dumps(cr_dict),
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        log.warning(f"oc apply failed: {result.stderr}")
    return result.returncode == 0
⚠️ Potential issue | 🟠 Major

Add subprocess timeout to prevent CI job deadlock.

All subprocess.run() calls to oc lack a timeout= parameter. If the Kubernetes API server stalls or oc hangs, the fixture never reaches cleanup and the entire E2E job deadlocks. The --timeout=30s flag in _delete_cr is a command-level argument to oc delete, not a Python subprocess timeout.

Proposed fix for all helpers
 def _apply_cr(cr_dict: dict):
     """Apply a Kubernetes CR from a dict."""
     result = subprocess.run(
         ["oc", "apply", "-f", "-"],
         input=json.dumps(cr_dict),
-        capture_output=True, text=True,
+        capture_output=True, text=True, timeout=60,
     )

Apply similar changes to _delete_cr, _patch_cr, and _get_cr.

🧰 Tools
🪛 Ruff (0.15.7)

[error] 47-47: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 44 - 53, Subprocess
calls to `oc` in helper functions can hang because they lack a Python-level
timeout; update `_apply_cr` (and similarly `_delete_cr`, `_patch_cr`, `_get_cr`)
to pass a `timeout=` argument to `subprocess.run()` (e.g., 30s) and handle
`subprocess.TimeoutExpired` by logging the timeout error (using existing
`log.warning` or similar) and returning failure (False or non-zero) so
fixtures/cleanup don’t deadlock; ensure the behavior on normal non-zero return
codes remains unchanged.

Comment on lines +122 to +229
@pytest.fixture(scope="module")
def external_models_setup(gateway_url, headers, api_keys_base_url):
    """
    Create a single ExternalModel CR, MaaSModelRef, AuthPolicy, and
    Subscription pointing to an external endpoint. Cleanup after tests.
    """
    log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")

    # Create a dummy secret (ExternalModel requires credentialRef)
    _apply_cr({
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"{EXTERNAL_MODEL_NAME}-api-key",
            "namespace": MODEL_NAMESPACE,
        },
        "type": "Opaque",
        "stringData": {"api-key": "e2e-test-key"},
    })

    # Create ExternalModel CR
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "ExternalModel",
        "metadata": {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
        "spec": {
            "provider": "openai",
            "endpoint": EXTERNAL_ENDPOINT,
            "credentialRef": {
                "name": f"{EXTERNAL_MODEL_NAME}-api-key",
                "namespace": MODEL_NAMESPACE,
            },
        },
    })

    # Create MaaSModelRef
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSModelRef",
        "metadata": {
            "name": EXTERNAL_MODEL_NAME,
            "namespace": MODEL_NAMESPACE,
            "annotations": {
                "maas.opendatahub.io/endpoint": EXTERNAL_ENDPOINT,
                "maas.opendatahub.io/provider": "openai",
            },
        },
        "spec": {
            "modelRef": {"kind": "ExternalModel", "name": EXTERNAL_MODEL_NAME},
        },
    })

    # Create MaaSAuthPolicy
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSAuthPolicy",
        "metadata": {"name": EXTERNAL_AUTH_POLICY, "namespace": SUBSCRIPTION_NAMESPACE},
        "spec": {
            "modelRefs": [{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE}],
            "subjects": {"groups": [{"name": "system:authenticated"}]},
        },
    })

    # Create MaaSSubscription
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSSubscription",
        "metadata": {"name": EXTERNAL_SUBSCRIPTION, "namespace": SUBSCRIPTION_NAMESPACE},
        "spec": {
            "owner": {"groups": [{"name": "system:authenticated"}]},
            "modelRefs": [
                {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
            ],
        },
    })

    # Wait for reconciler + auth propagation
    time.sleep(RECONCILE_WAIT * 2)

    # Create API key for tests
    log.info("Creating API key for external model tests...")
    r = requests.post(
        api_keys_base_url,
        headers=headers,
        json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
        timeout=30,
        verify=TLS_VERIFY,
    )
    if r.status_code not in (200, 201):
        pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")

    api_key = r.json().get("key")
    log.info(f"API key created: {api_key[:15]}...")

    yield {
        "api_key": api_key,
        "gateway_url": gateway_url,
    }

    # ── Cleanup ──
    log.info("Cleaning up external model test fixtures...")
    _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
    _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
    _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
              {"metadata": {"finalizers": []}})
    _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
    _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
    _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
⚠️ Potential issue | 🟠 Major

Fixture leaks resources if setup fails before yield.

If pytest.fail() at line 211 is triggered (API key creation fails), the Secret, ExternalModel, MaaSModelRef, MaaSAuthPolicy, and MaaSSubscription already applied by _apply_cr are never cleaned up. Pytest does not execute teardown code after yield when an exception occurs before reaching yield.

Proposed fix using try/finally
 @pytest.fixture(scope="module")
 def external_models_setup(gateway_url, headers, api_keys_base_url):
     log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")
+    api_key = None
+    try:
+        # Create a dummy secret (ExternalModel requires credentialRef)
+        _apply_cr({...})
+        # ... all other _apply_cr calls ...
 
-    # Wait for reconciler + auth propagation
-    time.sleep(RECONCILE_WAIT * 2)
+        time.sleep(RECONCILE_WAIT * 2)
 
-    # Create API key for tests
-    log.info("Creating API key for external model tests...")
-    r = requests.post(...)
-    if r.status_code not in (200, 201):
-        pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
+        r = requests.post(...)
+        if r.status_code not in (200, 201):
+            pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
 
-    api_key = r.json().get("key")
-    log.info(f"API key created: {api_key[:15]}...")
+        api_key = r.json().get("key")
+        log.info("API key created for e2e-external-model-key")
 
-    yield {...}
+        yield {"api_key": api_key, "gateway_url": gateway_url}
+    finally:
+        log.info("Cleaning up external model test fixtures...")
+        _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
+        _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
+        _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
+                  {"metadata": {"finalizers": []}})
+        _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+        _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+        _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
 
-    # ── Cleanup ──
-    log.info("Cleaning up external model test fixtures...")
-    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/tests/test_external_models.py` around lines 122 - 229, The fixture
external_models_setup can leak applied resources if setup fails before yield;
wrap the entire setup and API key creation in a try/finally (or register a
pytest request.addfinalizer) so the cleanup block that uses _delete_cr and
_patch_cr always runs even on exceptions; specifically, perform all _apply_cr
calls and the requests.post inside a try, assign the api_key/gateway_url to a
local result dict, yield that dict inside the try, and move the existing cleanup
calls (_delete_cr("maasauthpolicy", ...), _delete_cr("maassubscription", ...),
_patch_cr("maasmodelref", ...), _delete_cr("maasmodelref", ...),
_delete_cr("externalmodel", ...), _delete_cr("secret", ...)) into the finally
block so they execute whether pytest.fail()/exceptions occur before or after
yield.
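The fix is the standard try/finally shape for yield fixtures. A minimal standalone sketch of that shape is below, using `contextlib.contextmanager` in place of `pytest.fixture` so it runs without pytest; the `create`/`delete` callables are hypothetical stand-ins for `_apply_cr`/`_delete_cr`:

```python
import contextlib


@contextlib.contextmanager
def external_models_setup_sketch(create, delete, fail_setup=False):
    """Setup/teardown shape where cleanup always runs, even if setup raises."""
    created = []
    try:
        for name in ("secret", "externalmodel", "maasmodelref"):
            create(name)
            created.append(name)
        if fail_setup:
            # Stands in for pytest.fail() on API-key creation failure
            raise RuntimeError("Failed to create API key")
        yield {"api_key": "<REDACTED>"}
    finally:
        # Delete only what was actually created, in reverse order
        for name in reversed(created):
            delete(name)
```

Because the `finally` block runs when the exception unwinds the generator frame, the three already-applied resources are deleted whether the failure happens before or after `yield`; the same holds for a pytest yield fixture written this way.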

@openshift-ci

openshift-ci bot commented Mar 31, 2026

@noyitz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-test-maas | 16508d6 | link | false | /test e2e-test-maas |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
