feat: add E2E tests for external models (egress) #632
noyitz wants to merge 3 commits into opendatahub-io:main
Conversation
[APPROVAL NOTIFIER] This PR is APPROVED. This pull-request has been approved by: noyitz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds a new end-to-end pytest module.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 5
🧹 Nitpick comments (1)
test/e2e/tests/test_external_models.py (1)
228-231: Forcibly clearing finalizers bypasses controller cleanup. Patching finalizers: [] before deletion prevents the reconciler from running its cleanup logic. This could leave orphaned HTTPRoutes or Services if the test environment is reused. Consider deleting without finalizer removal first, with a short timeout, and only force-remove finalizers if deletion hangs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 228 - 231, The test currently force-clears finalizers on the maasmodelref via _patch_cr(p["name"], MODEL_NAMESPACE, {"metadata": {"finalizers": []}}) before calling _delete_cr, which bypasses controller cleanup; change the flow to call _delete_cr("maasmodelref", p["name"], MODEL_NAMESPACE) first and wait briefly for the resource to be removed (use an existing wait/poll helper or short timeout), and only if the resource still exists after the timeout call _patch_cr to clear finalizers and then _delete_cr again; keep references to the same identifiers (p["name"], MODEL_NAMESPACE, _delete_cr, _patch_cr) so the logic is easy to locate and review.
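The delete-first, force-finalizers-only-on-timeout flow suggested above can be sketched as follows. The `delete`, `exists`, and `clear_finalizers` callables are hypothetical stand-ins for the test's `_delete_cr`, `_get_cr`, and `_patch_cr` helpers, and the timeout values are illustrative:

```python
import time


def delete_with_finalizer_fallback(delete, exists, clear_finalizers,
                                   timeout=10.0, interval=0.5):
    """Delete a CR normally first; force-remove finalizers only if deletion hangs.

    delete/exists/clear_finalizers stand in for the test's _delete_cr,
    _get_cr, and _patch_cr helpers; the timeout values are illustrative.
    """
    delete()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not exists():
            return  # controller finished its own cleanup
        time.sleep(interval)
    clear_finalizers()  # last resort: bypass the stuck finalizer
    delete()
```

This keeps the controller's cleanup path exercised in the common case, while still unblocking teardown when a finalizer wedges.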
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 344-368: The test test_rate_limit_returns_429 is flaky because it
assumes ~10 tokens per response; update the test to be deterministic by either
configuring the external_models_setup fixture to return a fixed token usage for
the chosen provider (PROVIDERS[1]) or, if fixture changes are not possible,
increase the number of requests and add simple retry/backoff logic in the
for-loop that calls the endpoint (use a larger max iterations and sleep between
batches) until a 429 is observed or a higher threshold is reached; modify the
loop that posts to url and the final assertion to reflect the new max attempts
and retry behavior so the test reliably observes a 429 without relying on
assumed token counts.
- Line 38: RECONCILE_WAIT is set using int(os.environ.get(...)) which will raise
a ValueError if E2E_RECONCILE_WAIT contains non-numeric text; update the
initialization in test_external_models.py to safely parse the env var (e.g.,
read the string, try to convert with int() inside a try/except, or use
str.isdigit() validation) and fall back to the default 12 (or log a warning)
when conversion fails so test collection doesn’t crash.
- Around line 214-215: The code logs a slice of api_key without checking for
None, which will raise TypeError if the response lacks "key"; update the test
around the api_key assignment (variable api_key and the log.info call) to first
assert or check that api_key is not None (or truthy), and if it is missing fail
the test with a clear message or log the full response before slicing; only
perform api_key[:15] after confirming api_key exists so the log.info call is
safe.
- Around line 376-424: The test_delete_removes_httproute test can leak the
temporary ExternalModel if an assertion fails because the cleanup call
_delete_cr("externalmodel", temp_name, MODEL_NAMESPACE) is after assertions;
wrap the main test logic (creation, waits, assertions, deletion of MaaSModelRef)
in a try/finally and move the ExternalModel deletion into the finally block (or
use a fixture that yields and always deletes the created ExternalModel) so that
_delete_cr("externalmodel", temp_name, MODEL_NAMESPACE) always runs even when
assertions in test_delete_removes_httproute fail.
- Around line 80-88: The _get_cr function currently calls
json.loads(result.stdout) without guarding against malformed JSON; wrap the JSON
parsing in a try/except catching json.JSONDecodeError (and optionally ValueError
for older Python) and return None (or log the parsing error) when decoding fails
so a non-JSON or partial output from oc doesn't raise; update the function
around the json.loads call to handle the exception and reference _get_cr and
json.loads/JSONDecodeError in your fix.
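For the rate-limit flakiness called out above, a retry/backoff loop of the suggested shape might look like the sketch below. The `send_request` callable, batch size, and attempt cap are all illustrative assumptions, not values from the PR:

```python
import time


def wait_for_429(send_request, max_attempts=60, batch_size=10, pause=1.0):
    """Issue requests in batches until a 429 is observed or the attempt
    budget is exhausted; returns True iff a 429 was seen."""
    attempts = 0
    while attempts < max_attempts:
        for _ in range(min(batch_size, max_attempts - attempts)):
            attempts += 1
            if send_request() == 429:  # rate limit finally triggered
                return True
        time.sleep(pause)  # back off so token counters can accumulate
    return False
```

The test's final assertion would then check `wait_for_429(...) is True` instead of relying on an assumed per-response token count.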
---
Nitpick comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 228-231: The test currently force-clears finalizers on the
maasmodelref via _patch_cr(p["name"], MODEL_NAMESPACE, {"metadata":
{"finalizers": []}}) before calling _delete_cr, which bypasses controller
cleanup; change the flow to call _delete_cr("maasmodelref", p["name"],
MODEL_NAMESPACE) first and wait briefly for the resource to be removed (use an
existing wait/poll helper or short timeout), and only if the resource still
exists after the timeout call _patch_cr to clear finalizers and then _delete_cr
again; keep references to the same identifiers (p["name"], MODEL_NAMESPACE,
_delete_cr, _patch_cr) so the logic is easy to locate and review.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 797db861-a123-432e-bc0c-32fe0a5ea895
📒 Files selected for processing (1)
test/e2e/tests/test_external_models.py
SUBSCRIPTION_NAMESPACE = os.environ.get("E2E_SUBSCRIPTION_NAMESPACE", "models-as-a-service")
EXTERNAL_SUBSCRIPTION = os.environ.get("E2E_EXTERNAL_SUBSCRIPTION", "e2e-external-subscription")
EXTERNAL_AUTH_POLICY = os.environ.get("E2E_EXTERNAL_AUTH_POLICY", "e2e-external-access")
RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
Uncaught ValueError if E2E_RECONCILE_WAIT is non-numeric.
If the environment variable contains a non-integer string, int() will raise ValueError, crashing test collection.
Proposed fix
-RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+try:
+    RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+except ValueError:
+    RECONCILE_WAIT = 12

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
try:
    RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
except ValueError:
    RECONCILE_WAIT = 12
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` at line 38, RECONCILE_WAIT is set
using int(os.environ.get(...)) which will raise a ValueError if
E2E_RECONCILE_WAIT contains non-numeric text; update the initialization in
test_external_models.py to safely parse the env var (e.g., read the string, try
to convert with int() inside a try/except, or use str.isdigit() validation) and
fall back to the default 12 (or log a warning) when conversion fails so test
collection doesn’t crash.
Still applicable — but this is an E2E test script, not production code. If the env var is wrong, crashing early with a clear ValueError is the right behavior.
@noyitz — fair point. For E2E test infrastructure, fail-fast is the right call: silently falling back to 12 could mask a misconfigured env var and produce misleading results. A hard ValueError surfaces the problem immediately at collection time. Withdrawing the suggestion.
def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
    """Get a Kubernetes resource as dict, or None if not found."""
    result = subprocess.run(
        ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)
json.loads can raise JSONDecodeError on malformed output.
If oc get returns non-JSON output (e.g., partial output on timeout, error message despite exit code 0), json.loads will raise an unhandled exception.
Proposed fix
def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
"""Get a Kubernetes resource as dict, or None if not found."""
result = subprocess.run(
["oc", "get", kind, name, "-n", namespace, "-o", "json"],
capture_output=True, text=True,
)
if result.returncode != 0:
return None
- return json.loads(result.stdout)
+ try:
+ return json.loads(result.stdout)
+ except json.JSONDecodeError:
+ log.warning(f"Failed to parse JSON for {kind}/{name}: {result.stdout[:100]}")
+        return None

🧰 Tools
🪛 Ruff (0.15.7)
[error] 82-82: subprocess call: check for execution of untrusted input
(S603)
[error] 83-83: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 80 - 88, The _get_cr
function currently calls json.loads(result.stdout) without guarding against
malformed JSON; wrap the JSON parsing in a try/except catching
json.JSONDecodeError (and optionally ValueError for older Python) and return
None (or log the parsing error) when decoding fails so a non-JSON or partial
output from oc doesn't raise; update the function around the json.loads call to
handle the exception and reference _get_cr and json.loads/JSONDecodeError in
your fix.
Fair point, but _get_cr already checks returncode != 0 and returns None. If oc get exits 0 with non-JSON output, that's a cluster bug. Keeping it simple for E2E tests.
@noyitz, fair enough — oc get -o json exiting 0 with non-JSON output is indeed outside the contract of the tool, and hardening against cluster-level defects in E2E test helpers adds noise without real benefit. Acknowledged.
api_key = r.json().get("key")
log.info(f"API key created: {api_key[:15]}...")
Missing null check on api_key before slicing.
If the response JSON lacks a key field, api_key will be None, and api_key[:15] will raise TypeError. The test will fail with an unclear error.
Proposed fix
 api_key = r.json().get("key")
+if not api_key:
+    pytest.fail(f"API key response missing 'key' field: {r.json()}")
 log.info(f"API key created: {api_key[:15]}...")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
api_key = r.json().get("key")
if not api_key:
    pytest.fail(f"API key response missing 'key' field: {r.json()}")
log.info(f"API key created: {api_key[:15]}...")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 214 - 215, The code logs
a slice of api_key without checking for None, which will raise TypeError if the
response lacks "key"; update the test around the api_key assignment (variable
api_key and the log.info call) to first assert or check that api_key is not None
(or truthy), and if it is missing fail the test with a clear message or log the
full response before slicing; only perform api_key[:15] after confirming api_key
exists so the log.info call is safe.
Good catch — but the line above already calls pytest.fail() if the status code isn't 200/201, so we'd never reach this line with a bad response. `.get("key")` returning None would be a server bug worth crashing on.
@noyitz, fair point — the status-code guard does significantly reduce the probability of hitting this path with a malformed response. The residual risk is a 200/201 reply that is well-formed HTTP but has an unexpected JSON body (e.g., a schema change where the key moves to "apiKey" instead of "key"). In that case the TypeError stack trace is harder to diagnose than an explicit message, but that's a minor ergonomic concern rather than a correctness bug.
Acknowledged — happy to leave it as-is if you prefer a loud crash to signal a server contract violation.
Adds test_external_models.py with 29 tests covering:
- Discovery: MaaSModelRef, HTTPRoute, backend service creation (5 providers)
- Inference: chat completions 200 + valid OpenAI format (5 providers)
- Auth: invalid key 401, no key 401
- Rate limiting: token rate limit 429
- Cleanup: deletion removes resources via finalizer

All tests parametrized across OpenAI, Anthropic, Azure OpenAI, Vertex AI, and Bedrock providers using a configurable external simulator endpoint. Environment: E2E_SIMULATOR_ENDPOINT (default: 3.150.113.9)
Commented out until BBR plugins are deployed as part of CI. Prerequisites: BBR image, EnvoyFilter, simulator endpoint accessible.
How many of these E2E tests were also verified with the corresponding real cloud services vs. only with the simulator?
Force-pushed: 6baea2b to 4cdd1b2 (compare)
Force-pushed: 4cdd1b2 to faba622 (compare)
Actionable comments posted: 6
♻️ Duplicate comments (4)
test/e2e/tests/test_external_models.py (4)
290-341: ⚠️ Potential issue | 🟡 Minor. Temporary cleanup is not guaranteed when assertions fail.
Wrap create/assert/delete logic in try/finally so temporary MaaSModelRef and ExternalModel resources are always deleted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 290 - 341, The test_delete_removes_httproute creates temporary ExternalModel and MaaSModelRef resources but doesn't guarantee cleanup on assertion failure; refactor test_delete_removes_httproute to wrap the creation, verification, and deletion logic in a try/finally block so that _delete_cr is always called for both the MaaSModelRef and the ExternalModel (use temp_name and MODEL_NAMESPACE), keeping the existing _apply_cr, _get_cr, and asserts inside try and moving the two _delete_cr calls into finally to ensure cleanup even on failures.
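A minimal sketch of the try/finally shape being asked for, using hypothetical `create`/`verify`/`delete` callables in place of `_apply_cr`, the assertions, and `_delete_cr`:

```python
def run_with_cleanup(create, verify, delete):
    """Run create() then verify(), guaranteeing delete() runs even when
    verify() raises (e.g. a failed assert)."""
    create()
    try:
        verify()
    finally:
        delete()  # cleanup always runs, so temp CRs are not leaked
```

The same effect can be had with a yielding pytest fixture whose teardown performs the deletions.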
187-188: ⚠️ Potential issue | 🟡 Minor. API key is sliced without validating presence.
If the response omits "key", this throws TypeError before a clear failure message.
Proposed fix
 api_key = r.json().get("key")
+if not api_key:
+    pytest.fail(f"API key response missing 'key': {r.text}")
 log.info(f"API key created: {api_key[:15]}...")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 187 - 188, The test slices api_key without checking it exists (api_key = r.json().get("key"); log.info(f"API key created: {api_key[:15]}...")), which will raise TypeError if "key" is missing; update the test to validate presence first by asserting or raising a clear error when api_key is falsy (e.g., assert api_key is not None or raise RuntimeError with response content), then perform the safe slice for logging (or log the full value only after validation) so failures report the missing key and response details clearly.
72-80: ⚠️ Potential issue | 🟡 Minor. _get_cr does not handle malformed JSON output from oc get.
If stdout is partial/non-JSON, json.loads raises and aborts the suite.
Proposed fix
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
 @@
-    return json.loads(result.stdout)
+    try:
+        return json.loads(result.stdout)
+    except json.JSONDecodeError:
+        log.warning(f"Failed to parse JSON for {kind}/{name}: {result.stdout[:200]}")
+        return None

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 72 - 80, The _get_cr function currently calls json.loads(result.stdout) unguarded, which crashes the test suite if oc returns partial or malformed JSON; update _get_cr to catch JSON decoding errors (catch json.JSONDecodeError) and return None (or optionally log the stdout/stderr for debugging) when decoding fails, and ensure you still check result.returncode != 0 before parsing; reference the _get_cr function and the json.loads call when making the change.
36: ⚠️ Potential issue | 🟡 Minor. E2E_RECONCILE_WAIT can crash test collection on non-numeric input.
A non-integer env value raises ValueError at import time.
Proposed fix
-RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+try:
+    RECONCILE_WAIT = int(os.environ.get("E2E_RECONCILE_WAIT", "12"))
+except ValueError:
+    RECONCILE_WAIT = 12

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` at line 36, The constant E2E_RECONCILE_WAIT is parsed with int(...) directly from os.environ which will raise ValueError during import for non-numeric env values; change the initialization of E2E_RECONCILE_WAIT to parse the environment value safely by reading os.environ.get("E2E_RECONCILE_WAIT") into a variable and converting inside a try/except (or using a safe numeric check), falling back to the default 12 when conversion fails or the value is missing so imports won't crash; update the usage site where E2E_RECONCILE_WAIT is referenced accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 505-525: Replace the commented external-model block with a
feature-gated runtime check using an env var (e.g. RUN_EXTERNAL_MODEL_E2E); if
RUN_EXTERNAL_MODEL_E2E is true, export defaults for E2E_SIMULATOR_ENDPOINT and
E2E_SKIP_TLS_VERIFY (if not already set), echo a start message, run pytest on
test_external_models.py with the same flags, and exit non-zero on failure so CI
fails explicitly; keep the original pytest invocation and artifact variables
(ARTIFACTS_DIR, test_dir) and ensure the script prints a clear error when tests
fail.
In `@test/e2e/tests/test_external_models.py`:
- Around line 39-40: The test currently only uses a single provider (openai) via
EXTERNAL_MODEL_NAME; update the test fixtures and test functions in
test_external_models.py to be parameterized over providers and corresponding
model names (e.g., providers =
["openai","anthropic","azure_openai","vertex","bedrock"]) so each provider path
is exercised; modify or add a parametrized fixture (e.g.,
external_model_name/provider_name fixtures) and update the test(s) that call
discovery/auth/egress checks to run per-provider, implementing provider-specific
expectations inside the test (use conditional branches or a small mapping of
provider->expected discovery/auth/egress behavior) so discovery, authentication,
and egress assertions are executed for Anthropic, Azure OpenAI, Vertex, and
Bedrock as well as OpenAI. Ensure tests that previously referenced
EXTERNAL_MODEL_NAME now accept the provider param and use the mapped model
identifier for each provider.
- Around line 105-170: The fixture currently calls _apply_cr(...) repeatedly
without checking results, which lets CR creation failures go unnoticed; update
the fixture in test_external_models.py so each _apply_cr call (for the Secret,
ExternalModel, MaaSModelRef, MaaSAuthPolicy, and MaaSSubscription that reference
EXTERNAL_MODEL_NAME, EXTERNAL_AUTH_POLICY, EXTERNAL_SUBSCRIPTION,
MODEL_NAMESPACE, and SUBSCRIPTION_NAMESPACE) immediately detects failure and
aborts the setup by calling pytest.fail(...) (or raising) with an explanatory
message; you can do this by checking the return value or catching exceptions
from _apply_cr and invoking pytest.fail with the resource name and error
details.
- Line 173: Replace the fixed sleep call time.sleep(RECONCILE_WAIT * 2) with a
polling-based wait that uses the existing helper _wait_for_phase(...) and
explicit existence/non-existence assertions so tests wait for a concrete
reconcile condition with a timeout; locate the occurrences of
time.sleep(RECONCILE_WAIT * 2) (also the similar uses at the other two spots)
and change them to call _wait_for_phase(resource_identifier, expected_phase,
timeout=...) and/or loop-check the resource existence until the condition is
met, failing the test if the timeout elapses.
- Line 31: The code sets EXTERNAL_ENDPOINT from E2E_EXTERNAL_ENDPOINT but
ignores the simulator variable used by CI/docs; update EXTERNAL_ENDPOINT so it
first checks E2E_EXTERNAL_ENDPOINT and then falls back to E2E_SIMULATOR_ENDPOINT
before defaulting to "httpbin.org" (i.e., use nested/os.environ.get calls or
equivalent to read both env vars), referencing the EXTERNAL_ENDPOINT symbol and
the env names E2E_EXTERNAL_ENDPOINT and E2E_SIMULATOR_ENDPOINT so cluster runs
don’t silently fall back to the wrong endpoint.
- Around line 275-282: The current egress check only ensures r.status_code not
in (401, 403) which can be a false positive; change the assertion to require a
successful HTTP 200 (e.g., assert r.status_code == 200) and then parse/validate
the response body to confirm it matches the chat-completions format expected by
the test: verify Content-Type is JSON, that the parsed JSON contains top-level
keys like "model" and "choices", that "choices" is a non-empty list, and each
choice contains a "message" with "role" and "content" (or equivalent fields your
code expects). Locate the assertion around the variable r and the log.info call
in the test_external_models test and replace the weak status check with these
stronger status and response structure validations.
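The stronger response-shape check described in the last bullet could look like the sketch below. The field names follow the OpenAI chat-completions format the suite already targets, while the helper name itself is illustrative:

```python
def assert_chat_completion_shape(payload: dict) -> None:
    """Fail loudly unless payload looks like a chat-completions response."""
    assert "model" in payload, f"missing 'model': {payload}"
    choices = payload.get("choices")
    assert isinstance(choices, list) and choices, f"bad 'choices': {payload}"
    for choice in choices:
        message = choice.get("message") or {}
        assert "role" in message and "content" in message, f"bad choice: {choice}"
```

Calling this after `assert r.status_code == 200` turns the weak "not 401/403" check into a positive assertion on both status and body structure.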
---
Duplicate comments:
In `@test/e2e/tests/test_external_models.py`:
- Around line 290-341: The test_delete_removes_httproute creates temporary
ExternalModel and MaaSModelRef resources but doesn't guarantee cleanup on
assertion failure; refactor test_delete_removes_httproute to wrap the creation,
verification, and deletion logic in a try/finally block so that _delete_cr is
always called for both the MaaSModelRef and the ExternalModel (use temp_name and
MODEL_NAMESPACE), keeping the existing _apply_cr, _get_cr, and asserts inside
try and moving the two _delete_cr calls into finally to ensure cleanup even on
failures.
- Around line 187-188: The test slices api_key without checking it exists
(api_key = r.json().get("key"); log.info(f"API key created:
{api_key[:15]}...")), which will raise TypeError if "key" is missing; update the
test to validate presence first by asserting or raising a clear error when
api_key is falsy (e.g., assert api_key is not None or raise RuntimeError with
response content), then perform the safe slice for logging (or log the full
value only after validation) so failures report the missing key and response
details clearly.
- Around line 72-80: The _get_cr function currently calls
json.loads(result.stdout) unguarded, which crashes the test suite if oc returns
partial or malformed JSON; update _get_cr to catch JSON decoding errors (catch
json.JSONDecodeError) and return None (or optionally log the stdout/stderr for
debugging) when decoding fails, and ensure you still check result.returncode !=
0 before parsing; reference the _get_cr function and the json.loads call when
making the change.
- Line 36: The constant E2E_RECONCILE_WAIT is parsed with int(...) directly from
os.environ which will raise ValueError during import for non-numeric env values;
change the initialization of E2E_RECONCILE_WAIT to parse the environment value
safely by reading os.environ.get("E2E_RECONCILE_WAIT") into a variable and
converting inside a try/except (or using a safe numeric check), falling back to
the default 12 when conversion fails or the value is missing so imports won't
crash; update the usage site where E2E_RECONCILE_WAIT is referenced accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 2f167364-764d-48a6-8668-107e63e31d9f
📒 Files selected for processing (2)
test/e2e/scripts/prow_run_smoke_test.sh
test/e2e/tests/test_external_models.py
# --- External Model E2E Tests ---
# Uncomment when BBR plugins are deployed as part of CI and the external
# simulator endpoint is accessible from the CI cluster.
# Prerequisites:
#   1. BBR image built and deployed (EnvoyFilter, NetworkPolicy, DestinationRule)
#   2. E2E_SIMULATOR_ENDPOINT set (default: 3.150.113.9)
#   3. ServiceEntry for the simulator registered in the mesh
#
# export E2E_SIMULATOR_ENDPOINT="${E2E_SIMULATOR_ENDPOINT:-3.150.113.9}"
# export E2E_SKIP_TLS_VERIFY="true"
# echo "Running external model e2e tests..."
# if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
#     -v --maxfail=5 --disable-warnings \
#     --junitxml="${ARTIFACTS_DIR}/e2e-external-models.xml" \
#     --html="${ARTIFACTS_DIR}/e2e-external-models.html" --self-contained-html \
#     --capture=tee-sys --show-capture=all --log-level=INFO \
#     "$test_dir/tests/test_external_models.py" ; then
#   echo "❌ ERROR: External model E2E tests failed"
#   exit 1
# fi
External-model smoke coverage is disabled by commented execution path.
This block never runs, so CI cannot catch regressions in test_external_models.py. Replace the commented section with a real feature gate (for example, RUN_EXTERNAL_MODEL_E2E=true) and execute/fail explicitly when enabled.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 505 - 525, Replace the
commented external-model block with a feature-gated runtime check using an env
var (e.g. RUN_EXTERNAL_MODEL_E2E); if RUN_EXTERNAL_MODEL_E2E is true, export
defaults for E2E_SIMULATOR_ENDPOINT and E2E_SKIP_TLS_VERIFY (if not already
set), echo a start message, run pytest on test_external_models.py with the same
flags, and exit non-zero on failure so CI fails explicitly; keep the original
pytest invocation and artifact variables (ARTIFACTS_DIR, test_dir) and ensure
the script prints a clear error when tests fail.
I agree. Maybe it was just missed when uncommenting :)
@chaitanya1731, agreed — easy fix! Here's the replacement for the commented block, following the same feature-gate pattern already used by EXTERNAL_OIDC, SKIP_DEPLOYMENT, etc. in the script:
# --- External Model E2E Tests ---
# Set RUN_EXTERNAL_MODEL_E2E=true when BBR plugins are deployed as part of CI
# and the external simulator endpoint is accessible from the CI cluster.
# Prerequisites:
# 1. BBR image built and deployed (EnvoyFilter, NetworkPolicy, DestinationRule)
# 2. E2E_SIMULATOR_ENDPOINT set (default: 3.150.113.9)
# 3. ServiceEntry for the simulator registered in the mesh
if [[ "${RUN_EXTERNAL_MODEL_E2E:-false}" == "true" ]]; then
export E2E_SIMULATOR_ENDPOINT="${E2E_SIMULATOR_ENDPOINT:-3.150.113.9}"
export E2E_SKIP_TLS_VERIFY="true"
echo "Running external model e2e tests (E2E_SIMULATOR_ENDPOINT=${E2E_SIMULATOR_ENDPOINT})..."
if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
-v --maxfail=5 --disable-warnings \
--junitxml="${ARTIFACTS_DIR}/e2e-external-models.xml" \
--html="${ARTIFACTS_DIR}/e2e-external-models.html" --self-contained-html \
--capture=tee-sys --show-capture=all --log-level=INFO \
"$test_dir/tests/test_external_models.py" ; then
echo "❌ ERROR: External model E2E tests failed"
exit 1
fi
echo "✅ External model E2E tests completed"
else
echo "⏭️ Skipping external model E2E tests (RUN_EXTERNAL_MODEL_E2E not set to true)"
fi

Also add RUN_EXTERNAL_MODEL_E2E to the ENVIRONMENT VARIABLES header block at the top of the script:
# RUN_EXTERNAL_MODEL_E2E - Run external model E2E tests (default: false)
# Requires BBR deployed and E2E_SIMULATOR_ENDPOINT accessible
# E2E_SIMULATOR_ENDPOINT - External model simulator endpoint (default: 3.150.113.9)
#                          Used only when RUN_EXTERNAL_MODEL_E2E=true

This way the gate is explicit, CI fails loudly when enabled, and the default keeps current CI behaviour unchanged.
Good catch — the tests no longer depend on BBR or the simulator (they use httpbin.org now). Will uncomment so they run in CI.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Already fixed — the external model tests are uncommented in the latest revision.
EXTERNAL_MODEL_NAME = "e2e-external-model"
Suite is single-provider only (openai) and does not exercise provider-specific paths.
Parameterize provider/model fixtures and run discovery/auth/egress assertions per provider to cover Anthropic, Azure OpenAI, Vertex, and Bedrock flows.
Also applies to: 122-123, 208-282
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 39 - 40, The test
currently only uses a single provider (openai) via EXTERNAL_MODEL_NAME; update
the test fixtures and test functions in test_external_models.py to be
parameterized over providers and corresponding model names (e.g., providers =
["openai","anthropic","azure_openai","vertex","bedrock"]) so each provider path
is exercised; modify or add a parametrized fixture (e.g.,
external_model_name/provider_name fixtures) and update the test(s) that call
discovery/auth/egress checks to run per-provider, implementing provider-specific
expectations inside the test (use conditional branches or a small mapping of
provider->expected discovery/auth/egress behavior) so discovery, authentication,
and egress assertions are executed for Anthropic, Azure OpenAI, Vertex, and
Bedrock as well as OpenAI. Ensure tests that previously referenced
EXTERNAL_MODEL_NAME now accept the provider param and use the mapped model
identifier for each provider.
By design — these tests validate MaaS egress capabilities (discovery, auth, routing), not provider-specific plugin behavior. Provider-specific testing is in the ai-gateway-payload-processing repo.
@noyitz, understood — the architectural boundary makes sense. MaaS egress routing, auth enforcement, and discovery are provider-agnostic concerns, and provider-specific payload/plugin coverage belongs in ai-gateway-payload-processing. I'll drop this suggestion.
(¯·.¸¸.·´¯·.¸¸.·´¯·.¸¸.·´¯·.¸¯·.¸¸.·´¯·.¸¸.·´¯`·.¸¸.·´¯)
(\ /)
( . .)
c(")(") hops along
_apply_cr({
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {
        "name": f"{EXTERNAL_MODEL_NAME}-api-key",
        "namespace": MODEL_NAMESPACE,
    },
    "type": "Opaque",
    "stringData": {"api-key": "e2e-test-key"},
})

# Create ExternalModel CR
_apply_cr({
    "apiVersion": "maas.opendatahub.io/v1alpha1",
    "kind": "ExternalModel",
    "metadata": {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
    "spec": {
        "provider": "openai",
        "endpoint": EXTERNAL_ENDPOINT,
        "credentialRef": {
            "name": f"{EXTERNAL_MODEL_NAME}-api-key",
            "namespace": MODEL_NAMESPACE,
        },
    },
})

# Create MaaSModelRef
_apply_cr({
    "apiVersion": "maas.opendatahub.io/v1alpha1",
    "kind": "MaaSModelRef",
    "metadata": {
        "name": EXTERNAL_MODEL_NAME,
        "namespace": MODEL_NAMESPACE,
        "annotations": {
            "maas.opendatahub.io/endpoint": EXTERNAL_ENDPOINT,
            "maas.opendatahub.io/provider": "openai",
        },
    },
    "spec": {
        "modelRef": {"kind": "ExternalModel", "name": EXTERNAL_MODEL_NAME},
    },
})

# Create MaaSAuthPolicy
_apply_cr({
    "apiVersion": "maas.opendatahub.io/v1alpha1",
    "kind": "MaaSAuthPolicy",
    "metadata": {"name": EXTERNAL_AUTH_POLICY, "namespace": SUBSCRIPTION_NAMESPACE},
    "spec": {
        "modelRefs": [{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE}],
        "subjects": {"groups": [{"name": "system:authenticated"}]},
    },
})

# Create MaaSSubscription
_apply_cr({
    "apiVersion": "maas.opendatahub.io/v1alpha1",
    "kind": "MaaSSubscription",
    "metadata": {"name": EXTERNAL_SUBSCRIPTION, "namespace": SUBSCRIPTION_NAMESPACE},
    "spec": {
        "owner": {"groups": [{"name": "system:authenticated"}]},
        "modelRefs": [
            {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
        ],
    },
})
Fixture ignores CR creation failures and proceeds with invalid setup state.
Every _apply_cr(...) here should be checked, and fixture setup should pytest.fail(...) immediately on failure to avoid cascading false failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 105 - 170, The fixture
currently calls _apply_cr(...) repeatedly without checking results, which lets
CR creation failures go unnoticed; update the fixture in test_external_models.py
so each _apply_cr call (for the Secret, ExternalModel, MaaSModelRef,
MaaSAuthPolicy, and MaaSSubscription that reference EXTERNAL_MODEL_NAME,
EXTERNAL_AUTH_POLICY, EXTERNAL_SUBSCRIPTION, MODEL_NAMESPACE, and
SUBSCRIPTION_NAMESPACE) immediately detects failure and aborts the setup by
calling pytest.fail(...) (or raising) with an explanatory message; you can do
this by checking the return value or catching exceptions from _apply_cr and
invoking pytest.fail with the resource name and error details.
The _apply_cr helper logs warnings on failure. If a CR fails to create, the dependent tests will fail with clear assertions. Adding pytest.fail on every apply would make the fixture brittle on retries.
@noyitz, fair point on idempotency — oc apply re-runs are common and pytest.fail on each call would break that pattern.
One lightweight middle ground worth considering: collect failures across all _apply_cr calls and emit a single pytest.fail (or pytest.skip) at the end of setup if any failed. This preserves retry-friendliness while surfacing the root cause before tests run with a half-provisioned state:
failures = []
if not _apply_cr({...secret...}):
    failures.append(f"Secret/{EXTERNAL_MODEL_NAME}-api-key")
if not _apply_cr({...ExternalModel...}):
    failures.append(f"ExternalModel/{EXTERNAL_MODEL_NAME}")
# ... etc.
if failures:
    pytest.fail(f"CR creation failed for: {', '.join(failures)} — check oc apply output above")

This way a retry on a pre-existing cluster still works (all _apply_cr calls return True), but a genuine provisioning failure in CI gives a single, unambiguous root-cause message instead of 7 cascading assertion errors pointing at missing resources. Up to you whether that tradeoff is worth the extra lines.
( ˙▿˙)
})

# Wait for reconciler + auth propagation
time.sleep(RECONCILE_WAIT * 2)
Fixed sleeps make reconciliation checks flaky.
Use polling with explicit conditions/timeouts (for example _wait_for_phase(...) plus existence/non-existence checks) instead of static sleeps.
Also applies to: 326-326, 334-334
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` at line 173, Replace the fixed sleep
call time.sleep(RECONCILE_WAIT * 2) with a polling-based wait that uses the
existing helper _wait_for_phase(...) and explicit existence/non-existence
assertions so tests wait for a concrete reconcile condition with a timeout;
locate the occurrences of time.sleep(RECONCILE_WAIT * 2) (also the similar uses
at the other two spots) and change them to call
_wait_for_phase(resource_identifier, expected_phase, timeout=...) and/or
loop-check the resource existence until the condition is met, failing the test
if the timeout elapses.
The sleep is waiting for the reconciler to create Istio resources (HTTPRoute, Service, etc.) from the ExternalModel CR. Polling for these specific resources would add complexity without much benefit — the reconcile time is predictable in CI.
@noyitz, understood — if reconcile latency is stable in your CI environment, the fixed sleep is reasonable. The residual risk is that under cluster load spikes or API server slowdowns the hardcoded window could still be exceeded, causing intermittent failures unrelated to the feature under test. If you ever see flakiness here, the quickest mitigation would be polling for the HTTPRoute existence (which _get_cr already supports) rather than extending the sleep value. No change needed from my side.
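If that polling mitigation is ever wanted, it could be sketched like this. This is a minimal sketch only: `poll_until` and `httproute_exists` are hypothetical helpers, not part of the existing suite, and the assumption that the HTTPRoute shares the model's name is illustrative.

```python
import subprocess
import time


def poll_until(predicate, timeout=120.0, interval=5.0):
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


def httproute_exists(name, namespace):
    """Check whether the reconciler has created the HTTPRoute yet."""
    result = subprocess.run(
        ["oc", "get", "httproute", name, "-n", namespace],
        capture_output=True, text=True,
    )
    return result.returncode == 0


# Hypothetical usage in the fixture, replacing the fixed sleep:
# assert poll_until(lambda: httproute_exists(EXTERNAL_MODEL_NAME, MODEL_NAMESPACE))
```

The generic `poll_until` keeps the timeout logic in one place, so the same helper could later cover the teardown side (polling for non-existence) as well.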
assert r.status_code not in (401, 403), (
    f"Request was blocked by auth (HTTP {r.status_code}). "
    f"Expected the request to reach the external endpoint."
)
# Any non-auth response confirms egress connectivity.
# httpbin.org may return 404 for unknown paths — that's fine,
# it means the request left the cluster and reached the endpoint.
log.info(f"Egress test: HTTP {r.status_code} from external endpoint")
Egress assertion is too weak and can pass on backend failures.
status_code not in (401, 403) allows false positives (for example 500/404). Assert expected success (HTTP 200) and validate required response fields for chat-completions format.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 275 - 282, The current
egress check only ensures r.status_code not in (401, 403) which can be a false
positive; change the assertion to require a successful HTTP 200 (e.g., assert
r.status_code == 200) and then parse/validate the response body to confirm it
matches the chat-completions format expected by the test: verify Content-Type is
JSON, that the parsed JSON contains top-level keys like "model" and "choices",
that "choices" is a non-empty list, and each choice contains a "message" with
"role" and "content" (or equivalent fields your code expects). Locate the
assertion around the variable r and the log.info call in the
test_external_models test and replace the weak status check with these stronger
status and response structure validations.
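If that stricter check were ever adopted, the structural validation could be factored into a small helper. This is a sketch only; the field names follow the OpenAI chat-completions shape, which is an assumption here since the suite currently targets httpbin.

```python
def chat_completion_problems(payload: dict) -> list:
    """Return a list of structural problems with a chat-completions-style body."""
    problems = []
    if "model" not in payload:
        problems.append("missing top-level 'model'")
    choices = payload.get("choices")
    if not isinstance(choices, list) or not choices:
        problems.append("'choices' must be a non-empty list")
        return problems
    for i, choice in enumerate(choices):
        message = choice.get("message", {})
        if "role" not in message or "content" not in message:
            problems.append(f"choice {i} lacks message.role/message.content")
    return problems


# Hypothetical usage in the test, after asserting r.status_code == 200:
# assert chat_completion_problems(r.json()) == []
```

Returning a problem list rather than asserting inside the helper keeps the test's failure message self-describing.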
The test confirms the request passed auth and reached the external endpoint. httpbin.org may return various codes depending on the path. The key assertion is that auth didn't block it (not 401/403). A 404 from httpbin still proves egress works.
@noyitz, that reasoning is sound for the httpbin case — a 404 from httpbin definitively proves the request left the cluster. The one residual ambiguity is gateway-originated 5xx errors (502/503/504): those can satisfy not in (401, 403) even when the request never reached the external endpoint (e.g., the egress rule is misconfigured and the gateway itself is timing out). A simple guard like assert r.status_code != 502, "Gateway bad-gateway — request may not have reached external endpoint" could differentiate those. That said, if the intent is strictly "auth didn't block it" and you're comfortable treating 5xx as an acceptable egress signal for now, I understand the trade-off.
(╯°□°)╯
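That three-way reading of the status code could be made explicit with a tiny classifier. A sketch only: the helper name and the exact set of 5xx codes treated as gateway-local are assumptions.

```python
def classify_egress_response(status_code: int) -> str:
    """Interpret the gateway's response for the egress test (hypothetical helper)."""
    if status_code in (401, 403):
        return "blocked_by_auth"   # the auth layer rejected the request
    if status_code in (502, 503, 504):
        return "gateway_error"     # the request may never have left the cluster
    return "egress_ok"             # any other code came back from the endpoint


# Hypothetical usage in the test:
# assert classify_egress_response(r.status_code) == "egress_ok"
```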
# --- External Model E2E Tests ---
# Uncomment when BBR plugins are deployed as part of CI and the external
# simulator endpoint is accessible from the CI cluster.
# Prerequisites:
# 1. BBR image built and deployed (EnvoyFilter, NetworkPolicy, DestinationRule)
# 2. E2E_SIMULATOR_ENDPOINT set (default: 3.150.113.9)
# 3. ServiceEntry for the simulator registered in the mesh
#
# export E2E_SIMULATOR_ENDPOINT="${E2E_SIMULATOR_ENDPOINT:-3.150.113.9}"
# export E2E_SKIP_TLS_VERIFY="true"
# echo "Running external model e2e tests..."
# if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
#   -v --maxfail=5 --disable-warnings \
#   --junitxml="${ARTIFACTS_DIR}/e2e-external-models.xml" \
#   --html="${ARTIFACTS_DIR}/e2e-external-models.html" --self-contained-html \
#   --capture=tee-sys --show-capture=all --log-level=INFO \
#   "$test_dir/tests/test_external_models.py" ; then
#   echo "❌ ERROR: External model E2E tests failed"
#   exit 1
# fi
I agree, maybe it was just missed when uncommenting :)
    "$test_dir/tests/test_namespace_scoping.py" \
    "$test_dir/tests/test_subscription.py" \
    "$test_dir/tests/test_models_endpoint.py" \
    "$test_dir/tests/test_external_oidc.py" ; then
I think we still need these two tests...
The test_external_oidc.py removal came from upstream main during rebase (PR #598 restructured OIDC handling). It's not a change from this PR — our diff only adds the external model tests.
faba622 to 0027a79
Actionable comments posted: 2
♻️ Duplicate comments (3)
test/e2e/tests/test_external_models.py (2)
261-282: ⚠️ Potential issue | 🟠 Major
This does not prove the request reached the external service.
The test name/docstring say “returns 200”, but the assertion accepts any non-401/403 response. 404, 500, or 503 from the gateway/backend all pass here, so forwarding can be broken while the test stays green.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 261 - 282, The test test_request_forwarded_returns_200 currently only checks that the response is not 401/403 which doesn't prove the request reached the external service; change the assertion to require an actual successful egress response (e.g. assert r.status_code == 200) or explicitly allow only known valid upstream statuses (e.g. 200 or an expected 404 from the external host) and update the test docstring/name if you broaden the accepted statuses; modify the assertion around the requests.post call that uses external_models_setup and EXTERNAL_MODEL_NAME to enforce the specific expected status codes.
172-173: ⚠️ Potential issue | 🟠 Major
Replace these sleeps with condition-based waits.
Line 173 and Lines 327-335 are still guessing reconciliation timing. On a slow cluster the assertions fire too early; on a fast cluster the suite just burns time. Poll MaaSModelRef.status.phase plus HTTPRoute existence/non-existence instead of sleeping on a stopwatch.
Also applies to: 325-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 172 - 173, Replace blind time.sleep calls (e.g., the RECONCILE_WAIT * 2 at the top and the block around lines 327-335) with condition-based polling: implement a short-poll loop with a global timeout that repeatedly fetches the MaaSModelRef resource and asserts on MaaSModelRef.status.phase reaching the expected value and simultaneously checks Kubernetes for the presence or absence of the corresponding HTTPRoute resource; fail if the timeout elapses. Use or add a reusable helper (e.g., wait_for_condition or poll_until) called from the test to encapsulate retries, sleeps, and timeout, and reference the resource by the same names used in the test (MaaSModelRef and HTTPRoute) so the assertions fire only when the reconciler/auth propagation actually completed.
test/e2e/scripts/prow_run_smoke_test.sh (1)
492-500: ⚠️ Potential issue | 🟠 Major
test_external_oidc.py is no longer executed.
This script still wires EXTERNAL_OIDC/OIDC_*, and test_external_oidc.py already self-skips unless EXTERNAL_OIDC=true. Dropping it from the pytest list silently removes that coverage instead of gating it.
Suggested fix
  if ! PYTHONPATH="$test_dir:${PYTHONPATH:-}" pytest \
    -v --maxfail=5 --disable-warnings \
    --junitxml="$xml" \
    --html="$html" --self-contained-html \
    --capture=tee-sys --show-capture=all --log-level=INFO \
    "$test_dir/tests/test_api_keys.py" \
    "$test_dir/tests/test_namespace_scoping.py" \
    "$test_dir/tests/test_subscription.py" \
-   "$test_dir/tests/test_models_endpoint.py" ; then
+   "$test_dir/tests/test_models_endpoint.py" \
+   "$test_dir/tests/test_external_oidc.py" ; then

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/scripts/prow_run_smoke_test.sh` around lines 492 - 500, The pytest invocation in prow_run_smoke_test.sh was changed to remove test_external_oidc.py, which means EXTERNAL_OIDC/OIDC_*-gated coverage is no longer run; restore "$test_dir/tests/test_external_oidc.py" to the list of tests passed to pytest (the same block that currently lists test_api_keys.py, test_namespace_scoping.py, test_subscription.py, test_models_endpoint.py) so that test_external_oidc.py is executed (it will self-skip when EXTERNAL_OIDC is not true); update the pytest command in the shell script accordingly to include that filename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/tests/test_external_models.py`:
- Line 33: The test is reading the wrong env var: update SUBSCRIPTION_NAMESPACE
in test/e2e/tests/test_external_models.py to use
os.environ.get("MAAS_SUBSCRIPTION_NAMESPACE", "models-as-a-service") (instead of
"E2E_SUBSCRIPTION_NAMESPACE") so MaaSAuthPolicy and MaaSSubscription are created
in the same namespace the smoke script exports; keep the same default fallback
value and adjust any references to SUBSCRIPTION_NAMESPACE accordingly.
- Around line 97-203: The external_models_setup fixture can leak resources if an
exception occurs before yield (e.g., when pytest.fail() is called after API key
creation fails); wrap the setup sequence in a try/finally (or register a
finalizer with pytest.fixture.addfinalizer) so that cleanup calls (_delete_cr,
_patch_cr for "maasmodelref", and deletion of the Secret, ExternalModel,
MaaSAuthPolicy, MaaSSubscription) always run even if external_models_setup
raises before yield; locate and protect the block that calls _apply_cr, the
requests.post that may call pytest.fail, and the yield within the
external_models_setup function to ensure the finalizer/finally executes
unconditionally.
---
Duplicate comments:
In `@test/e2e/scripts/prow_run_smoke_test.sh`:
- Around line 492-500: The pytest invocation in prow_run_smoke_test.sh was
changed to remove test_external_oidc.py, which means EXTERNAL_OIDC/OIDC_*-gated
coverage is no longer run; restore "$test_dir/tests/test_external_oidc.py" to
the list of tests passed to pytest (the same block that currently lists
test_api_keys.py, test_namespace_scoping.py, test_subscription.py,
test_models_endpoint.py) so that test_external_oidc.py is executed (it will
self-skip when EXTERNAL_OIDC is not true); update the pytest command in the
shell script accordingly to include that filename.
In `@test/e2e/tests/test_external_models.py`:
- Around line 261-282: The test test_request_forwarded_returns_200 currently
only checks that the response is not 401/403 which doesn't prove the request
reached the external service; change the assertion to require an actual
successful egress response (e.g. assert r.status_code == 200) or explicitly
allow only known valid upstream statuses (e.g. 200 or an expected 404 from the
external host) and update the test docstring/name if you broaden the accepted
statuses; modify the assertion around the requests.post call that uses
external_models_setup and EXTERNAL_MODEL_NAME to enforce the specific expected
status codes.
- Around line 172-173: Replace blind time.sleep calls (e.g., the RECONCILE_WAIT
* 2 at the top and the block around lines 327-335) with condition-based polling:
implement a short-poll loop with a global timeout that repeatedly fetches the
MaaSModelRef resource and asserts on MaaSModelRef.status.phase reaching the
expected value and simultaneously checks Kubernetes for the presence or absence
of the corresponding HTTPRoute resource; fail if the timeout elapses. Use or add
a reusable helper (e.g., wait_for_condition or poll_until) called from the test
to encapsulate retries, sleeps, and timeout, and reference the resource by the
same names used in the test (MaaSModelRef and HTTPRoute) so the assertions fire
only when the reconciler/auth propagation actually completed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 65a7046b-4beb-47f9-bc8e-571e696e4d2a
📒 Files selected for processing (2)
test/e2e/scripts/prow_run_smoke_test.sh
test/e2e/tests/test_external_models.py
def external_models_setup(gateway_url, headers, api_keys_base_url):
    """
    Create a single ExternalModel CR, MaaSModelRef, AuthPolicy, and
    Subscription pointing to an external endpoint. Cleanup after tests.
    """
    log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")

    # Create a dummy secret (ExternalModel requires credentialRef)
    _apply_cr({
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"{EXTERNAL_MODEL_NAME}-api-key",
            "namespace": MODEL_NAMESPACE,
        },
        "type": "Opaque",
        "stringData": {"api-key": "e2e-test-key"},
    })

    # Create ExternalModel CR
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "ExternalModel",
        "metadata": {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
        "spec": {
            "provider": "openai",
            "endpoint": EXTERNAL_ENDPOINT,
            "credentialRef": {
                "name": f"{EXTERNAL_MODEL_NAME}-api-key",
                "namespace": MODEL_NAMESPACE,
            },
        },
    })

    # Create MaaSModelRef
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSModelRef",
        "metadata": {
            "name": EXTERNAL_MODEL_NAME,
            "namespace": MODEL_NAMESPACE,
            "annotations": {
                "maas.opendatahub.io/endpoint": EXTERNAL_ENDPOINT,
                "maas.opendatahub.io/provider": "openai",
            },
        },
        "spec": {
            "modelRef": {"kind": "ExternalModel", "name": EXTERNAL_MODEL_NAME},
        },
    })

    # Create MaaSAuthPolicy
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSAuthPolicy",
        "metadata": {"name": EXTERNAL_AUTH_POLICY, "namespace": SUBSCRIPTION_NAMESPACE},
        "spec": {
            "modelRefs": [{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE}],
            "subjects": {"groups": [{"name": "system:authenticated"}]},
        },
    })

    # Create MaaSSubscription
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSSubscription",
        "metadata": {"name": EXTERNAL_SUBSCRIPTION, "namespace": SUBSCRIPTION_NAMESPACE},
        "spec": {
            "owner": {"groups": [{"name": "system:authenticated"}]},
            "modelRefs": [
                {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
            ],
        },
    })

    # Wait for reconciler + auth propagation
    time.sleep(RECONCILE_WAIT * 2)

    # Create API key for tests
    log.info("Creating API key for external model tests...")
    r = requests.post(
        api_keys_base_url,
        headers=headers,
        json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
        timeout=30,
        verify=TLS_VERIFY,
    )
    if r.status_code not in (200, 201):
        pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")

    api_key = r.json().get("key")
    log.info(f"API key created: {api_key[:15]}...")

    yield {
        "api_key": api_key,
        "gateway_url": gateway_url,
    }

    # ── Cleanup ──
    log.info("Cleaning up external model test fixtures...")
    _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
    _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
    _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
              {"metadata": {"finalizers": []}})
    _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
    _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
    _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
🧩 Analysis chain
🌐 Web query:
For pytest 8.0.0, if a fixture implemented with yield raises before reaching the yield, does pytest execute the teardown code after yield?
💡 Result:
No, if a fixture implemented with yield raises an exception before reaching the yield statement, pytest does not execute the teardown code after the yield.
Citations:
- 1: https://stackoverflow.com/questions/78721505/choosing-between-yield-and-addfinalizer-in-pytest-fixtures-for-teardown
- 2: https://docs.pytest.org/en/6.2.x/fixture.html?highlight=yield
- 3: https://pytest.org/latest/how-to/fixtures.html
- 4: https://docs.pytest.org/en/6.2.x/fixture.html?highlight=teardown
Fixture cleanup is skipped when setup fails before yield—resources leak into cluster.
At line 186, pytest.fail() raises before yield is reached. Since pytest does not execute teardown code after yield if an exception occurs before reaching it, the Secret, ExternalModel, MaaSModelRef, auth policy, and subscription are never cleaned up. This poisons retries and pollutes the cluster state.
Use pytest.fixture with addfinalizer() or wrap setup in a try/finally to ensure cleanup runs regardless of early failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 97 - 203, The
external_models_setup fixture can leak resources if an exception occurs before
yield (e.g., when pytest.fail() is called after API key creation fails); wrap
the setup sequence in a try/finally (or register a finalizer with
pytest.fixture.addfinalizer) so that cleanup calls (_delete_cr, _patch_cr for
"maasmodelref", and deletion of the Secret, ExternalModel, MaaSAuthPolicy,
MaaSSubscription) always run even if external_models_setup raises before yield;
locate and protect the block that calls _apply_cr, the requests.post that may
call pytest.fail, and the yield within the external_models_setup function to
ensure the finalizer/finally executes unconditionally.
The cleanup test already uses try/finally to prevent resource leaks. The main fixture's cleanup section runs via yield teardown, which pytest guarantees even on test failures.
@noyitz, there's an important distinction here that the response conflates:
- Test failures (after yield) — pytest does guarantee the teardown block runs. ✅
- Fixture setup exceptions (before yield) — pytest does not execute the teardown block after yield. ✗
The concern is specifically about scenario 2. In this fixture, pytest.fail() is raised at Line 186, which is before the yield. If this path is hit, the Secret, ExternalModel, MaaSModelRef, MaaSAuthPolicy, and MaaSSubscription already applied by _apply_cr are never cleaned up.
This is confirmed by pytest's own documentation and the earlier web search result. The cleanup section after yield only executes if yield is actually reached.
A minimal fix wraps the setup in try/finally:
@pytest.fixture(scope="module")
def external_models_setup(gateway_url, headers, api_keys_base_url):
    api_key = None
    try:
        # ... all _apply_cr calls ...
        time.sleep(RECONCILE_WAIT * 2)
        r = requests.post(...)
        if r.status_code not in (200, 201):
            pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
        api_key = r.json().get("key")
        yield {"api_key": api_key, "gateway_url": gateway_url}
    finally:
        log.info("Cleaning up external model test fixtures...")
        _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
        _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
        _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
                  {"metadata": {"finalizers": []}})
        _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
        _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
        _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)

The try/finally block ensures cleanup runs unconditionally — whether the pytest.fail() path is hit during setup, a test fails after yield, or everything succeeds normally.
0027a79 to e3ae602
Actionable comments posted: 3
♻️ Duplicate comments (4)
test/e2e/tests/test_external_models.py (4)
104-203: ⚠️ Potential issue | 🟠 Major
Make the fixture cleanup unconditional.
If any setup step fails before yield — including an _apply_cr(...) error, a requests.post(...) exception, or pytest.fail() on Lines 184-185 — the teardown below is skipped and the Secret, ExternalModel, MaaSModelRef, AuthPolicy, and Subscription are left behind. Wrap the whole setup/yield in try/finally or register a finalizer before creating the first resource.
Suggested fix
  @pytest.fixture(scope="module")
  def external_models_setup(gateway_url, headers, api_keys_base_url):
      """
      Create a single ExternalModel CR, MaaSModelRef, AuthPolicy, and
      Subscription pointing to an external endpoint. Cleanup after tests.
      """
-     log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")
-
-     # Create a dummy secret (ExternalModel requires credentialRef)
-     _apply_cr({
-         ...
-     })
-
-     # Create ExternalModel CR
-     _apply_cr({
-         ...
-     })
-
-     # Create MaaSModelRef
-     _apply_cr({
-         ...
-     })
-
-     # Create MaaSAuthPolicy
-     _apply_cr({
-         ...
-     })
-
-     # Create MaaSSubscription
-     _apply_cr({
-         ...
-     })
-
-     # Wait for reconciler + auth propagation
-     time.sleep(RECONCILE_WAIT * 2)
-
-     # Create API key for tests
-     log.info("Creating API key for external model tests...")
-     r = requests.post(
-         api_keys_base_url,
-         headers=headers,
-         json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
-         timeout=30,
-         verify=TLS_VERIFY,
-     )
-     if r.status_code not in (200, 201):
-         pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
-
-     api_key = r.json().get("key")
-     log.info(f"API key created: {api_key[:15]}...")
-
-     yield {
-         "api_key": api_key,
-         "gateway_url": gateway_url,
-     }
-
-     # ── Cleanup ──
-     log.info("Cleaning up external model test fixtures...")
-     _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
-     _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
-     _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
-               {"metadata": {"finalizers": []}})
-     _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
-     _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
-     _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
+     try:
+         log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")
+         _apply_cr({...})
+         _apply_cr({...})
+         _apply_cr({...})
+         _apply_cr({...})
+         _apply_cr({...})
+
+         time.sleep(RECONCILE_WAIT * 2)
+
+         log.info("Creating API key for external model tests...")
+         r = requests.post(
+             api_keys_base_url,
+             headers=headers,
+             json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
+             timeout=30,
+             verify=TLS_VERIFY,
+         )
+         if r.status_code not in (200, 201):
+             pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
+
+         api_key = r.json().get("key")
+         yield {"api_key": api_key, "gateway_url": gateway_url}
+     finally:
+         log.info("Cleaning up external model test fixtures...")
+         _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
+         _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
+         _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
+                   {"metadata": {"finalizers": []}})
+         _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+         _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+         _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)

As per coding guidelines: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 104 - 203, The fixture currently performs multiple _apply_cr calls, requests.post and a yield without guaranteeing cleanup if setup fails; wrap the entire setup/yield sequence in a try/finally (or register a pytest finalizer) so the cleanup calls (_delete_cr, _patch_cr) always run even on exceptions or pytest.fail(), and ensure api_key creation (requests.post) errors are allowed to bubble to the finally so resources are removed; locate the setup block with _apply_cr, the API key creation using requests.post, the yield returning {"api_key": api_key, "gateway_url": gateway_url}, and the cleanup calls (_delete_cr/_patch_cr) and move them into the finally (or into the registered finalizer) to guarantee unconditional teardown.
274-282: ⚠️ Potential issue | 🟠 Major
A non-401/403 response does not prove egress happened.
404, 500, or 503 can still come from the gateway or an intermediate hop, so this test can pass without ever validating external routing. Assert on a response characteristic unique to the configured endpoint, or switch the test endpoint to one with a stable success contract and validate that instead.
As per coding guidelines: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 274 - 282, The current check that r.status_code not in (401, 403) can be fooled by gateway-generated 404/500/503; update the requests.post block to target a stable httpbin endpoint (e.g. use the /get path) and assert a response characteristic unique to that endpoint: require r.status_code == 200 and validate the returned payload (e.g. call r.json() and assert a field like "url" or "origin" matches expectations) so the test proves the request reached the external service; locate the requests.post call and variables (r, url, headers, body, TLS_VERIFY, log) and replace the loose auth-only assertion with the precise status and payload checks.
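One way to make the assertion specific, as the comment suggests, is to check the response against httpbin's stable `/get` contract, which echoes the request URL back as JSON. A minimal sketch — the helper name and the exact fields checked are illustrative, not taken from the PR:

```python
import json

def looks_like_httpbin_get(status_code: int, body: str, expected_path: str = "/get") -> bool:
    """Heuristic: did this response really come from httpbin's /get contract?"""
    if status_code != 200:
        return False  # gateway-generated 404/500/503 fails immediately
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # HTML error pages from an intermediate hop are not httpbin JSON
    # httpbin's /get echoes the request URL back in the body
    return isinstance(payload, dict) and str(payload.get("url", "")).endswith(expected_path)
```

A test would then assert `looks_like_httpbin_get(r.status_code, r.text)` rather than merely excluding 401/403.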
44-53: ⚠️ Potential issue | 🟠 Major

Stop returning a soft failure from `_apply_cr`. Every caller assumes the resource exists after this helper returns. Logging and returning `False` lets setup continue in a partial cluster state and turns the first real failure into a cascade of unrelated assertions.

Suggested fix

```diff
 def _apply_cr(cr_dict: dict):
     """Apply a Kubernetes CR from a dict."""
     result = subprocess.run(
         ["oc", "apply", "-f", "-"],
         input=json.dumps(cr_dict),
         capture_output=True, text=True,
     )
     if result.returncode != 0:
-        log.warning(f"oc apply failed: {result.stderr}")
-    return result.returncode == 0
+        meta = cr_dict.get("metadata", {})
+        ident = f"{cr_dict.get('kind')}/{meta.get('namespace', '<unknown>')}/{meta.get('name', '<unknown>')}"
+        raise RuntimeError(f"oc apply failed for {ident}: {result.stderr.strip()}")
+    return True
```

As per coding guidelines, REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 44 - 53, The helper _apply_cr currently logs failures and returns False which lets callers assume the resource exists; change it to fail hard by raising an exception when oc apply returns non-zero: after running subprocess.run in _apply_cr, if result.returncode != 0 raise a RuntimeError or subprocess.CalledProcessError that includes the failing command, result.returncode and result.stderr (and/or stdout) so callers cannot proceed on partial state; keep the function name _apply_cr and preserve the subprocess.run invocation but replace the soft-logging path with a thrown error that surfaces the oc output.
173: ⚠️ Potential issue | 🟠 Major

Replace guessed sleeps with condition-based waits. These sleeps make the suite timing-dependent: slow reconciles fail spuriously, fast reconciles just waste CI time. Poll for the actual conditions you assert later (route/service created, route removed) instead of sleeping `RECONCILE_WAIT * N`.

As per coding guidelines, REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

Also applies to: 327-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` at line 173, Replace the fixed sleep calls (time.sleep(RECONCILE_WAIT * 2)) with condition-based polling using the existing reconcile timeout constant (RECONCILE_WAIT) and the actual checks used later in the test (e.g., existence/absence of the route or service); implement a small wait loop that polls the cluster/client for the expected condition (route/service created or removed) with a reasonable poll interval and a total timeout (like a multiple of RECONCILE_WAIT), and fail the test if the timeout elapses; apply the same replacement for the other occurrences referenced (around lines 327-335) so tests wait for the real condition instead of sleeping.
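The condition-based wait this finding asks for can be sketched generically; `wait_for` is a hypothetical helper name, not part of the PR:

```python
import time

def wait_for(condition, timeout: float = 60.0, interval: float = 2.0, desc: str = "condition"):
    """Poll `condition` until it returns truthy, instead of sleeping a guessed duration."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"timed out after {timeout}s waiting for {desc}")
        time.sleep(interval)
```

A `time.sleep(RECONCILE_WAIT * 2)` then becomes something like `wait_for(lambda: _get_cr("httproute", name, ns) is not None, timeout=RECONCILE_WAIT * 2, desc="HTTPRoute created")`, failing fast with a clear message when the reconciler never converges.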
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: d4d0a86d-2ab7-4e4e-8e71-572891bef056
📒 Files selected for processing (2)
- test/e2e/scripts/prow_run_smoke_test.sh
- test/e2e/tests/test_external_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/scripts/prow_run_smoke_test.sh
```python
result = subprocess.run(
    ["oc", "apply", "-f", "-"],
    input=json.dumps(cr_dict),
    capture_output=True, text=True,
)
```
🧩 Analysis chain
🏁 Script executed:
`wc -l test/e2e/tests/test_external_models.py`

Repository: opendatahub-io/models-as-a-service
Length of output: 120
🏁 Script executed:
`cat -n test/e2e/tests/test_external_models.py | sed -n '40,80p'`

Repository: opendatahub-io/models-as-a-service
Length of output: 1695
Add subprocess timeouts to all oc helper calls.
These external process calls lack timeout= parameters on subprocess.run(). If oc or the Kubernetes API server stalls, the process hangs at the OS level and the fixture never reaches cleanup, deadlocking the entire E2E job. Note that the --timeout=30s flag in _delete_cr is a command-level argument to oc delete, not a subprocess timeout.
Apply to: lines 46–50, 58–61, 66–69, 74–77.
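The fix generalizes to any CLI wrapper; below is a sketch with an illustrative `run_cli` helper (not the PR's code) showing a Python-level timeout and how a hang surfaces as an exception instead of deadlocking the job:

```python
import subprocess

def run_cli(cmd, *, input_text=None, timeout=30.0):
    """Run a command with a Python-level timeout so a stalled process cannot hang the suite."""
    try:
        return subprocess.run(
            cmd,
            input=input_text,
            capture_output=True,
            text=True,
            timeout=timeout,  # raises subprocess.TimeoutExpired instead of blocking forever
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"{cmd[0]} timed out after {timeout}s") from exc
```

Note `subprocess.run(timeout=...)` kills the child and raises after the deadline, which is independent of any `--timeout` flag the CLI itself accepts.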
🧰 Tools
🪛 Ruff (0.15.7)
[error] 47-47: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 46 - 50, The
subprocess.run calls invoking "oc" in test/e2e/tests/test_external_models.py
currently lack a subprocess-level timeout causing potential hangs; update each
oc helper subprocess.run invocation (the calls shown around the variables result
= subprocess.run(...), the ones used to apply CRs and any subsequent oc commands
in the same test such as the calls at lines roughly 46–50, 58–61, 66–69, 74–77)
to include a sensible timeout= parameter (e.g., timeout=30) so the Python
process will raise a TimeoutExpired instead of hanging; ensure you add the
timeout argument to each subprocess.run call that runs "oc" and handle or let
the exception propagate consistently with existing test/fixture cleanup
behavior.
```python
def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
    """Get a Kubernetes resource as dict, or None if not found."""
    result = subprocess.run(
        ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)
```
Do not collapse every oc get failure into “not found.”
A transport/auth/CLI error returns None here exactly like a real NotFound. That can make discovery tests report missing resources and cleanup tests falsely pass when the API server is unreachable. Return None only for NotFound; raise on everything else.
Suggested fix
```diff
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
     """Get a Kubernetes resource as dict, or None if not found."""
     result = subprocess.run(
         ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
         capture_output=True, text=True,
     )
     if result.returncode != 0:
-        return None
+        stderr = (result.stderr or "").lower()
+        if "notfound" in stderr or "not found" in stderr:
+            return None
+        raise RuntimeError(
+            f"oc get failed for {kind}/{namespace}/{name}: {result.stderr.strip() or result.stdout.strip()}"
+        )
     return json.loads(result.stdout)
```

As per coding guidelines, REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.
🧰 Tools
🪛 Ruff (0.15.7)
[error] 74-74: subprocess call: check for execution of untrusted input
(S603)
[error] 75-75: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 72 - 80, The helper
_get_cr currently treats any non-zero oc exit as “not found”; change it so that
when subprocess.run returns non-zero you inspect result.stderr (and/or
result.stdout) for a Kubernetes NotFound indicator (e.g., contains "NotFound" or
"not found") and only return None in that case; for any other error
(transport/auth/CLI failure) raise an exception (RuntimeError or
subprocess.CalledProcessError) including the command, returncode and stderr so
callers can distinguish API/CLI failures from true NotFound. Ensure the change
is made inside the _get_cr function and preserves the existing successful
json.loads(result.stdout) path.
jland-redhat
left a comment
Could we revisit the possibly deleted test?

Could we also add some logic that validates we have access to httpbin.org before we run the E2E test, and just print a warning if the connection does not work?

We may run these in a disconnected environment, and I would like to eventually have a backup plan here in case that endpoint is unavailable.
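A pre-flight check of the kind requested here could look like the following sketch — the helper name, logger name, and warning text are illustrative, not from the PR:

```python
import logging
import urllib.error
import urllib.request

log = logging.getLogger("e2e.preflight")

def external_endpoint_reachable(url: str, timeout: float = 5.0) -> bool:
    """Best-effort probe; on disconnected clusters, warn instead of failing the suite."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except (urllib.error.URLError, OSError) as exc:
        log.warning("External endpoint %s unreachable (%s); egress tests may be skipped", url, exc)
        return False
```

A fixture could call this once and `pytest.skip` (or just warn) when it returns `False`, which also leaves room for a future fallback endpoint on disconnected environments.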
```shell
"$test_dir/tests/test_namespace_scoping.py" \
"$test_dir/tests/test_subscription.py" \
"$test_dir/tests/test_models_endpoint.py" \
"$test_dir/tests/test_external_oidc.py" ; then
```
The reconciler overwrites the DestinationRule on reconciliation, removing the insecureSkipVerify patch needed for the simulator's self-signed cert. Added an autouse fixture that re-patches before each test class.
Force-pushed from e3ae602 to 16508d6 (Compare)
|
@jland-redhat — thanks for the review! Addressed both points:

1. Connectivity check for httpbin.org:
2. test_models_endpoint.py:

Also merged the external model tests into the main pytest invocation (no longer a separate block with separate artifacts), so they run alongside all other tests and can't get lost.
|
@srampal — these tests validate MaaS egress capabilities (discovery, auth, routing to an external endpoint) using httpbin.org. They don't test provider-specific inference or real cloud services — that's covered by the ai-gateway-payload-processing repo's E2E tests, which test the full plugin chain (request/response translation) against a simulator for all providers (OpenAI, Anthropic, Azure, Vertex). These MaaS tests focus on: does the ExternalModel reconciler create the right resources? Does auth block unauthenticated requests? Does a request with a valid key reach the external endpoint? Does cleanup work? |
Actionable comments posted: 2
♻️ Duplicate comments (2)
test/e2e/tests/test_external_models.py (2)
213-214: ⚠️ Potential issue | 🟠 Major

Stop logging API key material (CWE-532). `api_key[:15]` writes 15 characters of a live credential to CI logs and archived artifacts. Anyone with log access gains partial secret material. Log only a constant placeholder or the key name.

Proposed fix

```diff
 api_key = r.json().get("key")
-log.info(f"API key created: {api_key[:15]}...")
+log.info("API key created for e2e-external-model-key")
```

As per coding guidelines, REVIEW PRIORITIES: 1. Security vulnerabilities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 213 - 214, The test currently logs secret material via the api_key variable (see api_key = r.json().get("key") and the log.info call); remove any substring of the real key from logs and replace the log message with a constant placeholder or the key identifier (e.g., "API key created: <REDACTED>" or "API key created for test-user") so no live credential characters are written to CI/artifacts; update the log.info invocation that references api_key to use the placeholder instead.
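The safe pattern is to log only non-secret metadata. A tiny sketch (the helper name is illustrative, not from the PR):

```python
def redact_secret(value):
    """Produce a log-safe description of a credential without leaking any of its bytes."""
    if not value:
        return "<missing>"
    return f"<redacted:{len(value)} chars>"
```

Then `log.info(f"API key created: {redact_secret(api_key)}")` reveals only the length, which is usually acceptable; a constant placeholder is stricter still.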
72-80: ⚠️ Potential issue | 🟠 Major

Distinguish NotFound from transport/auth errors in `_get_cr`. Returning `None` for any non-zero exit conflates "resource doesn't exist" with "cluster unreachable" or "auth failed". Discovery tests will report missing resources and cleanup tests will falsely pass when the API server is down.

Proposed fix

```diff
 def _get_cr(kind: str, name: str, namespace: str) -> Optional[dict]:
     """Get a Kubernetes resource as dict, or None if not found."""
     result = subprocess.run(
         ["oc", "get", kind, name, "-n", namespace, "-o", "json"],
-        capture_output=True, text=True,
+        capture_output=True, text=True, timeout=30,
     )
     if result.returncode != 0:
+        stderr = (result.stderr or "").lower()
+        if "notfound" in stderr or "not found" in stderr:
+            return None
+        raise RuntimeError(
+            f"oc get {kind}/{name} in {namespace} failed: {result.stderr.strip()}"
+        )
-        return None
     return json.loads(result.stdout)
```

As per coding guidelines, REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/tests/test_external_models.py` around lines 72 - 80, The helper _get_cr currently returns None for any non-zero exit from the subprocess.run call, which hides transport/auth errors; change its handling so that after running the oc command (the subprocess.run in _get_cr) you inspect result.returncode and result.stderr: if the stderr clearly indicates the resource is missing (e.g., contains "NotFound" or "not found"), return None, but for any other non-zero exit treat it as an operational error (raise a RuntimeError or re-raise a CalledProcessError with the command, returncode and stderr) so callers can distinguish cluster/auth problems from absent resources.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 74438079-7a7a-49bc-848c-81d41c353a1e
📒 Files selected for processing (2)
- test/e2e/scripts/prow_run_smoke_test.sh
- test/e2e/tests/test_external_models.py
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/scripts/prow_run_smoke_test.sh
```python
def _apply_cr(cr_dict: dict):
    """Apply a Kubernetes CR from a dict."""
    result = subprocess.run(
        ["oc", "apply", "-f", "-"],
        input=json.dumps(cr_dict),
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        log.warning(f"oc apply failed: {result.stderr}")
    return result.returncode == 0
```
Add subprocess timeout to prevent CI job deadlock.
All subprocess.run() calls to oc lack a timeout= parameter. If the Kubernetes API server stalls or oc hangs, the fixture never reaches cleanup and the entire E2E job deadlocks. The --timeout=30s flag in _delete_cr is a command-level argument to oc delete, not a Python subprocess timeout.
Proposed fix for all helpers
```diff
 def _apply_cr(cr_dict: dict):
     """Apply a Kubernetes CR from a dict."""
     result = subprocess.run(
         ["oc", "apply", "-f", "-"],
         input=json.dumps(cr_dict),
-        capture_output=True, text=True,
+        capture_output=True, text=True, timeout=60,
     )
```

Apply similar changes to `_delete_cr`, `_patch_cr`, and `_get_cr`.
🧰 Tools
🪛 Ruff (0.15.7)
[error] 47-47: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 44 - 53, Subprocess
calls to `oc` in helper functions can hang because they lack a Python-level
timeout; update `_apply_cr` (and similarly `_delete_cr`, `_patch_cr`, `_get_cr`)
to pass a `timeout=` argument to `subprocess.run()` (e.g., 30s) and handle
`subprocess.TimeoutExpired` by logging the timeout error (using existing
`log.warning` or similar) and returning failure (False or non-zero) so
fixtures/cleanup don’t deadlock; ensure the behavior on normal non-zero return
codes remains unchanged.
```python
@pytest.fixture(scope="module")
def external_models_setup(gateway_url, headers, api_keys_base_url):
    """
    Create a single ExternalModel CR, MaaSModelRef, AuthPolicy, and
    Subscription pointing to an external endpoint. Cleanup after tests.
    """
    log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")

    # Create a dummy secret (ExternalModel requires credentialRef)
    _apply_cr({
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"{EXTERNAL_MODEL_NAME}-api-key",
            "namespace": MODEL_NAMESPACE,
        },
        "type": "Opaque",
        "stringData": {"api-key": "e2e-test-key"},
    })

    # Create ExternalModel CR
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "ExternalModel",
        "metadata": {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
        "spec": {
            "provider": "openai",
            "endpoint": EXTERNAL_ENDPOINT,
            "credentialRef": {
                "name": f"{EXTERNAL_MODEL_NAME}-api-key",
                "namespace": MODEL_NAMESPACE,
            },
        },
    })

    # Create MaaSModelRef
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSModelRef",
        "metadata": {
            "name": EXTERNAL_MODEL_NAME,
            "namespace": MODEL_NAMESPACE,
            "annotations": {
                "maas.opendatahub.io/endpoint": EXTERNAL_ENDPOINT,
                "maas.opendatahub.io/provider": "openai",
            },
        },
        "spec": {
            "modelRef": {"kind": "ExternalModel", "name": EXTERNAL_MODEL_NAME},
        },
    })

    # Create MaaSAuthPolicy
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSAuthPolicy",
        "metadata": {"name": EXTERNAL_AUTH_POLICY, "namespace": SUBSCRIPTION_NAMESPACE},
        "spec": {
            "modelRefs": [{"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE}],
            "subjects": {"groups": [{"name": "system:authenticated"}]},
        },
    })

    # Create MaaSSubscription
    _apply_cr({
        "apiVersion": "maas.opendatahub.io/v1alpha1",
        "kind": "MaaSSubscription",
        "metadata": {"name": EXTERNAL_SUBSCRIPTION, "namespace": SUBSCRIPTION_NAMESPACE},
        "spec": {
            "owner": {"groups": [{"name": "system:authenticated"}]},
            "modelRefs": [
                {"name": EXTERNAL_MODEL_NAME, "namespace": MODEL_NAMESPACE},
            ],
        },
    })

    # Wait for reconciler + auth propagation
    time.sleep(RECONCILE_WAIT * 2)

    # Create API key for tests
    log.info("Creating API key for external model tests...")
    r = requests.post(
        api_keys_base_url,
        headers=headers,
        json={"name": "e2e-external-model-key", "subscription": EXTERNAL_SUBSCRIPTION},
        timeout=30,
        verify=TLS_VERIFY,
    )
    if r.status_code not in (200, 201):
        pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")

    api_key = r.json().get("key")
    log.info(f"API key created: {api_key[:15]}...")

    yield {
        "api_key": api_key,
        "gateway_url": gateway_url,
    }

    # ── Cleanup ──
    log.info("Cleaning up external model test fixtures...")
    _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
    _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
    _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
              {"metadata": {"finalizers": []}})
    _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
    _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
    _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
```
Fixture leaks resources if setup fails before yield.
If pytest.fail() at line 211 is triggered (API key creation fails), the Secret, ExternalModel, MaaSModelRef, MaaSAuthPolicy, and MaaSSubscription already applied by _apply_cr are never cleaned up. Pytest does not execute teardown code after yield when an exception occurs before reaching yield.
Proposed fix using try/finally
```diff
 @pytest.fixture(scope="module")
 def external_models_setup(gateway_url, headers, api_keys_base_url):
     log.info(f"Setting up external model test fixture (endpoint: {EXTERNAL_ENDPOINT})...")
+    api_key = None
+    try:
+        # Create a dummy secret (ExternalModel requires credentialRef)
+        _apply_cr({...})
+        # ... all other _apply_cr calls ...
-    # Wait for reconciler + auth propagation
-    time.sleep(RECONCILE_WAIT * 2)
+        time.sleep(RECONCILE_WAIT * 2)
-    # Create API key for tests
-    log.info("Creating API key for external model tests...")
-    r = requests.post(...)
-    if r.status_code not in (200, 201):
-        pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
+        r = requests.post(...)
+        if r.status_code not in (200, 201):
+            pytest.fail(f"Failed to create API key: {r.status_code} {r.text}")
-    api_key = r.json().get("key")
-    log.info(f"API key created: {api_key[:15]}...")
+        api_key = r.json().get("key")
+        log.info("API key created for e2e-external-model-key")
-    yield {...}
+        yield {"api_key": api_key, "gateway_url": gateway_url}
+    finally:
+        log.info("Cleaning up external model test fixtures...")
+        _delete_cr("maasauthpolicy", EXTERNAL_AUTH_POLICY, SUBSCRIPTION_NAMESPACE)
+        _delete_cr("maassubscription", EXTERNAL_SUBSCRIPTION, SUBSCRIPTION_NAMESPACE)
+        _patch_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE,
+                  {"metadata": {"finalizers": []}})
+        _delete_cr("maasmodelref", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+        _delete_cr("externalmodel", EXTERNAL_MODEL_NAME, MODEL_NAMESPACE)
+        _delete_cr("secret", f"{EXTERNAL_MODEL_NAME}-api-key", MODEL_NAMESPACE)
-    # ── Cleanup ──
-    log.info("Cleaning up external model test fixtures...")
-    ...
```
- ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/tests/test_external_models.py` around lines 122 - 229, The fixture
external_models_setup can leak applied resources if setup fails before yield;
wrap the entire setup and API key creation in a try/finally (or register a
pytest request.addfinalizer) so the cleanup block that uses _delete_cr and
_patch_cr always runs even on exceptions; specifically, perform all _apply_cr
calls and the requests.post inside a try, assign the api_key/gateway_url to a
local result dict, yield that dict inside the try, and move the existing cleanup
calls (_delete_cr("maasauthpolicy", ...), _delete_cr("maassubscription", ...),
_patch_cr("maasmodelref", ...), _delete_cr("maasmodelref", ...),
_delete_cr("externalmodel", ...), _delete_cr("secret", ...)) into the finally
block so they execute whether pytest.fail()/exceptions occur before or after
yield.
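The try/finally-around-yield guarantee described above generalizes beyond pytest. A self-contained sketch with stand-in `apply`/`delete` callables (the real fixture would pass `_apply_cr`/`_delete_cr`):

```python
from contextlib import contextmanager

@contextmanager
def managed_resources(apply, delete, specs):
    """Apply resources in order; guarantee deletion of whatever was created,
    even if setup fails partway through."""
    created = []
    try:
        for spec in specs:
            apply(spec)
            created.append(spec)
        yield created
    finally:
        # Runs on normal exit, on test errors, and on mid-setup failures alike
        for spec in reversed(created):
            delete(spec)
```

If `apply` raises on the second resource, the first is still deleted — exactly the property the review asks the fixture to have.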
|
@noyitz: The following test failed, say
Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |

Summary
E2E tests for the ExternalModel feature, focused on MaaS capabilities:
Uses `httpbin.org` as the external endpoint (configurable via `E2E_EXTERNAL_ENDPOINT`). No BBR/plugin dependency — tests validate MaaS egress routing and auth, not payload transformation.
Changes
- `test/e2e/tests/test_external_models.py`: 7 tests covering discovery, auth, egress connectivity, and cleanup
- `test/e2e/scripts/prow_run_smoke_test.sh`: external model tests section (commented out until CI includes the ExternalModel reconciler)
Test plan
Summary by CodeRabbit
Tests
Chores