
Fix multi-node test for kserve#1144

Merged
mwaykole merged 4 commits into opendatahub-io:main from mwaykole:smoke-refactor
Feb 25, 2026
Conversation

@mwaykole
Member

@mwaykole mwaykole commented Feb 25, 2026

Summary

Related Issues

  • Fixes:
  • JIRA:

How it has been tested

  • Locally
  • Jenkins

Additional Requirements

  • If this PR introduces a new test image, did you create a PR to mirror it in the disconnected environment?
  • If this PR introduces new marker(s) or adds a new component, was a ticket created to update the relevant Jenkins job?

Summary by CodeRabbit

  • Bug Fixes

    • Detect and surface HTTP 500 inference errors.
    • Improve pod failure detection and readiness checks for more accurate failure reporting.
    • Relax model output validation to tolerate minor structural variations.
  • Tests

    • Strengthen multi-node inference tests with active health probing, polling-based synchronization, and automated recovery validation.
  • Documentation

    • Add detailed health-probing docstrings and logging to test helpers.

@github-actions

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, comment /wip; to remove the WIP state, comment /wip cancel.
  • To block merging of a PR, comment /hold; to unblock merging, comment /hold cancel.
  • To approve a PR, comment /lgtm; to remove approval, comment /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name>. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR is created and linked to the current PR.
  • To build and push an image to quay, comment /build-push-pr-image. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/verified', '/cherry-pick', '/wip', '/build-push-pr-image', '/hold', '/lgtm'}

Signed-off-by: Milind waykole <mwaykole@redhat.com>
@coderabbitai
Contributor

coderabbitai Bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Adds health-probing and recovery checks to multi-node KServe tests, refactors pod-failure detection in infra utilities for finer-grained evaluation, introduces HTTP 500 detection in inference runner, and relaxes a VLLM output regex to allow minor response variations.
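The HTTP 500 detection described for the inference runner amounts to scanning the raw command output for an error marker before parsing. A minimal, hedged sketch of that idea (the InferenceResponseError class, the exact marker strings, and the function name are stand-ins based on this summary, not the project's actual code):

```python
class InferenceResponseError(Exception):
    """Stand-in for the project's inference error type."""


# Assumed marker strings; the real runner looks for HTTP 500/503 error text.
ERROR_MARKERS = ("500 INTERNAL_SERVER_ERROR", "503 SERVICE_UNAVAILABLE")


def check_inference_output(output: str) -> str:
    """Raise InferenceResponseError if the raw command output signals a 5xx."""
    for marker in ERROR_MARKERS:
        if marker in output:
            raise InferenceResponseError(f"inference failed: {marker}")
    return output


def raises_inference_error(output: str) -> bool:
    """Helper for exercising the check without try/except at call sites."""
    try:
        check_inference_output(output)
    except InferenceResponseError:
        return True
    return False
```

The point of checking the text before parsing is that a 500 body is often not valid JSON, so failing early yields a clearer error than a downstream decode failure.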

Changes

Cohort / File(s) — Summary

  • Multi-node test fixtures and helpers — tests/model_serving/model_server/kserve/multi_node/conftest.py
    patched_multi_node_spec signature now accepts unprivileged_client; autoscaler_mode changed from "external" to "none". Added polling for pod generation via get_pods_by_isvc_generation, added _warmup_inference_and_wait_for_recovery and _probe_inference_health helpers (curl-based /v1/completions probe), and extended deleted_multi_node_pod to wait for replicas and trigger a recovery probe. LOGGER and imports (ApiException, get_logger, WORKER_POD_ROLE) added.
  • Inference utilities — utilities/inference_utils.py
    run_inference now detects HTTP 500 INTERNAL_SERVER_ERROR text in command output and raises InferenceResponseError for 500 responses, in addition to the existing 503 handling.
  • Infrastructure pod verification — utilities/infra.py
    verify_no_failed_pods refactored: initializes container error sets outside the per-pod loop, handles CRASH_LOOPBACK_OFF differently based on deploymentMode, combines init and regular container statuses for failure checks, treats missing pods as non-fatal in the loop, raises FailedPodsError if any pod is in a failing state, and otherwise verifies all pods reach READY.
  • Model inference configurations — utilities/manifests/vllm.py
    Relaxed the VLLM_INFERENCE_CONFIG.default_query_model.query_output regex to permit additional/wildcarded segments between fields and at the end, broadening the allowed response structure.
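The curl-based /v1/completions probe described for the conftest.py helpers could look roughly like the sketch below. The Pod stub, the payload shape, and the localhost port are assumptions for illustration; the real helper executes curl inside the head pod via the Kubernetes exec API:

```python
import json

WORKER_POD_ROLE = "worker"  # assumed value of the project constant


class Pod:
    """Minimal stand-in for the real Pod resource used in the tests."""

    def __init__(self, name: str, response: str = ""):
        self.name = name
        self._response = response

    def execute(self, command):
        # The real implementation shells out inside the pod; here we
        # just return a canned response for illustration.
        return self._response


def probe_inference_health(pods, model_name: str = "demo-model") -> bool:
    """Return True if the head pod answers a /v1/completions request."""
    curl_cmd = [
        "curl", "-s", "http://localhost:8080/v1/completions",
        "-H", "Content-Type: application/json",
        "-d", json.dumps({"model": model_name, "prompt": "ping", "max_tokens": 1}),
    ]
    for pod in pods:
        if WORKER_POD_ROLE in pod.name:
            continue  # probe only the head pod; workers serve no HTTP endpoint
        try:
            out = pod.execute(command=curl_cmd)
        except Exception:  # noqa: BLE001 - any probe failure means "unhealthy"
            return False
        return bool(out) and '"choices"' in out
    return False
```

A boolean probe like this composes naturally with a polling loop: the warmup helper can call it repeatedly until it returns True or a timeout expires.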

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check — ⚠️ Warning
    Explanation: The pull request description is entirely template placeholder text with no actual content filled in; all sections are empty or contain only HTML comments.
    Resolution: Complete the description by (1) summarizing the changes made to multi-node test fixtures and utilities; (2) filling in related issues/JIRA references; (3) specifying how the changes were tested (locally and/or Jenkins); (4) answering the additional requirements questions.
  • Docstring Coverage — ⚠️ Warning
    Explanation: Docstring coverage is 50.00%, which is below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive
    Explanation: The title 'Fix multi-node test for kserve' is vague and generic; it lacks specificity about what is being fixed in the multi-node test.
    Resolution: Clarify the title with specific details, such as 'Fix multi-node test health probing for kserve' or 'Fix multi-node test pod recovery for kserve', to better convey the main change.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/model_serving/model_server/kserve/multi_node/conftest.py (2)

330-332: Worker pod filtering by name substring is fragile but likely sufficient.

WORKER_POD_ROLE in pod.name relies on the pod naming convention containing the role string. If the naming convention changes, this could silently miss worker pods. Consider matching on a pod label, if one exists for the role, for more robust filtering.

#!/bin/bash
# Check what WORKER_POD_ROLE is defined as and if pods have role labels
rg -n "WORKER_POD_ROLE" --type py -C3
rg -n "pod-role|role.*worker|worker.*role" --type py -C3
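The label-first filtering this comment suggests could be sketched as below; the "pod-role"/"role" label keys and the Pod stub are assumptions, since the source does not confirm which label (if any) the pods carry:

```python
WORKER_POD_ROLE = "worker"  # assumed value of the project constant


class Pod:
    """Stand-in for the real Pod resource; real pods expose .labels and .name."""

    def __init__(self, name: str, labels=None):
        self.name = name
        self.labels = labels or {}


def is_worker_pod(pod) -> bool:
    """Prefer an explicit role label; fall back to the name substring check."""
    role = pod.labels.get("pod-role") or pod.labels.get("role")
    if role is not None:
        return role == WORKER_POD_ROLE
    # No role label present: fall back to the fragile name-based check.
    return WORKER_POD_ROLE in pod.name
```

Checking the label first keeps the filter correct even if the naming convention changes, while the fallback preserves current behavior on clusters without the label.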
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines
330 - 332, The current loop filters worker pods using the fragile substring
check WORKER_POD_ROLE in pod.name; update the filter in the loop that iterates
get_pods_by_isvc_label(client=client, isvc=isvc) to prefer an explicit pod label
check (e.g., pod.labels.get('role') or pod.labels.get('pod-role') ==
WORKER_POD_ROLE) and only fall back to the pod.name substring check if no role
label exists; modify the loop body that references WORKER_POD_ROLE and pod.name
to first inspect pod.labels, then continue for matching worker-role pods.

321-348: Broad except Exception is pragmatic here but consider narrowing slightly.

The broad exception catch on line 344 is flagged by static analysis (BLE001). In this probing context, it's reasonable to catch broadly since the probe should return False on any failure. However, catching more specific exceptions (e.g., kubernetes.client.exceptions.ApiException, IOError) would avoid silently swallowing truly unexpected errors like KeyboardInterrupt (though Exception doesn't catch that).

This is a minor observation — the current approach is acceptable for test utility code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines
321 - 348, In _probe_inference_health, replace the broad "except Exception as
exc:" around the pod.execute(command=cmd) call with a narrower catch that
handles expected probe failures (e.g., except
(kubernetes.client.exceptions.ApiException, IOError, OSError) as exc:) and
log/return False for those, but re-raise truly fatal exceptions (SystemExit,
KeyboardInterrupt) if encountered; reference the pod.execute(...) call and the
except block around it when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 24-26: The LOGGER assignment is placed between import statements;
move the module-level LOGGER = get_logger(name=__name__) so that all imports
(including from utilities.general import download_model_data) come first, then
define LOGGER; ensure LOGGER uses get_logger and remains a top-level variable
after the import block and before any other module-level logic in conftest.py.
- Around line 217-225: TimeoutSampler is currently calling
get_pods_by_isvc_generation which raises ResourceNotFoundError when no pods
exist, causing the exception to escape instead of retrying; fix this by passing
exceptions_dict={ResourceNotFoundError: []} into the TimeoutSampler invocation
(the loop that constructs TimeoutSampler with
wait_timeout=Timeout.TIMEOUT_10MIN, sleep=10, func=get_pods_by_isvc_generation,
client=unprivileged_client, isvc=multi_node_inference_service) so the sampler
treats ResourceNotFoundError as a retriable condition and continues polling.

In `@utilities/infra.py`:
- Around line 811-812: Remove the invalid container-level reason from the
pod-phase check: delete CRASH_LOOPBACK_OFF from the tuple used when testing
pod_status.phase so the condition only checks valid pod phases (e.g.,
pod.Status.FAILED); update the code around the failed_pods population (the if
that references pod_status.phase and adds to failed_pods[pod.name]) to only
compare pod_status.phase against pod.Status.FAILED (leave container-level
CRASH_LOOPBACK_OFF handling where container state errors are collected).


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3a82bc8 and cfdc70a.

📒 Files selected for processing (4)
  • tests/model_serving/model_server/kserve/multi_node/conftest.py
  • utilities/inference_utils.py
  • utilities/infra.py
  • utilities/manifests/vllm.py

Signed-off-by: Milind waykole <mwaykole@redhat.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
utilities/infra.py (1)

804-805: ⚠️ Potential issue | 🟡 Minor

Remove CRASH_LOOPBACK_OFF from pod phase check.

On Line 804, CrashLoopBackOff is not a pod phase, so this condition is dead/misleading. Keep it only in container waiting/terminated reason checks.

Suggested fix
-            if pod_status.phase in (pod.Status.CRASH_LOOPBACK_OFF, pod.Status.FAILED):
+            if pod_status.phase == pod.Status.FAILED:
                 failed_pods[pod.name] = pod_status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utilities/infra.py` around lines 804 - 805, The check comparing
pod_status.phase against pod.Status.CRASH_LOOPBACK_OFF is incorrect because
CrashLoopBackOff is a container waiting/termination reason, not a Pod phase;
update the conditional in the block that populates failed_pods so it only checks
pod_status.phase == pod.Status.FAILED (or other real Pod phases) and remove
pod.Status.CRASH_LOOPBACK_OFF from that phase check, and ensure any
CrashLoopBackOff detection is performed elsewhere by inspecting container
statuses (e.g., iterating pod.status.container_statuses and checking each
container.state.waiting.reason / container.state.terminated.reason for
pod.Status.CRASH_LOOPBACK_OFF).
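Container-level CrashLoopBackOff detection, as the prompt prescribes, could be sketched like this; the small State/ContainerStatus classes are stand-ins for the Kubernetes client's V1ContainerState/V1ContainerStatus models:

```python
CRASH_LOOP_BACK_OFF = "CrashLoopBackOff"  # container reason, not a Pod phase


class _Reason:
    def __init__(self, reason):
        self.reason = reason


class State:
    """Stand-in for V1ContainerState: .waiting / .terminated or None."""

    def __init__(self, waiting_reason=None, terminated_reason=None):
        self.waiting = _Reason(waiting_reason) if waiting_reason else None
        self.terminated = _Reason(terminated_reason) if terminated_reason else None


class ContainerStatus:
    """Stand-in for V1ContainerStatus."""

    def __init__(self, name, state):
        self.name = name
        self.state = state


def crash_looping_containers(container_statuses):
    """Return names of containers whose waiting/terminated reason is CrashLoopBackOff."""
    failing = []
    for cs in container_statuses or []:
        for st in (cs.state.waiting, cs.state.terminated):
            if st is not None and st.reason == CRASH_LOOP_BACK_OFF:
                failing.append(cs.name)
                break
    return failing
```

Valid Pod phases are only Pending, Running, Succeeded, Failed, and Unknown, which is why CrashLoopBackOff must be looked for in container statuses rather than in pod_status.phase.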
tests/model_serving/model_server/kserve/multi_node/conftest.py (1)

219-225: ⚠️ Potential issue | 🔴 Critical

Reintroduced polling breakage: missing retriable exception config in sampler.

This is the same issue previously raised: when get_pods_by_isvc_generation raises ResourceNotFoundError, polling exits early instead of retrying.

Suggested fix
         for sample in TimeoutSampler(
             wait_timeout=Timeout.TIMEOUT_10MIN,
             sleep=10,
+            exceptions_dict={ResourceNotFoundError: []},
             func=get_pods_by_isvc_generation,
             client=unprivileged_client,
             isvc=multi_node_inference_service,
         ):
#!/bin/bash
# Confirm helper behavior and sampler configuration.
UTILS_FILE=$(fd -t f "utils.py" tests/model_serving/model_server/kserve/multi_node | head -n1)
echo "utils file: ${UTILS_FILE}"
rg -n "def get_pods_by_isvc_generation|raise .*ResourceNotFoundError" "${UTILS_FILE}" -A20 -B4
rg -n "patched_multi_node_spec|TimeoutSampler\\(" tests/model_serving/model_server/kserve/multi_node/conftest.py -A20 -B4
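The exceptions_dict fix boils down to a retry-through-expected-exceptions pattern: keep polling while a known "not ready yet" exception is raised, and only give up at the timeout. A stdlib-only sketch of that pattern (poll_until is a simplified stand-in for TimeoutSampler, and ResourceNotFoundError is stubbed locally):

```python
import time


class ResourceNotFoundError(Exception):
    """Stand-in for the project's exception raised while no pods exist yet."""


def poll_until(func, retriable=(), timeout=5.0, sleep=0.01):
    """Call func until it succeeds; exceptions in `retriable` trigger a retry."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return func()
        except retriable:
            if time.monotonic() >= deadline:
                raise TimeoutError("polling timed out")
            time.sleep(sleep)


def make_flaky(fail_times, result):
    """Build a callable that raises ResourceNotFoundError fail_times times, then succeeds."""
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise ResourceNotFoundError("no pods yet")
        return result

    return flaky
```

Without listing the exception as retriable (the empty default), the first ResourceNotFoundError escapes immediately, which is exactly the early-exit failure mode the review describes.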
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines
219 - 225, The TimeoutSampler invocation around get_pods_by_isvc_generation is
missing configuration to treat ResourceNotFoundError as retriable, so the
sampler exits early when get_pods_by_isvc_generation raises
ResourceNotFoundError; update the TimeoutSampler call (the instance created in
conftest.py that currently passes wait_timeout, sleep,
func=get_pods_by_isvc_generation, client=unprivileged_client,
isvc=multi_node_inference_service) to include the retriable exceptions parameter
(e.g., retriable_exceptions or retry_exceptions depending on TimeoutSampler API)
and pass ResourceNotFoundError so the sampler will continue polling until
success instead of failing on the first ResourceNotFoundError from
get_pods_by_isvc_generation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 319-326: Wrap the call to get_pods_by_isvc_label inside
_probe_inference_health with a try/except for ResourceNotFoundError and treat
that case as "no pods yet" (return False or otherwise indicate unhealthy) so the
TimeoutSampler loop keeps retrying; specifically catch ResourceNotFoundError
thrown by get_pods_by_isvc_label, log/ignore it as needed, and ensure
_probe_inference_health returns a falsy value instead of letting the exception
propagate so TimeoutSampler (used in the for loop) can continue retries.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between cfdc70a and 305ba69.

📒 Files selected for processing (4)
  • tests/model_serving/model_server/kserve/multi_node/conftest.py
  • utilities/inference_utils.py
  • utilities/infra.py
  • utilities/manifests/vllm.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • utilities/manifests/vllm.py

@mwaykole mwaykole merged commit 6e1e3f0 into opendatahub-io:main Feb 25, 2026
8 checks passed
@github-actions

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.
