
Fix multi-node test for kserve#1144

Merged
mwaykole merged 4 commits into opendatahub-io:main from mwaykole:smoke-refactor
Feb 25, 2026
Conversation

@mwaykole
Member

@mwaykole mwaykole commented Feb 25, 2026

Summary

Related Issues

  • Fixes:
  • JIRA:

How it has been tested

  • Locally
  • Jenkins

Additional Requirements

  • If this PR introduces a new test image, did you create a PR to mirror it in the disconnected environment?
  • If this PR introduces new marker(s) or adds a new component, was a ticket created to update the relevant Jenkins job?

Summary by CodeRabbit

  • Bug Fixes

    • Detect and surface HTTP 500 inference errors.
    • Improve pod failure detection and readiness checks for more accurate failure reporting.
    • Relax model output validation to tolerate minor structural variations.
  • Tests

    • Strengthen multi-node inference tests with active health probing, polling-based synchronization, and automated recovery validation.
  • Documentation

    • Add detailed health-probing docstrings and logging to test helpers.

@github-actions

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, comment /wip; to remove the WIP state, comment /wip cancel.
  • To block merging of a PR, comment /hold; to unblock merging, comment /hold cancel.
  • To approve a PR, comment /lgtm; to remove approval, comment /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name>. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR is created and linked to the current PR.
  • To build and push an image to quay, comment /build-push-pr-image. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/verified', '/cherry-pick', '/wip', '/build-push-pr-image', '/hold', '/lgtm'}

Signed-off-by: Milind waykole <mwaykole@redhat.com>
@coderabbitai
Contributor

coderabbitai Bot commented Feb 25, 2026

📝 Walkthrough

Walkthrough

Adds health-probing and recovery checks to multi-node KServe tests, refactors pod-failure detection in infra utilities for finer-grained evaluation, introduces HTTP 500 detection in inference runner, and relaxes a VLLM output regex to allow minor response variations.
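The HTTP 500 detection described for the inference runner amounts to scanning the raw command output for an error marker before parsing. A minimal, hedged sketch of that idea (the InferenceResponseError class, the exact marker strings, and the function name are stand-ins based on this summary, not the project's actual code):

```python
class InferenceResponseError(Exception):
    """Stand-in for the project's inference error type."""


# Assumed marker strings; the real runner looks for HTTP 500/503 error text.
ERROR_MARKERS = ("500 INTERNAL_SERVER_ERROR", "503 SERVICE_UNAVAILABLE")


def check_inference_output(output: str) -> str:
    """Raise InferenceResponseError if the raw command output signals a 5xx."""
    for marker in ERROR_MARKERS:
        if marker in output:
            raise InferenceResponseError(f"inference failed: {marker}")
    return output


def raises_inference_error(output: str) -> bool:
    """Helper for exercising the check without try/except at call sites."""
    try:
        check_inference_output(output)
    except InferenceResponseError:
        return True
    return False
```

The point of checking the text before parsing is that a 500 body is often not valid JSON, so failing early yields a clearer error than a downstream decode failure.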

Changes

Cohort / File(s) — Summary

  • Multi-node test fixtures and helpers — tests/model_serving/model_server/kserve/multi_node/conftest.py
    patched_multi_node_spec signature now accepts unprivileged_client; autoscaler_mode changed from "external" to "none". Added polling for pod generation via get_pods_by_isvc_generation, added _warmup_inference_and_wait_for_recovery and _probe_inference_health helpers (curl-based /v1/completions probe), and extended deleted_multi_node_pod to wait for replicas and trigger a recovery probe. LOGGER and imports (ApiException, get_logger, WORKER_POD_ROLE) added.
  • Inference utilities — utilities/inference_utils.py
    run_inference now detects HTTP 500 INTERNAL_SERVER_ERROR text in command output and raises InferenceResponseError for 500 responses, in addition to the existing 503 handling.
  • Infrastructure pod verification — utilities/infra.py
    verify_no_failed_pods refactored: initializes container error sets outside the per-pod loop, handles CRASH_LOOPBACK_OFF differently based on deploymentMode, combines init and regular container statuses for failure checks, treats missing pods as non-fatal in the loop, raises FailedPodsError if any pod is in a failing state, and otherwise verifies all pods reach READY.
  • Model inference configurations — utilities/manifests/vllm.py
    Relaxed the VLLM_INFERENCE_CONFIG.default_query_model.query_output regex to permit additional/wildcarded segments between fields and at the end, broadening the allowed response structure.
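The curl-based /v1/completions probe described for the conftest.py helpers could look roughly like the sketch below. The Pod stub, the payload shape, and the localhost port are assumptions for illustration; the real helper executes curl inside the head pod via the Kubernetes exec API:

```python
import json

WORKER_POD_ROLE = "worker"  # assumed value of the project constant


class Pod:
    """Minimal stand-in for the real Pod resource used in the tests."""

    def __init__(self, name: str, response: str = ""):
        self.name = name
        self._response = response

    def execute(self, command):
        # The real implementation shells out inside the pod; here we
        # just return a canned response for illustration.
        return self._response


def probe_inference_health(pods, model_name: str = "demo-model") -> bool:
    """Return True if the head pod answers a /v1/completions request."""
    curl_cmd = [
        "curl", "-s", "http://localhost:8080/v1/completions",
        "-H", "Content-Type: application/json",
        "-d", json.dumps({"model": model_name, "prompt": "ping", "max_tokens": 1}),
    ]
    for pod in pods:
        if WORKER_POD_ROLE in pod.name:
            continue  # probe only the head pod; workers serve no HTTP endpoint
        try:
            out = pod.execute(command=curl_cmd)
        except Exception:  # noqa: BLE001 - any probe failure means "unhealthy"
            return False
        return bool(out) and '"choices"' in out
    return False
```

A boolean probe like this composes naturally with a polling loop: the warmup helper can call it repeatedly until it returns True or a timeout expires.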

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Description check — ⚠️ Warning
    Explanation: The pull request description is entirely template placeholder text with no actual content filled in; all sections are empty or contain only HTML comments.
    Resolution: Complete the description by (1) summarizing the changes made to multi-node test fixtures and utilities; (2) filling in related issues/JIRA references; (3) specifying how the changes were tested (locally and/or Jenkins); (4) answering the additional requirements questions.
  • Docstring Coverage — ⚠️ Warning
    Explanation: Docstring coverage is 50.00%, which is below the required threshold of 80.00%.
    Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check — ❓ Inconclusive
    Explanation: The title 'Fix multi-node test for kserve' is vague and generic; it lacks specificity about what is being fixed in the multi-node test.
    Resolution: Clarify the title with specific details, such as 'Fix multi-node test health probing for kserve' or 'Fix multi-node test pod recovery for kserve', to better convey the main change.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/model_serving/model_server/kserve/multi_node/conftest.py (2)

330-332: Worker pod filtering by name substring is fragile but likely sufficient.

WORKER_POD_ROLE in pod.name relies on the pod naming convention containing the role string. If the naming convention changes, this could silently miss worker pods. Consider matching on a pod label, if one exists for the role, for more robust filtering.

#!/bin/bash
# Check what WORKER_POD_ROLE is defined as and if pods have role labels
rg -n "WORKER_POD_ROLE" --type py -C3
rg -n "pod-role|role.*worker|worker.*role" --type py -C3
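The label-first filtering this comment suggests could be sketched as below; the "pod-role"/"role" label keys and the Pod stub are assumptions, since the source does not confirm which label (if any) the pods carry:

```python
WORKER_POD_ROLE = "worker"  # assumed value of the project constant


class Pod:
    """Stand-in for the real Pod resource; real pods expose .labels and .name."""

    def __init__(self, name: str, labels=None):
        self.name = name
        self.labels = labels or {}


def is_worker_pod(pod) -> bool:
    """Prefer an explicit role label; fall back to the name substring check."""
    role = pod.labels.get("pod-role") or pod.labels.get("role")
    if role is not None:
        return role == WORKER_POD_ROLE
    # No role label present: fall back to the fragile name-based check.
    return WORKER_POD_ROLE in pod.name
```

Checking the label first keeps the filter correct even if the naming convention changes, while the fallback preserves current behavior on clusters without the label.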
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines
330 - 332, The current loop filters worker pods using the fragile substring
check WORKER_POD_ROLE in pod.name; update the filter in the loop that iterates
get_pods_by_isvc_label(client=client, isvc=isvc) to prefer an explicit pod label
check (e.g., pod.labels.get('role') or pod.labels.get('pod-role') ==
WORKER_POD_ROLE) and only fall back to the pod.name substring check if no role
label exists; modify the loop body that references WORKER_POD_ROLE and pod.name
to first inspect pod.labels, then continue for matching worker-role pods.

321-348: Broad except Exception is pragmatic here but consider narrowing slightly.

The broad exception catch on line 344 is flagged by static analysis (BLE001). In this probing context, it's reasonable to catch broadly since the probe should return False on any failure. However, catching more specific exceptions (e.g., kubernetes.client.exceptions.ApiException, IOError) would avoid silently swallowing truly unexpected errors like KeyboardInterrupt (though Exception doesn't catch that).

This is a minor observation — the current approach is acceptable for test utility code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines
321 - 348, In _probe_inference_health, replace the broad "except Exception as
exc:" around the pod.execute(command=cmd) call with a narrower catch that
handles expected probe failures (e.g., except
(kubernetes.client.exceptions.ApiException, IOError, OSError) as exc:) and
log/return False for those, but re-raise truly fatal exceptions (SystemExit,
KeyboardInterrupt) if encountered; reference the pod.execute(...) call and the
except block around it when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 24-26: The LOGGER assignment is placed between import statements;
move the module-level LOGGER = get_logger(name=__name__) so that all imports
(including from utilities.general import download_model_data) come first, then
define LOGGER; ensure LOGGER uses get_logger and remains a top-level variable
after the import block and before any other module-level logic in conftest.py.
- Around line 217-225: TimeoutSampler is currently calling
get_pods_by_isvc_generation which raises ResourceNotFoundError when no pods
exist, causing the exception to escape instead of retrying; fix this by passing
exceptions_dict={ResourceNotFoundError: []} into the TimeoutSampler invocation
(the loop that constructs TimeoutSampler with
wait_timeout=Timeout.TIMEOUT_10MIN, sleep=10, func=get_pods_by_isvc_generation,
client=unprivileged_client, isvc=multi_node_inference_service) so the sampler
treats ResourceNotFoundError as a retriable condition and continues polling.

In `@utilities/infra.py`:
- Around line 811-812: Remove the invalid container-level reason from the
pod-phase check: delete CRASH_LOOPBACK_OFF from the tuple used when testing
pod_status.phase so the condition only checks valid pod phases (e.g.,
pod.Status.FAILED); update the code around the failed_pods population (the if
that references pod_status.phase and adds to failed_pods[pod.name]) to only
compare pod_status.phase against pod.Status.FAILED (leave container-level
CRASH_LOOPBACK_OFF handling where container state errors are collected).


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3a82bc8 and cfdc70a.

📒 Files selected for processing (4)
  • tests/model_serving/model_server/kserve/multi_node/conftest.py
  • utilities/inference_utils.py
  • utilities/infra.py
  • utilities/manifests/vllm.py

Signed-off-by: Milind waykole <mwaykole@redhat.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
utilities/infra.py (1)

804-805: ⚠️ Potential issue | 🟡 Minor

Remove CRASH_LOOPBACK_OFF from pod phase check.

On Line 804, CrashLoopBackOff is not a pod phase, so this condition is dead/misleading. Keep it only in container waiting/terminated reason checks.

Suggested fix
-            if pod_status.phase in (pod.Status.CRASH_LOOPBACK_OFF, pod.Status.FAILED):
+            if pod_status.phase == pod.Status.FAILED:
                 failed_pods[pod.name] = pod_status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@utilities/infra.py` around lines 804 - 805, The check comparing
pod_status.phase against pod.Status.CRASH_LOOPBACK_OFF is incorrect because
CrashLoopBackOff is a container waiting/termination reason, not a Pod phase;
update the conditional in the block that populates failed_pods so it only checks
pod_status.phase == pod.Status.FAILED (or other real Pod phases) and remove
pod.Status.CRASH_LOOPBACK_OFF from that phase check, and ensure any
CrashLoopBackOff detection is performed elsewhere by inspecting container
statuses (e.g., iterating pod.status.container_statuses and checking each
container.state.waiting.reason / container.state.terminated.reason for
pod.Status.CRASH_LOOPBACK_OFF).
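Container-level CrashLoopBackOff detection, as the prompt prescribes, could be sketched like this; the small State/ContainerStatus classes are stand-ins for the Kubernetes client's V1ContainerState/V1ContainerStatus models:

```python
CRASH_LOOP_BACK_OFF = "CrashLoopBackOff"  # container reason, not a Pod phase


class _Reason:
    def __init__(self, reason):
        self.reason = reason


class State:
    """Stand-in for V1ContainerState: .waiting / .terminated or None."""

    def __init__(self, waiting_reason=None, terminated_reason=None):
        self.waiting = _Reason(waiting_reason) if waiting_reason else None
        self.terminated = _Reason(terminated_reason) if terminated_reason else None


class ContainerStatus:
    """Stand-in for V1ContainerStatus."""

    def __init__(self, name, state):
        self.name = name
        self.state = state


def crash_looping_containers(container_statuses):
    """Return names of containers whose waiting/terminated reason is CrashLoopBackOff."""
    failing = []
    for cs in container_statuses or []:
        for st in (cs.state.waiting, cs.state.terminated):
            if st is not None and st.reason == CRASH_LOOP_BACK_OFF:
                failing.append(cs.name)
                break
    return failing
```

Valid Pod phases are only Pending, Running, Succeeded, Failed, and Unknown, which is why CrashLoopBackOff must be looked for in container statuses rather than in pod_status.phase.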
tests/model_serving/model_server/kserve/multi_node/conftest.py (1)

219-225: ⚠️ Potential issue | 🔴 Critical

Reintroduced polling breakage: missing retriable exception config in sampler.

This is the same issue previously raised: when get_pods_by_isvc_generation raises ResourceNotFoundError, polling exits early instead of retrying.

Suggested fix
         for sample in TimeoutSampler(
             wait_timeout=Timeout.TIMEOUT_10MIN,
             sleep=10,
+            exceptions_dict={ResourceNotFoundError: []},
             func=get_pods_by_isvc_generation,
             client=unprivileged_client,
             isvc=multi_node_inference_service,
         ):
#!/bin/bash
# Confirm helper behavior and sampler configuration.
UTILS_FILE=$(fd -t f "utils.py" tests/model_serving/model_server/kserve/multi_node | head -n1)
echo "utils file: ${UTILS_FILE}"
rg -n "def get_pods_by_isvc_generation|raise .*ResourceNotFoundError" "${UTILS_FILE}" -A20 -B4
rg -n "patched_multi_node_spec|TimeoutSampler\\(" tests/model_serving/model_server/kserve/multi_node/conftest.py -A20 -B4
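The exceptions_dict fix boils down to a retry-through-expected-exceptions pattern: keep polling while a known "not ready yet" exception is raised, and only give up at the timeout. A stdlib-only sketch of that pattern (poll_until is a simplified stand-in for TimeoutSampler, and ResourceNotFoundError is stubbed locally):

```python
import time


class ResourceNotFoundError(Exception):
    """Stand-in for the project's exception raised while no pods exist yet."""


def poll_until(func, retriable=(), timeout=5.0, sleep=0.01):
    """Call func until it succeeds; exceptions in `retriable` trigger a retry."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return func()
        except retriable:
            if time.monotonic() >= deadline:
                raise TimeoutError("polling timed out")
            time.sleep(sleep)


def make_flaky(fail_times, result):
    """Build a callable that raises ResourceNotFoundError fail_times times, then succeeds."""
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise ResourceNotFoundError("no pods yet")
        return result

    return flaky
```

Without listing the exception as retriable (the empty default), the first ResourceNotFoundError escapes immediately, which is exactly the early-exit failure mode the review describes.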
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines
219 - 225, The TimeoutSampler invocation around get_pods_by_isvc_generation is
missing configuration to treat ResourceNotFoundError as retriable, so the
sampler exits early when get_pods_by_isvc_generation raises
ResourceNotFoundError; update the TimeoutSampler call (the instance created in
conftest.py that currently passes wait_timeout, sleep,
func=get_pods_by_isvc_generation, client=unprivileged_client,
isvc=multi_node_inference_service) to include the retriable exceptions parameter
(e.g., retriable_exceptions or retry_exceptions depending on TimeoutSampler API)
and pass ResourceNotFoundError so the sampler will continue polling until
success instead of failing on the first ResourceNotFoundError from
get_pods_by_isvc_generation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 319-326: Wrap the call to get_pods_by_isvc_label inside
_probe_inference_health with a try/except for ResourceNotFoundError and treat
that case as "no pods yet" (return False or otherwise indicate unhealthy) so the
TimeoutSampler loop keeps retrying; specifically catch ResourceNotFoundError
thrown by get_pods_by_isvc_label, log/ignore it as needed, and ensure
_probe_inference_health returns a falsy value instead of letting the exception
propagate so TimeoutSampler (used in the for loop) can continue retries.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between cfdc70a and 305ba69.

📒 Files selected for processing (4)
  • tests/model_serving/model_server/kserve/multi_node/conftest.py
  • utilities/inference_utils.py
  • utilities/infra.py
  • utilities/manifests/vllm.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • utilities/manifests/vllm.py

@mwaykole mwaykole merged commit 6e1e3f0 into opendatahub-io:main Feb 25, 2026
8 checks passed
@github-actions

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.
