Fix multi-node test for kserve#1144
Signed-off-by: Milind waykole <mwaykole@redhat.com>
📝 Walkthrough

Adds health-probing and recovery checks to multi-node KServe tests, refactors pod-failure detection in infra utilities for finer-grained evaluation, introduces HTTP 500 detection in the inference runner, and relaxes a VLLM output regex to allow minor response variations.
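As a hedged illustration of the HTTP 500 detection mentioned in the walkthrough (the function name and marker strings below are assumptions, not the actual implementation in `utilities/inference_utils.py`):

```python
# Hypothetical sketch: flag a raw inference response that carries an HTTP 500
# server error before trying to parse it as model output.
def response_indicates_server_error(raw_response: str) -> bool:
    # Assumed markers; the real runner may match differently.
    markers = ("HTTP/1.1 500", "500 Internal Server Error")
    return any(marker in raw_response for marker in markers)


print(response_indicates_server_error("HTTP/1.1 500 Internal Server Error\r\n"))  # True
print(response_indicates_server_error('{"predictions": [1, 2, 3]}'))  # False
```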
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ❌ 3 failed checks (2 warnings, 1 inconclusive)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/model_serving/model_server/kserve/multi_node/conftest.py (2)
330-332: Worker pod filtering by name substring is fragile but likely sufficient.
`WORKER_POD_ROLE in pod.name` relies on the pod naming convention containing the role string. If the naming convention changes, this could silently miss worker pods. Consider matching on a pod label if one exists for the role, for more robust filtering.

```bash
#!/bin/bash
# Check what WORKER_POD_ROLE is defined as and if pods have role labels
rg -n "WORKER_POD_ROLE" --type py -C3
rg -n "pod-role\|role.*worker\|worker.*role" --type py -C3
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines 330-332: the current loop filters worker pods using the fragile substring check WORKER_POD_ROLE in pod.name; update the filter in the loop that iterates get_pods_by_isvc_label(client=client, isvc=isvc) to prefer an explicit pod label check (e.g., pod.labels.get('role') or pod.labels.get('pod-role') == WORKER_POD_ROLE) and only fall back to the pod.name substring check if no role label exists; modify the loop body that references WORKER_POD_ROLE and pod.name to first inspect pod.labels, then continue for matching worker-role pods.
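The label-first filter described above can be sketched as follows; this is a hypothetical, self-contained illustration (the `Pod` dataclass, the label keys, and the `WORKER_POD_ROLE` value are all assumptions, not the repository's actual objects):

```python
from dataclasses import dataclass, field

WORKER_POD_ROLE = "worker"  # assumed value of the role constant


@dataclass
class Pod:
    """Stand-in for the real pod object: a name plus a label map."""
    name: str
    labels: dict = field(default_factory=dict)


def is_worker_pod(pod: Pod) -> bool:
    # Prefer an explicit role label when the pod carries one.
    role = pod.labels.get("pod-role") or pod.labels.get("role")
    if role is not None:
        return role == WORKER_POD_ROLE
    # Fall back to the fragile name-substring check only for unlabeled pods.
    return WORKER_POD_ROLE in pod.name


pods = [
    Pod(name="isvc-head-0", labels={"pod-role": "head"}),
    Pod(name="isvc-worker-0", labels={"pod-role": "worker"}),
    Pod(name="isvc-worker-1"),  # unlabeled; matched by the name fallback
]
workers = [p.name for p in pods if is_worker_pod(p)]
print(workers)  # ['isvc-worker-0', 'isvc-worker-1']
```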
321-348: Broad `except Exception` is pragmatic here but consider narrowing slightly.

The broad exception catch on line 344 is flagged by static analysis (BLE001). In this probing context, it's reasonable to catch broadly since the probe should return `False` on any failure. However, catching more specific exceptions (e.g., `kubernetes.client.exceptions.ApiException`, `IOError`) would avoid silently swallowing truly unexpected errors like `KeyboardInterrupt` (though `Exception` doesn't catch that).

This is a minor observation — the current approach is acceptable for test utility code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines 321-348: in _probe_inference_health, replace the broad "except Exception as exc:" around the pod.execute(command=cmd) call with a narrower catch that handles expected probe failures (e.g., except (kubernetes.client.exceptions.ApiException, IOError, OSError) as exc:) and log/return False for those, but re-raise truly fatal exceptions (SystemExit, KeyboardInterrupt) if encountered; reference the pod.execute(...) call and the except block around it when making this change.
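A minimal sketch of the narrowed probe, assuming a stand-in `ApiException` (the real one lives in `kubernetes.client.exceptions`) and a `run_probe` callable in place of `pod.execute(command=cmd)`:

```python
import logging

LOGGER = logging.getLogger(__name__)


class ApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException."""


def probe_inference_health(run_probe) -> bool:
    """Return True if the probe succeeds, False on expected failure types."""
    try:
        run_probe()
    except (ApiException, OSError) as exc:  # IOError is an alias of OSError
        # Expected probe failures: log and report unhealthy so callers retry.
        LOGGER.warning("Inference probe failed: %s", exc)
        return False
    # SystemExit/KeyboardInterrupt derive from BaseException and propagate.
    return True


def broken_probe():
    raise ApiException("HTTP 500 from the model endpoint")


print(probe_inference_health(broken_probe))  # False
print(probe_inference_health(lambda: None))  # True
```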
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 24-26: The LOGGER assignment is placed between import statements;
move the module-level LOGGER = get_logger(name=__name__) so that all imports
(including from utilities.general import download_model_data) come first, then
define LOGGER; ensure LOGGER uses get_logger and remains a top-level variable
after the import block and before any other module-level logic in conftest.py.
- Around line 217-225: TimeoutSampler is currently calling
get_pods_by_isvc_generation which raises ResourceNotFoundError when no pods
exist, causing the exception to escape instead of retrying; fix this by passing
exceptions_dict={ResourceNotFoundError: []} into the TimeoutSampler invocation
(the loop that constructs TimeoutSampler with
wait_timeout=Timeout.TIMEOUT_10MIN, sleep=10, func=get_pods_by_isvc_generation,
client=unprivileged_client, isvc=multi_node_inference_service) so the sampler
treats ResourceNotFoundError as a retriable condition and continues polling.
In `@utilities/infra.py`:
- Around line 811-812: Remove the invalid container-level reason from the
pod-phase check: delete CRASH_LOOPBACK_OFF from the tuple used when testing
pod_status.phase so the condition only checks valid pod phases (e.g.,
pod.Status.FAILED); update the code around the failed_pods population (the if
that references pod_status.phase and adds to failed_pods[pod.name]) to only
compare pod_status.phase against pod.Status.FAILED (leave container-level
CRASH_LOOPBACK_OFF handling where container state errors are collected).
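The phase/reason split this comment asks for can be sketched as below; the dataclasses are hypothetical stand-ins for the real pod status objects, and only the check structure mirrors the suggestion:

```python
from dataclasses import dataclass, field
from typing import Optional

FAILED = "Failed"  # a real Pod phase
CRASH_LOOPBACK_OFF = "CrashLoopBackOff"  # a container waiting/terminated reason


@dataclass
class ContainerState:
    waiting_reason: Optional[str] = None
    terminated_reason: Optional[str] = None


@dataclass
class PodStatus:
    phase: str
    container_states: list = field(default_factory=list)


def pod_is_failed(status: PodStatus) -> bool:
    # Pod phases are Pending/Running/Succeeded/Failed/Unknown;
    # CrashLoopBackOff is never one of them, so only Failed is checked here.
    return status.phase == FAILED


def has_crash_looping_container(status: PodStatus) -> bool:
    # CrashLoopBackOff is detected from container-level reasons instead.
    return any(
        CRASH_LOOPBACK_OFF in (state.waiting_reason, state.terminated_reason)
        for state in status.container_states
    )


crashing = PodStatus(
    phase="Running",
    container_states=[ContainerState(waiting_reason=CRASH_LOOPBACK_OFF)],
)
print(pod_is_failed(crashing), has_crash_looping_container(crashing))  # False True
```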
---
Nitpick comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 330-332: The current loop filters worker pods using the fragile
substring check WORKER_POD_ROLE in pod.name; update the filter in the loop that
iterates get_pods_by_isvc_label(client=client, isvc=isvc) to prefer an explicit
pod label check (e.g., pod.labels.get('role') or pod.labels.get('pod-role') ==
WORKER_POD_ROLE) and only fall back to the pod.name substring check if no role
label exists; modify the loop body that references WORKER_POD_ROLE and pod.name
to first inspect pod.labels, then continue for matching worker-role pods.
- Around line 321-348: In _probe_inference_health, replace the broad "except
Exception as exc:" around the pod.execute(command=cmd) call with a narrower
catch that handles expected probe failures (e.g., except
(kubernetes.client.exceptions.ApiException, IOError, OSError) as exc:) and
log/return False for those, but re-raise truly fatal exceptions (SystemExit,
KeyboardInterrupt) if encountered; reference the pod.execute(...) call and the
except block around it when making this change.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (4)

- tests/model_serving/model_server/kserve/multi_node/conftest.py
- utilities/inference_utils.py
- utilities/infra.py
- utilities/manifests/vllm.py
Signed-off-by: Milind waykole <mwaykole@redhat.com>
Force-pushed from cfdc70a to 305ba69.
Actionable comments posted: 1
♻️ Duplicate comments (2)
utilities/infra.py (1)
804-805: ⚠️ Potential issue | 🟡 Minor

Remove `CRASH_LOOPBACK_OFF` from pod phase check.

On line 804, `CrashLoopBackOff` is not a pod phase, so this condition is dead/misleading. Keep it only in container waiting/terminated reason checks.

Suggested fix:

```diff
-    if pod_status.phase in (pod.Status.CRASH_LOOPBACK_OFF, pod.Status.FAILED):
+    if pod_status.phase == pod.Status.FAILED:
         failed_pods[pod.name] = pod_status
```

In Kubernetes, is CrashLoopBackOff a Pod phase or a container waiting reason? Also list the valid Pod phase values from official Kubernetes documentation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@utilities/infra.py` around lines 804-805: the check comparing pod_status.phase against pod.Status.CRASH_LOOPBACK_OFF is incorrect because CrashLoopBackOff is a container waiting/termination reason, not a Pod phase; update the conditional in the block that populates failed_pods so it only checks pod_status.phase == pod.Status.FAILED (or other real Pod phases) and remove pod.Status.CRASH_LOOPBACK_OFF from that phase check, and ensure any CrashLoopBackOff detection is performed elsewhere by inspecting container statuses (e.g., iterating pod.status.container_statuses and checking each container.state.waiting.reason / container.state.terminated.reason for pod.Status.CRASH_LOOPBACK_OFF).

tests/model_serving/model_server/kserve/multi_node/conftest.py (1)
219-225: ⚠️ Potential issue | 🔴 Critical

Reintroduced polling breakage: missing retriable exception config in sampler.

This is the same issue previously raised: when `get_pods_by_isvc_generation` raises `ResourceNotFoundError`, polling exits early instead of retrying.

Suggested fix:

```diff
     for sample in TimeoutSampler(
         wait_timeout=Timeout.TIMEOUT_10MIN,
         sleep=10,
+        exceptions_dict={ResourceNotFoundError: []},
         func=get_pods_by_isvc_generation,
         client=unprivileged_client,
         isvc=multi_node_inference_service,
     ):
```

```bash
#!/bin/bash
# Confirm helper behavior and sampler configuration.
UTILS_FILE=$(fd -t f "utils.py" tests/model_serving/model_server/kserve/multi_node | head -n1)
echo "utils file: ${UTILS_FILE}"
rg -n "def get_pods_by_isvc_generation|raise .*ResourceNotFoundError" "${UTILS_FILE}" -A20 -B4
rg -n "patched_multi_node_spec|TimeoutSampler\\(" tests/model_serving/model_server/kserve/multi_node/conftest.py -A20 -B4
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/model_serving/model_server/kserve/multi_node/conftest.py` around lines 219-225: the TimeoutSampler invocation around get_pods_by_isvc_generation is missing configuration to treat ResourceNotFoundError as retriable, so the sampler exits early when get_pods_by_isvc_generation raises ResourceNotFoundError; update the TimeoutSampler call (the instance created in conftest.py that currently passes wait_timeout, sleep, func=get_pods_by_isvc_generation, client=unprivileged_client, isvc=multi_node_inference_service) to include the retriable exceptions parameter (exceptions_dict, as in the suggested fix) and pass ResourceNotFoundError so the sampler will continue polling until success instead of failing on the first ResourceNotFoundError from get_pods_by_isvc_generation.
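To illustrate why the retriable-exception configuration matters, here is a toy, self-contained sampler that mimics only that one behavior; `MiniSampler` is not the real `TimeoutSampler` API, and `get_pods` is a hypothetical stand-in for `get_pods_by_isvc_generation`:

```python
import time


class ResourceNotFoundError(Exception):
    pass


class MiniSampler:
    """Toy sampler: retries func until timeout, swallowing retriable exceptions."""

    def __init__(self, wait_timeout, sleep, func, exceptions_dict=None, **func_kwargs):
        self.wait_timeout = wait_timeout
        self.sleep = sleep
        self.func = func
        self.retriable = tuple(exceptions_dict or {})  # exception classes to retry on
        self.func_kwargs = func_kwargs

    def __iter__(self):
        deadline = time.monotonic() + self.wait_timeout
        while time.monotonic() < deadline:
            try:
                yield self.func(**self.func_kwargs)
            except self.retriable:
                pass  # retriable failure: poll again after sleeping
            time.sleep(self.sleep)
        raise TimeoutError("sampler timed out")


calls = {"n": 0}


def get_pods(client=None, isvc=None):
    # Fails twice before pods appear, like a just-created InferenceService.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ResourceNotFoundError("no pods yet")
    return ["pod-0", "pod-1"]


for sample in MiniSampler(
    wait_timeout=5,
    sleep=0.01,
    func=get_pods,
    exceptions_dict={ResourceNotFoundError: []},  # mirrors the suggested fix
    client=None,
    isvc=None,
):
    if sample:
        print(sample)  # ['pod-0', 'pod-1'] on the third attempt
        break
```

Without `exceptions_dict`, the first `ResourceNotFoundError` would propagate out of the loop, which is exactly the early-exit behavior the review flags.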
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 319-326: Wrap the call to get_pods_by_isvc_label inside
_probe_inference_health with a try/except for ResourceNotFoundError and treat
that case as "no pods yet" (return False or otherwise indicate unhealthy) so the
TimeoutSampler loop keeps retrying; specifically catch ResourceNotFoundError
thrown by get_pods_by_isvc_label, log/ignore it as needed, and ensure
_probe_inference_health returns a falsy value instead of letting the exception
propagate so TimeoutSampler (used in the for loop) can continue retries.
---
Duplicate comments:
In `@tests/model_serving/model_server/kserve/multi_node/conftest.py`:
- Around line 219-225: The TimeoutSampler invocation around
get_pods_by_isvc_generation is missing configuration to treat
ResourceNotFoundError as retriable, so the sampler exits early when
get_pods_by_isvc_generation raises ResourceNotFoundError; update the
TimeoutSampler call (the instance created in conftest.py that currently passes
wait_timeout, sleep, func=get_pods_by_isvc_generation,
client=unprivileged_client, isvc=multi_node_inference_service) to include the
retriable exceptions parameter (e.g., retriable_exceptions or retry_exceptions
depending on TimeoutSampler API) and pass ResourceNotFoundError so the sampler
will continue polling until success instead of failing on the first
ResourceNotFoundError from get_pods_by_isvc_generation.
In `@utilities/infra.py`:
- Around line 804-805: The check comparing pod_status.phase against
pod.Status.CRASH_LOOPBACK_OFF is incorrect because CrashLoopBackOff is a
container waiting/termination reason, not a Pod phase; update the conditional in
the block that populates failed_pods so it only checks pod_status.phase ==
pod.Status.FAILED (or other real Pod phases) and remove
pod.Status.CRASH_LOOPBACK_OFF from that phase check, and ensure any
CrashLoopBackOff detection is performed elsewhere by inspecting container
statuses (e.g., iterating pod.status.container_statuses and checking each
container.state.waiting.reason / container.state.terminated.reason for
pod.Status.CRASH_LOOPBACK_OFF).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (4)

- tests/model_serving/model_server/kserve/multi_node/conftest.py
- utilities/inference_utils.py
- utilities/infra.py
- utilities/manifests/vllm.py
🚧 Files skipped from review as they are similar to previous changes (1)
- utilities/manifests/vllm.py
Status of building tag latest: success.
Summary
Related Issues
How it has been tested
Additional Requirements
Summary by CodeRabbit
Bug Fixes
Tests
Documentation