Refactor llm-d tests to use declarative config classes#1182

Merged
mwaykole merged 15 commits into opendatahub-io:main from threcc:refactor-llmd
Mar 11, 2026

Conversation

@threcc
Contributor

@threcc threcc commented Mar 6, 2026

Description

Replace the imperative fixture approach in llmd/ with declarative config classes.

Why

The current llmd/ tests pass large dictionaries of kwargs through fixtures (llmd_inference_service, llmd_inference_service_s3, llmd_inference_service_gpu, etc.), each with inline defaults, conditional branches, and duplicated resource definitions.
Adding a new test scenario means either adding more kwargs to an already-complex fixture or writing a new fixture from scratch.

What changes

  • Config classes (LLMISvcConfig, CpuConfig, GpuConfig) define model, resources, probes, and env as class attributes with sensible defaults. Each test scenario is a small subclass that overrides only what differs.

  • Single llmisvc fixture replaces 4+ specialized fixtures. It reads everything from the config class — no conditional logic, no kwargs sprawl.

  • Flat test directory — no subdirectories. Each test file derives its namespace from its filename.

  • Per-config timeouts — CPU configs fail fast (3 min), GPU configs get 8 min, instead of a blanket 15 min for everything.

  • Test markers aligned with the new gating tiers:

    • smoke — runs on every RHOAI build (~5 min per component)
    • tier1 — runs daily (~15 min per component)
    • tier2 — runs weekly, no time limit
    • tier3 — runs weekly, no time limit (negative/destructive tests)
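The config-class idea above can be sketched as a small class hierarchy. The names (LLMISvcConfig, CpuConfig, GpuConfig, TinyLlamaOciConfig, describe()) and the timeout values come from this PR's description and review; the concrete fields, resource values, and method bodies are illustrative assumptions, not the PR's actual code:

```python
class LLMISvcConfig:
    """Base config: every field is a class attribute with a sensible default."""

    name: str = ""
    storage_uri: str = ""
    timeout: int = 15 * 60  # the old blanket timeout, in seconds

    @classmethod
    def container_resources(cls) -> dict:
        return {}

    @classmethod
    def describe(cls, namespace: str) -> str:
        return f"{cls.__name__}: name={cls.name!r} namespace={namespace!r} timeout={cls.timeout}s"


class CpuConfig(LLMISvcConfig):
    timeout = 3 * 60  # CPU configs fail fast

    @classmethod
    def container_resources(cls) -> dict:
        return {"limits": {"cpu": "2", "memory": "8Gi"}}


class GpuConfig(LLMISvcConfig):
    timeout = 8 * 60  # GPU configs get more headroom

    @classmethod
    def container_resources(cls) -> dict:
        return {"limits": {"cpu": "4", "memory": "32Gi", "nvidia.com/gpu": "1"}}


class TinyLlamaOciConfig(CpuConfig):
    # A test scenario overrides only what differs from its parent.
    name = "tiny-llama-oci"
    storage_uri = "oci://example/tiny-llama"  # placeholder URI
```

A single fixture can then read everything off the class — name, storage URI, resources, timeout — with no conditional branches; adding a scenario means adding a subclass, not another fixture.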

Summary by CodeRabbit

  • Tests

    • Expanded end-to-end coverage with many new CPU/GPU smoke, auth, connection, kueue, scheduler/no-scheduler, prefix-cache (estimated & precise), and prefill-decode tests; removed several legacy test modules.
    • New external TLS-aware request helpers, metric-based routing assertions, and scheduler decision inspection utilities.
  • Refactor

    • Switched tests to a config-driven service creation model and unified/authenticated fixture management.
  • Chores

    • Added an alias for an additional HuggingFace model in model storage constants.
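The request helpers named in the summary follow the OpenAI-style chat-completions shape. The parse_completion_text body below matches the version quoted later in the review; build_chat_payload is a hypothetical helper added here only to show the payload shape the tests exercise:

```python
import json


def build_chat_payload(model: str, prompt: str) -> dict:
    # Hypothetical helper: the minimal chat-completions request body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def parse_completion_text(response_body: str) -> str:
    """Extract completion text from a chat completion response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]
```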

@github-actions

github-actions bot commented Mar 6, 2026

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove it, comment /wip cancel.
  • To block merging of a PR, add /hold in a comment. To unblock merging, comment /hold cancel.
  • To mark a PR as approved, add /lgtm in a comment. To remove, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name>. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, add /build-push-pr-image in a comment. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/lgtm', '/wip', '/build-push-pr-image', '/hold', '/verified', '/cherry-pick'}

@threcc threcc changed the title from "Refactor llmd" to "Refactor llm-d tests to use declarative config classes" Mar 10, 2026
@threcc threcc marked this pull request as ready for review March 10, 2026 15:15
@threcc threcc requested review from a team, Raghul-M, brettmthompson and mwaykole as code owners March 10, 2026 15:15
@opendatahub-io opendatahub-io deleted a comment from coderabbitai bot Mar 10, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/model_serving/model_server/llmd/test_llmd_kueue_integration.py (2)

139-152: ⚠️ Potential issue | 🟡 Minor

Same stale-variable issue with running_pods and gated_pods.

Apply the same pattern to capture the last observed values before the exception is raised.

Proposed fix
         # Verify that Kueue correctly gates the second pod.
+        last_running, last_gated = None, None
         try:
             for running_pods, gated_pods in TimeoutSampler(
                 wait_timeout=120,
                 sleep=5,
                 func=lambda: check_gated_pods_and_running_pods(selector_labels, llmisvc.namespace, unprivileged_client),
             ):
+                last_running, last_gated = running_pods, gated_pods
                 if running_pods == EXPECTED_RUNNING_PODS and gated_pods == EXPECTED_GATED_PODS:
                     break
         except TimeoutExpiredError:
             raise UnexpectedResourceCountError(
                 "Timeout: "
                 f"Expected {EXPECTED_RUNNING_PODS} running and {EXPECTED_GATED_PODS} gated pods. "
-                f"Found {running_pods} running and {gated_pods} gated."
+                f"Found {last_running} running and {last_gated} gated."
             ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/test_llmd_kueue_integration.py` around
lines 139 - 152, The TimeoutSampler loop may raise TimeoutExpiredError without
capturing the last observed running_pods and gated_pods, causing stale-variable
use in the exception message; modify the try block so you assign the sampler
result to local variables before breaking (e.g., iterate and set last_running,
last_gated from the tuple returned by check_gated_pods_and_running_pods inside
the for loop), then in the except TimeoutExpiredError handler reference those
last_running and last_gated variables when raising UnexpectedResourceCountError
(update references to running_pods/gated_pods to these captured names) so the
error message shows the final observed counts from the TimeoutSampler that
invoked check_gated_pods_and_running_pods with llmisvc.namespace and
unprivileged_client.
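The capture-the-last-value pattern from the proposed fix can be shown self-contained. fake_sampler below is a stand-in for TimeoutSampler (which yields samples until its timeout fires); the loop body mirrors the diff above:

```python
class TimeoutExpiredError(Exception):
    pass


def fake_sampler(samples):
    # Stand-in for TimeoutSampler: yields each sample, then "times out".
    yield from samples
    raise TimeoutExpiredError()


EXPECTED_RUNNING_PODS, EXPECTED_GATED_PODS = 1, 1

last_running, last_gated = None, None
message = ""
try:
    for running_pods, gated_pods in fake_sampler([(0, 0), (1, 0)]):
        # Capture before the next iteration can raise, so the except
        # block reports the final observed counts, not stale locals.
        last_running, last_gated = running_pods, gated_pods
        if running_pods == EXPECTED_RUNNING_PODS and gated_pods == EXPECTED_GATED_PODS:
            break
except TimeoutExpiredError:
    message = f"Found {last_running} running and {last_gated} gated."
```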

124-136: ⚠️ Potential issue | 🟡 Minor

Stale variable in exception message if loop never yields.

If TimeoutSampler raises TimeoutExpiredError before the first yield (edge case with very short timeout or immediate failure), replicas in line 135 references the value from line 115, not the actual last observed value. Consider capturing the last value explicitly.

Proposed fix
         # Check the deployment until it has 2 replicas, which means it's been updated
+        last_replicas = None
         try:
             for replicas in TimeoutSampler(
                 wait_timeout=60,
                 sleep=2,
                 func=lambda: self._get_deployment_status_replicas(deployment),
             ):
+                last_replicas = replicas
                 if replicas == EXPECTED_UPDATED_REPLICAS:
                     break
         except TimeoutExpiredError:
             raise UnexpectedResourceCountError(
                 f"Timeout waiting for deployment to update. "
-                f"Expected {EXPECTED_UPDATED_REPLICAS} replicas, found {replicas}."
+                f"Expected {EXPECTED_UPDATED_REPLICAS} replicas, found {last_replicas}."
             ) from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/test_llmd_kueue_integration.py` around
lines 124 - 136, The exception message uses the loop variable replicas which may
be stale if TimeoutSampler raises before the first yield; modify the block
around TimeoutSampler to track the last observed value (e.g., introduce
last_replicas = None before the try, assign last_replicas = replicas inside the
loop each iteration, and check last_replicas == EXPECTED_UPDATED_REPLICAS to
break), and when raising UnexpectedResourceCountError in the TimeoutExpiredError
handler reference last_replicas (or a safe placeholder like "None"/"no value")
instead of replicas to ensure the message reflects the actual last observed
value; update references to EXPECTED_UPDATED_REPLICAS and
TimeoutSampler/TimeoutExpiredError accordingly.
🧹 Nitpick comments (4)
tests/model_serving/model_server/llmd/utils.py (2)

280-289: Consider LOGGER.exception for stack trace preservation.

Line 286 uses LOGGER.error which loses the traceback. LOGGER.exception would preserve the full stack trace for debugging failed requests.

♻️ Proposed change
         except Exception as e:  # noqa: BLE001
-            LOGGER.error(f"Request {i + 1}/{count} failed: {e}")
+            LOGGER.exception(f"Request {i + 1}/{count} failed: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 280 - 289, The
catch block in the loop calling send_chat_completions currently logs errors with
LOGGER.error which discards the traceback; replace LOGGER.error(f"Request {i +
1}/{count} failed: {e}") with a call to LOGGER.exception using the same message
so the exception stack trace is preserved (keep the surrounding try/except,
variables i, count and the message format intact).

155-158: Missing error handling for malformed responses.

json.loads and dictionary access will raise exceptions on malformed or unexpected response structures. Consider wrapping in try/except for clearer error messages in test failures.

🛡️ Proposed defensive handling
 def parse_completion_text(response_body: str) -> str:
     """Extract completion text from a chat completion response."""
-    data = json.loads(response_body)
-    return data["choices"][0]["message"]["content"]
+    try:
+        data = json.loads(response_body)
+        return data["choices"][0]["message"]["content"]
+    except (json.JSONDecodeError, KeyError, IndexError) as e:
+        raise ValueError(f"Failed to parse completion response: {e}\nBody: {response_body[:500]}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 155 - 158, The
function parse_completion_text should defensively handle malformed responses:
wrap json.loads(response_body) and the subsequent dict/list accesses in a
try/except that catches json.JSONDecodeError, KeyError, IndexError and
TypeError, and raise a clearer ValueError (or re-raise with additional context)
that includes the original response_body and the underlying exception message;
update references inside parse_completion_text to validate that data["choices"]
is a non-empty list and that the first choice contains ["message"]["content"]
before returning it.
tests/model_serving/model_server/llmd/llmd_configs/config_prefill_decode.py (1)

12-17: Duplicated resource limits from GpuConfig.

container_resources() duplicates the exact values defined in GpuConfig.container_resources(). Since QwenS3Config inherits from GpuConfig, this override is redundant and violates DRY.

♻️ Option 1: Remove the override entirely (if no changes needed)
-    @classmethod
-    def container_resources(cls):
-        return {
-            "limits": {"cpu": "4", "memory": "32Gi", "nvidia.com/gpu": "1"},
-            "requests": {"cpu": "2", "memory": "16Gi", "nvidia.com/gpu": "1"},
-        }
♻️ Option 2: Call parent if customization is planned
     @classmethod
     def container_resources(cls):
-        return {
-            "limits": {"cpu": "4", "memory": "32Gi", "nvidia.com/gpu": "1"},
-            "requests": {"cpu": "2", "memory": "16Gi", "nvidia.com/gpu": "1"},
-        }
+        # Inherit GPU config; override if prefill needs different resources
+        return super().container_resources()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/llmd_configs/config_prefill_decode.py`
around lines 12 - 17, QwenS3Config.container_resources duplicates
GpuConfig.container_resources; remove the redundant override or delegate to the
parent to follow DRY: delete the container_resources method from QwenS3Config
(or replace its body with a call to super().container_resources() if you plan to
customize later) so the class inherits the GPU resource settings from GpuConfig
and avoids duplication.
tests/model_serving/model_server/llmd/conftest.py (1)

236-243: Fail fast on incomplete config classes.

LLMISvcConfig defaults name and storage_uri to empty strings. If a test forgets one of them, this helper builds an invalid resource and the failure shows up later as CRD validation noise or a 180/480s timeout instead of an immediate fixture error.

Patch
 @contextmanager
 def _create_llmisvc_from_config(
     config_cls: type,
     namespace: str,
     client: DynamicClient,
     service_account: str | None = None,
 ) -> Generator[LLMInferenceService, Any]:
     """Create an LLMInferenceService from a config class."""
+    missing = [field for field in ("name", "storage_uri") if not getattr(config_cls, field, None)]
+    if missing:
+        raise ValueError(
+            f"{config_cls.__name__} is missing required config field(s): {', '.join(missing)}"
+        )
+
     LOGGER.info(f"\n{config_cls.describe(namespace=namespace)}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/conftest.py` around lines 236 - 243,
The helper _create_llmisvc_from_config builds an LLMInferenceService from a
config class that may have default empty name or storage_uri
(LLMISvcConfig.name, LLMISvcConfig.storage_uri), causing late CRD validation or
long timeouts; update _create_llmisvc_from_config to validate the config before
building the resource: instantiate or access the provided config_cls, assert
that name and storage_uri are non-empty (raise a clear exception referencing
LLMISvcConfig/name/storage_uri and the config class) so tests fail immediately
with a descriptive error when those required fields are missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/llmd/conftest.py`:
- Around line 22-23: The imports in conftest.py are stale and should be
package-local: replace the absolute imports that reference
tests.model_serving.model_server.llmd_v2 with relative imports so the module
finds TinyLlamaOciConfig and wait_for_llmisvc locally; specifically update the
import statements that bring in TinyLlamaOciConfig and wait_for_llmisvc to use
module-local (relative) imports from the current llmd package (e.g., use
relative import syntax to import TinyLlamaOciConfig from llmd_configs and
wait_for_llmisvc from utils).

In
`@tests/model_serving/model_server/llmd/llmd_configs/config_estimated_prefix_cache.py`:
- Around line 25-26: The VLLM_ADDITIONAL_ARGS value is missing the explicit
--enable-prefix-caching switch so prefix caching may remain disabled; update the
environment variable string that currently uses cls.hash_algo and cls.block_size
to also include --enable-prefix-caching (e.g., add the flag alongside the
existing --prefix-caching-hash-algo {cls.hash_algo} --block-size
{cls.block_size}) so the Estimated Prefix Cache config explicitly enables
Automatic Prefix Caching; modify the code that sets "VLLM_ADDITIONAL_ARGS" in
the config_estimated_prefix_cache class/constructor where cls.hash_algo and
cls.block_size are referenced.

In `@tests/model_serving/model_server/llmd/test_llmd_auth.py`:
- Around line 3-6: The test imports are referencing the old package path
'tests.model_serving.model_server.llmd_v2.utils'—update the import statements in
test_llmd_auth.py to point to 'tests.model_serving.model_server.llmd.utils' so
ns_from_file, parse_completion_text, and send_chat_completions are imported from
the renamed package; locate the import line that currently mentions llmd_v2 and
change it to llmd without altering the imported symbol names.

In `@tests/model_serving/model_server/llmd/test_llmd_connection_cpu.py`:
- Around line 4-9: Update the stale import paths that reference "llmd_v2" to the
new "llmd" package: replace the imports for TinyLlamaHfConfig and
TinyLlamaS3Config (currently imported from
tests.model_serving.model_server.llmd_v2.llmd_configs) and the utility imports
ns_from_file, parse_completion_text, send_chat_completions (currently from
tests.model_serving.model_server.llmd_v2.utils) so they import from
tests.model_serving.model_server.llmd.llmd_configs and
tests.model_serving.model_server.llmd.utils respectively; keep the symbol names
unchanged (TinyLlamaHfConfig, TinyLlamaS3Config, ns_from_file,
parse_completion_text, send_chat_completions) so code referencing those names
continues to work.

In `@tests/model_serving/model_server/llmd/test_llmd_connection_gpu.py`:
- Around line 4-9: The test imports still reference the old package name
llmd_v2; update the import statements so QwenHfConfig and QwenS3Config are
imported from tests.model_serving.model_server.llmd and ns_from_file,
parse_completion_text, send_chat_completions are imported from
tests.model_serving.model_server.llmd.utils (i.e., replace llmd_v2 with llmd) so
the test uses the refactored config/helpers instead of the legacy package.
- Around line 40-41: The GPU availability check must run before the expensive
llmisvc provisioning; ensure the session-scoped skip_if_no_gpu_available fixture
is executed earlier by adding it as an explicit dependency instead of an in-test
check: either add skip_if_no_gpu_available as a parameter to the test function
that uses the llmisvc (so pytest runs it before instantiating the class-scoped
llmisvc fixture) or add skip_if_no_gpu_available as a dependency inside the
llmisvc fixture definition (so llmisvc will be skipped during setup). Reference
llmisvc and the _create_llmisvc_from_config context manager when making the
change so the skip prevents entering the expensive provisioning path.

In `@tests/model_serving/model_server/llmd/test_llmd_kueue_integration.py`:
- Around line 6-11: The imports at the top of the test use the old package name
`llmd_v2`; update the import paths to the renamed package `llmd` so the test can
locate TinyLlamaOciConfig and the helper functions. Specifically, replace
references to tests.model_serving.model_server.llmd_v2 (for TinyLlamaOciConfig,
ns_from_file, parse_completion_text, send_chat_completions) with
tests.model_serving.model_server.llmd so the symbols TinyLlamaOciConfig,
ns_from_file, parse_completion_text, and send_chat_completions resolve
correctly.

In `@tests/model_serving/model_server/llmd/test_llmd_no_scheduler.py`:
- Around line 4-9: The imports in the test reference the old package name
llmd_v2; update them to import from the refactored llmd package so the test uses
the new module. Specifically, change the import targets for QwenS3Config and the
utilities ns_from_file, parse_completion_text, and send_chat_completions to
import from tests.model_serving.model_server.llmd.llmd_configs and
tests.model_serving.model_server.llmd.utils respectively, ensuring the test file
under llmd/ exercises the moved code.
- Around line 45-46: The GPU availability check (gpu_count_on_cluster) must run
before the class-scoped llmisvc is created because pytest resolves class
fixtures before function fixtures; move the skip logic into a fixture that runs
earlier (e.g., make a new fixture or add the check inside
unprivileged_model_namespace) and have llmisvc depend on that fixture, or change
gpu_count_on_cluster to a fixture with scope that llmisvc depends on; ensure the
check occurs before calling _create_llmisvc_from_config() so the deployment is
skipped when GPUs are unavailable.

In `@tests/model_serving/model_server/llmd/test_llmd_prefill_decode.py`:
- Around line 4-9: The imports at the top use the stale package path llmd_v2;
update both imports to import PrefillDecodeConfig from
tests.model_serving.model_server.llmd.llmd_configs and ns_from_file,
parse_completion_text, send_chat_completions from
tests.model_serving.model_server.llmd.utils (or replace any other occurrences of
llmd_v2 in this test) so the module resolution uses llmd instead of llmd_v2;
ensure the imported symbols (PrefillDecodeConfig, ns_from_file,
parse_completion_text, send_chat_completions) remain unchanged.

In
`@tests/model_serving/model_server/llmd/test_llmd_singlenode_estimated_prefix_cache.py`:
- Around line 6-13: The imports at the top of the test reference the wrong
package path: replace occurrences of tests.model_serving.model_server.llmd_v2
with tests.model_serving.model_server.llmd so the module resolution points to
the existing llmd package; specifically update the import lines that bring in
EstimatedPrefixCacheConfig and the utility symbols assert_prefix_cache_routing,
get_llmd_router_scheduler_pod, get_llmd_workload_pods, ns_from_file, and
send_prefix_cache_requests to use tests.model_serving.model_server.llmd instead
of tests.model_serving.model_server.llmd_v2.

In
`@tests/model_serving/model_server/llmd/test_llmd_singlenode_precise_prefix_cache.py`:
- Around line 65-76: The test currently uses the value returned by
send_prefix_cache_requests (variable successful) directly in
assert_prefix_cache_routing and assert_scheduler_routing, which can let the test
pass vacuously if successful < NUM_REQUESTS; add a fail-fast check right after
the call to send_prefix_cache_requests to assert that successful == NUM_REQUESTS
(or at minimum successful > 0) and raise a clear test failure if not,
referencing the send_prefix_cache_requests return value (successful) and
NUM_REQUESTS so the test aborts early instead of feeding incomplete traffic into
assert_prefix_cache_routing and assert_scheduler_routing.
- Around line 6-14: The imports at the top of the test are pointing to the old
package llmd_v2; update them to import PrecisePrefixCacheConfig and the helper
functions (assert_prefix_cache_routing, assert_scheduler_routing,
get_llmd_router_scheduler_pod, get_llmd_workload_pods, ns_from_file,
send_prefix_cache_requests) from tests.model_serving.model_server.llmd instead
of tests.model_serving.model_server.llmd_v2 so the test uses the renamed llmd
package and the new declarative config surface.
- Around line 55-56: The GPU-count skip currently runs in the test body after
the class-scoped llmisvc fixture (and its blocking wait_for_llmisvc()) has
already provisioned resources; move the check so it runs before llmisvc by
creating a lightweight fixture (e.g., gpu_count_required or require_min_gpus)
that reads gpu_count_on_cluster and calls pytest.skip or raise
pytest.skip.Exception when count < 2, and declare that fixture as a dependency
of the test class or as a module/class-scoped fixture used by llmisvc, or
alternatively replace the inline check with pytest.mark.skipif(condition)
applied at collection; ensure you reference the llmisvc fixture and the
wait_for_llmisvc call when relocating the guard so provisioning is avoided.

In `@tests/model_serving/model_server/llmd/test_llmd_smoke.py`:
- Around line 4-9: The test imports use the stale package path "llmd_v2" causing
ModuleNotFoundError; update the imports to use "llmd" instead. Locate the import
block that references TinyLlamaOciConfig and the utility functions ns_from_file,
parse_completion_text, send_chat_completions and change their module path from
tests.model_serving.model_server.llmd_v2.* to
tests.model_serving.model_server.llmd.* so TinyLlamaOciConfig and the three
utility symbols import from the correct package.

In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 292-322: The lookback_seconds parameter in
get_scheduler_decision_logs is unused; either remove it or pass it into the pod
log call to implement time-based filtering — update
router_scheduler_pod.log(...) to include the appropriate
since_seconds/since_time argument (e.g., since_seconds=lookback_seconds)
supported by the Pod.log API, and update the docstring accordingly so the
function (get_scheduler_decision_logs) actually retrieves logs limited to the
requested lookback window.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: cced964a-0f8a-4107-b4f8-f11cada2ffe2

📥 Commits

Reviewing files that changed from the base of the PR and between a811bea and d7c2546.

📒 Files selected for processing (26)
  • tests/model_serving/model_server/llmd/__init__.py
  • tests/model_serving/model_server/llmd/conftest.py
  • tests/model_serving/model_server/llmd/constants.py
  • tests/model_serving/model_server/llmd/kueue/__init__.py
  • tests/model_serving/model_server/llmd/llmd_configs/__init__.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_models.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_precise_prefix_cache.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_prefill_decode.py
  • tests/model_serving/model_server/llmd/test_llmd_auth.py
  • tests/model_serving/model_server/llmd/test_llmd_connection_cpu.py
  • tests/model_serving/model_server/llmd/test_llmd_connection_gpu.py
  • tests/model_serving/model_server/llmd/test_llmd_kueue_integration.py
  • tests/model_serving/model_server/llmd/test_llmd_no_scheduler.py
  • tests/model_serving/model_server/llmd/test_llmd_oci_cpu.py
  • tests/model_serving/model_server/llmd/test_llmd_prefill_decode.py
  • tests/model_serving/model_server/llmd/test_llmd_s3.py
  • tests/model_serving/model_server/llmd/test_llmd_s3_gpu.py
  • tests/model_serving/model_server/llmd/test_llmd_singlenode_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/test_llmd_singlenode_precise_prefix_cache.py
  • tests/model_serving/model_server/llmd/test_llmd_smoke.py
  • tests/model_serving/model_server/llmd/test_singlenode_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/utils.py
  • utilities/constants.py
  • utilities/llmd_constants.py
💤 Files with no reviewable changes (5)
  • tests/model_serving/model_server/llmd/constants.py
  • tests/model_serving/model_server/llmd/test_llmd_oci_cpu.py
  • tests/model_serving/model_server/llmd/test_llmd_s3.py
  • tests/model_serving/model_server/llmd/test_llmd_s3_gpu.py
  • tests/model_serving/model_server/llmd/test_singlenode_estimated_prefix_cache.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

302-320: ⚠️ Potential issue | 🟡 Minor

Use lookback_seconds when reading scheduler logs.

Ignoring this window makes assert_scheduler_routing() susceptible to stale decisions from earlier requests and forces a full log scan on every retry. Thread the requested lookback into Pod.log(...) so the assertion only inspects fresh scheduler output.

Proposed fix
-    raw_logs = router_scheduler_pod.log()
+    raw_logs = router_scheduler_pod.log(since_seconds=lookback_seconds)

Verify the exact kwarg name against the Pod.log() API in the installed ocp_resources version before landing this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/utils.py` around lines 302 - 320,
get_scheduler_decision_logs currently ignores the lookback_seconds parameter
when calling router_scheduler_pod.log(), causing stale entries to be scanned;
update the call to router_scheduler_pod.log(...) to pass the lookback_seconds
window (e.g., using the Pod.log kwarg that accepts seconds—verify whether the
installed ocp_resources.Pod.log uses since_seconds, since, or a different param
name) so the function filters logs to only the recent lookback_seconds before
parsing and returning scheduler decision entries.
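A minimal sketch of the suggested fix, assuming `Pod.log` forwards a `since_seconds` kwarg (as the underlying Kubernetes pod-log API does; verify against the installed ocp_resources version). The `"decision"` filter and the function body are hypothetical simplifications of the real helper:

```python
def get_scheduler_decision_logs(router_scheduler_pod, lookback_seconds=60):
    # Restrict the scan to recent output so stale decisions from earlier
    # requests are never matched on retry.
    raw_logs = router_scheduler_pod.log(since_seconds=lookback_seconds)
    return [line for line in raw_logs.splitlines() if "decision" in line.lower()]
```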
🧹 Nitpick comments (2)
tests/model_serving/model_server/llmd/conftest.py (1)

172-192: Unused cluster_monitoring_config parameter appears intentional for dependency ordering.

The Ruff warning (ARG001) is technically correct—the parameter isn't used directly. However, this pattern is common in pytest to ensure fixture execution order. Consider adding a comment or using pytest.mark.usefixtures at the fixture level to make intent explicit.

# Depends on cluster_monitoring_config to ensure cluster monitoring is enabled first
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/conftest.py` around lines 172 - 192,
The unused parameter cluster_monitoring_config in the fixture
llmd_user_workload_monitoring_config_map is intentional to order fixture setup;
make that explicit by either adding an inline comment above the fixture
explaining the dependency (e.g., "Depends on cluster_monitoring_config to ensure
cluster monitoring is enabled first") or annotate the fixture with
pytest.mark.usefixtures("cluster_monitoring_config") so the intent is clear and
the ARG001 warning is addressed while keeping the fixture behavior unchanged.
tests/model_serving/model_server/llmd/llmd_configs/config_base.py (1)

114-120: Consider iterable unpacking for cleaner list extension.

The list concatenation works but iterable unpacking is more idiomatic Python.

Optional refactor
     @classmethod
     def container_env(cls):
-        return super().container_env() + [
+        return [
+            *super().container_env(),
             {
                 "name": "VLLM_ADDITIONAL_ARGS",
                 "value": "--max-num-seqs 20 --max-model-len 128 --enforce-eager --ssl-ciphers ECDHE+AESGCM:DHE+AESGCM",
             },
             {"name": "VLLM_CPU_KVCACHE_SPACE", "value": "4"},
         ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_serving/model_server/llmd/llmd_configs/config_base.py` around
lines 114 - 120, Replace the list concatenation in the container_env method with
iterable unpacking: build a single list literal that first expands the iterable
returned by super().container_env() and then includes the two environment dicts
(the VLLM_ADDITIONAL_ARGS and VLLM_CPU_KVCACHE_SPACE entries); update the method
named container_env in config_base.py accordingly to use unpacking instead of
using + to concatenate lists.
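The unpacking pattern composes cleanly across the config-class hierarchy. A minimal sketch with hypothetical base/CPU classes shaped like the ones in `config_base.py` (attribute values are illustrative, not the real defaults):

```python
class LLMISvcConfig:
    @classmethod
    def container_env(cls):
        # Base env shared by every scenario.
        return [{"name": "HF_HOME", "value": "/tmp/hf"}]


class CpuConfig(LLMISvcConfig):
    @classmethod
    def container_env(cls):
        # Iterable unpacking keeps parent entries first, then appends
        # the CPU-specific overrides.
        return [
            *super().container_env(),
            {"name": "VLLM_CPU_KVCACHE_SPACE", "value": "4"},
        ]
```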
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 67-72: The TLS handling currently fails open: _resolve_ca_cert
silently returns empty string on error which causes
_curl_post/send_chat_completions to use --insecure; change this to fail closed
by making _resolve_ca_cert raise the exception (or return a clearly typed None)
when CA resolution fails, update callers (send_chat_completions and _curl_post)
to require an explicit allow_insecure flag to opt into skipping verification and
otherwise abort/propagate the error, and ensure logs clearly state that CA
resolution failed and that verification is enforced unless allow_insecure is
true.
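A fail-closed sketch of the suggested change. The names (`resolve_ca_cert`, `allow_insecure`, `CACertResolutionError`) mirror the review wording but are assumptions about the surrounding code, not the actual helpers in `utils.py`:

```python
import logging

logger = logging.getLogger(__name__)


class CACertResolutionError(RuntimeError):
    """Raised when the serving CA bundle cannot be resolved."""


def resolve_ca_cert(read_ca):
    try:
        return read_ca()
    except Exception as exc:
        # Fail closed: propagate instead of returning "" and letting
        # callers silently fall back to --insecure.
        logger.error("CA resolution failed; TLS verification is enforced")
        raise CACertResolutionError(str(exc)) from exc


def build_curl_tls_args(read_ca, allow_insecure=False):
    # Skipping verification requires an explicit opt-in.
    if allow_insecure:
        return ["--insecure"]
    return ["--cacert", resolve_ca_cert(read_ca)]
```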

---

Duplicate comments:
In `@tests/model_serving/model_server/llmd/utils.py`:
- Around line 302-320: get_scheduler_decision_logs currently ignores the
lookback_seconds parameter when calling router_scheduler_pod.log(), causing
stale entries to be scanned; update the call to router_scheduler_pod.log(...) to
pass the lookback_seconds window (e.g., using the Pod.log kwarg that accepts
seconds—verify whether the installed ocp_resources.Pod.log uses since_seconds,
since, or a different param name) so the function filters logs to only the
recent lookback_seconds before parsing and returning scheduler decision entries.

---

Nitpick comments:
In `@tests/model_serving/model_server/llmd/conftest.py`:
- Around line 172-192: The unused parameter cluster_monitoring_config in the
fixture llmd_user_workload_monitoring_config_map is intentional to order fixture
setup; make that explicit by either adding an inline comment above the fixture
explaining the dependency (e.g., "Depends on cluster_monitoring_config to ensure
cluster monitoring is enabled first") or annotate the fixture with
pytest.mark.usefixtures("cluster_monitoring_config") so the intent is clear and
the ARG001 warning is addressed while keeping the fixture behavior unchanged.

In `@tests/model_serving/model_server/llmd/llmd_configs/config_base.py`:
- Around line 114-120: Replace the list concatenation in the container_env
method with iterable unpacking: build a single list literal that first expands
the iterable returned by super().container_env() and then includes the two
environment dicts (the VLLM_ADDITIONAL_ARGS and VLLM_CPU_KVCACHE_SPACE entries);
update the method named container_env in config_base.py accordingly to use
unpacking instead of using + to concatenate lists.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 533e9df8-5ac0-43a2-bc6e-dc9dd908026c

📥 Commits

Reviewing files that changed from the base of the PR and between d7c2546 and 63427a1.

📒 Files selected for processing (26)
  • tests/model_serving/model_server/llmd/__init__.py
  • tests/model_serving/model_server/llmd/conftest.py
  • tests/model_serving/model_server/llmd/constants.py
  • tests/model_serving/model_server/llmd/kueue/__init__.py
  • tests/model_serving/model_server/llmd/llmd_configs/__init__.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_base.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_models.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_precise_prefix_cache.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_prefill_decode.py
  • tests/model_serving/model_server/llmd/test_llmd_auth.py
  • tests/model_serving/model_server/llmd/test_llmd_connection_cpu.py
  • tests/model_serving/model_server/llmd/test_llmd_connection_gpu.py
  • tests/model_serving/model_server/llmd/test_llmd_kueue_integration.py
  • tests/model_serving/model_server/llmd/test_llmd_no_scheduler.py
  • tests/model_serving/model_server/llmd/test_llmd_oci_cpu.py
  • tests/model_serving/model_server/llmd/test_llmd_prefill_decode.py
  • tests/model_serving/model_server/llmd/test_llmd_s3.py
  • tests/model_serving/model_server/llmd/test_llmd_s3_gpu.py
  • tests/model_serving/model_server/llmd/test_llmd_singlenode_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/test_llmd_singlenode_precise_prefix_cache.py
  • tests/model_serving/model_server/llmd/test_llmd_smoke.py
  • tests/model_serving/model_server/llmd/test_singlenode_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/utils.py
  • utilities/constants.py
  • utilities/llmd_constants.py
💤 Files with no reviewable changes (5)
  • tests/model_serving/model_server/llmd/test_llmd_oci_cpu.py
  • tests/model_serving/model_server/llmd/constants.py
  • tests/model_serving/model_server/llmd/test_llmd_s3.py
  • tests/model_serving/model_server/llmd/test_singlenode_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/test_llmd_s3_gpu.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • tests/model_serving/model_server/llmd/test_llmd_smoke.py
  • tests/model_serving/model_server/llmd/test_llmd_no_scheduler.py
  • tests/model_serving/model_server/llmd/__init__.py
  • tests/model_serving/model_server/llmd/llmd_configs/config_prefill_decode.py
  • utilities/constants.py
  • tests/model_serving/model_server/llmd/test_llmd_connection_cpu.py
  • tests/model_serving/model_server/llmd/test_llmd_singlenode_estimated_prefix_cache.py
  • tests/model_serving/model_server/llmd/test_llmd_prefill_decode.py
  • tests/model_serving/model_server/llmd/test_llmd_auth.py

@mwaykole
Member

/lgtm

mwaykole
mwaykole previously approved these changes Mar 11, 2026
@mwaykole mwaykole enabled auto-merge (squash) March 11, 2026 12:01
@mwaykole mwaykole merged commit 5fe569b into opendatahub-io:main Mar 11, 2026
10 checks passed
@github-actions

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.
