
[RHOAIENG-34594] Add test case for singlenode with estimated prefix cache #907

Merged
threcc merged 16 commits into opendatahub-io:main from threcc:singlenode-precise-prefix
Dec 10, 2025

Conversation


@threcc threcc commented Dec 3, 2025

Description

Add a new test case to cover the single-node estimated prefix cache scenario; see the discussion in #907 (comment).

Jira issue tracker: https://issues.redhat.com/browse/RHOAIENG-34594

How Has This Been Tested?

Ran the test locally against a cluster with a GPU (Tesla T4).

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • Tests
    • Added end-to-end tests validating single-node estimated prefix-cache routing, request distribution, and GPU-aware skipping.
    • Added fixtures for authenticated access and single-node cache-enabled service setup.
    • Introduced test helpers to discover service pods, send authenticated test requests, and verify Prometheus metrics (request counts and cache hit rate).
    • Added constants for health checks and prefix-cache configuration used by the tests.



github-actions bot commented Dec 3, 2025

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove it, comment /wip cancel on the PR.
  • To block merging of a PR, add /hold in a comment. To un-block merging, comment /hold cancel.
  • To mark a PR as approved, add /lgtm in a comment. To remove, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified on the PR; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, add /build-push-pr-image in a comment. This creates an image tagged
    pr-<pr_number> in the quay repository. The tag is deleted when the PR is merged or closed.
Supported labels

{'/lgtm', '/build-push-pr-image', '/verified', '/cherry-pick', '/wip', '/hold'}

@threcc threcc force-pushed the singlenode-precise-prefix branch 3 times, most recently from 6222ea4 to 724334f on December 4, 2025 11:18
@github-actions github-actions bot added size/xxl and removed size/xl labels Dec 4, 2025
@threcc threcc marked this pull request as ready for review December 4, 2025 11:19

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

Walkthrough

Adds LLMD prefix-cache support and tests: new constants and liveness probe, two pytest fixtures, Prometheus-aware LLMD test utilities, and an integration test validating single-node estimated prefix-cache routing and metrics.

Changes

Cohort / File(s) Summary
Fixtures & Config Constants
tests/model_serving/model_server/llmd/conftest.py, tests/model_serving/model_server/llmd/constants.py
Added singlenode_estimated_prefix_cache and authenticated_llmisvc_token fixtures; introduced LLMD_LIVENESS_PROBE, PREFIX_CACHE_BLOCK_SIZE, PREFIX_CACHE_HASH_ALGO, PREFIX_CACHE_HASH_SEED, and ROUTER_SCHEDULER_CONFIG_ESTIMATED_PREFIX_CACHE.
Testing Utilities
tests/model_serving/model_server/llmd/utils.py
Added Prometheus-aware helpers: get_llmd_workload_pods(), get_llmd_router_scheduler_pod(), send_prefix_cache_test_requests(), get_metrics_request_count_per_pod(), get_metrics_prefix_cache_hit_rate(), and verify_estimated_prefix_cache_metrics() plus related imports. Note: similar utility blocks appear duplicated in the file.
Integration Test
tests/model_serving/model_server/llmd/test_singlenode_estimated_prefix_cache.py
New test class TestSingleNodeEstimatedPrefixCache with test_singlenode_estimated_prefix_cache validating gateway/LLM service readiness, router-scheduler and workload pod counts, request dispatch with authenticated token, and Prometheus-based metric verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Verify Prometheus queries, metric names, and time/window parameters in utils.py.
  • Inspect and deduplicate repeated utility blocks in utils.py.
  • Review fixture logic and dynamic token resolution in conftest.py.
  • Check pod selection/readiness logic and potential race conditions in the integration test.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly matches the main change: adding a test case for single-node with estimated prefix cache functionality.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (8)
tests/model_serving/model_server/llmd/utils.py (4)

197-223: Code duplication with get_llmd_pods inner function.

The logic in get_llmd_workload_pods duplicates the get_llmd_pods inner function defined at lines 112-126 within verify_llmd_no_failed_pods. Consider extracting and reusing get_llmd_workload_pods in verify_llmd_no_failed_pods to eliminate duplication.

 def verify_llmd_no_failed_pods(
     client: DynamicClient,
     llm_service: LLMInferenceService,
     timeout: int = 300,
 ) -> None:
     # ... docstring ...
     from utilities.exceptions import FailedPodsError
     from ocp_resources.resource import Resource

     LOGGER.info(f"Comprehensive health check for LLMInferenceService {llm_service.name}")

     container_wait_base_errors = ["InvalidImageName", "CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"]
     container_terminated_base_errors = [Resource.Status.ERROR, "CrashLoopBackOff"]

-    def get_llmd_pods():
-        """Get LLMD workload pods for this LLMInferenceService."""
-        pods = []
-        for pod in Pod.get(
-            dyn_client=client,
-            namespace=llm_service.namespace,
-            label_selector=(
-                f"{Pod.ApiGroup.APP_KUBERNETES_IO}/part-of=llminferenceservice,"
-                f"{Pod.ApiGroup.APP_KUBERNETES_IO}/name={llm_service.name}"
-            ),
-        ):
-            labels = pod.instance.metadata.get("labels", {})
-            if labels.get("kserve.io/component") == "workload":
-                pods.append(pod)
-        return pods
+    def get_llmd_pods():
+        return get_llmd_workload_pods(client=client, llmisvc=llm_service)

     for pods in TimeoutSampler(

264-287: Narrow exception handling and make log window configurable.

  1. The bare Exception catch loses valuable diagnostic context. Consider catching specific exceptions (e.g., kubernetes.client.exceptions.ApiException) or at least logging the exception type.

  2. The hardcoded since_seconds=120 may be insufficient in slower test environments.

-def count_chat_completions_requests_in_pod(pod: Pod) -> int:
+def count_chat_completions_requests_in_pod(pod: Pod, since_seconds: int = 120) -> int:
     """
     Count POST /v1/chat/completions requests in pod logs.

     Args:
         pod: The vLLM workload pod to check
+        since_seconds: Time window for log retrieval (default: 120)

     Returns:
         Number of successful chat completion requests found in logs
     """
     try:
-        logs = pod.log(container="main", since_seconds=120)
+        logs = pod.log(container="main", since_seconds=since_seconds)

         # Match: "POST /v1/chat/completions HTTP/1.1" 200
         pattern = r"POST /v1/chat/completions HTTP/1.1.*200"
         matches = re.findall(pattern, logs)
         # ...
-    except Exception as e:
-        LOGGER.info(f"Failed to count requests for pod {pod.name}: {e}")
+    except Exception as e:  # noqa: BLE001
+        LOGGER.warning(f"Failed to count requests for pod {pod.name}: {type(e).__name__}: {e}")
         return 0

290-323: Remove unused parameters and address potential race condition.

  1. The query and timestamp_before parameters are documented as unused and kept for "signature compatibility." If no callers use them, remove them to avoid confusion.

  2. The hardcoded time.sleep(5) could be insufficient for log propagation in some environments. Consider making it configurable.

  3. If multiple pods show increased request counts (possible in race conditions), only the first is returned without warning.

 def get_pod_that_handled_request(
     workload_pods: list[Pod],
-    query: str,
-    timestamp_before: float,
     baseline_counts: dict[str, int],
+    wait_seconds: float = 5.0,
 ) -> str | None:
     """
     Determine which pod handled a request by counting POST requests.

     Args:
         workload_pods: List of vLLM workload pods
-        query: Not used (kept for signature compatibility)
-        timestamp_before: Not used (kept for signature compatibility)
         baseline_counts: Dict of {pod_name: request_count} before this request
+        wait_seconds: Time to wait for logs to propagate (default: 5.0)

     Returns:
         Pod name that handled the request, or None if not found
     """
-    time.sleep(5)
+    time.sleep(wait_seconds)

     current_counts = {}
     for pod in workload_pods:
         current_counts[pod.name] = count_chat_completions_requests_in_pod(pod)

+    handling_pods = []
     for pod in workload_pods:
         baseline = baseline_counts.get(pod.name, 0)
         current = current_counts.get(pod.name, 0)

         if current > baseline:
             LOGGER.info(f"Pod {pod.name} handled request: {baseline} -> {current} (+{current - baseline})")
-            return pod.name
+            handling_pods.append(pod.name)
+
+    if len(handling_pods) == 1:
+        return handling_pods[0]
+    elif len(handling_pods) > 1:
+        LOGGER.warning(f"Multiple pods show increased counts: {handling_pods}, returning first")
+        return handling_pods[0]

     LOGGER.warning("Could not determine which pod handled request")
     return None

361-393: Rename unused loop variable and consider extracting repeated logic.

  1. The loop variable i is unused; rename to _ per Python convention.

  2. The inference request logic is duplicated between Phase 1 (lines 362-388) and Phase 2 (lines 410-436). Consider extracting a helper function.

-    for i in range(3):
+    for _ in range(3):
         inference_config = {

Optionally, extract the repeated inference/tracking logic:

def _send_and_track_request(
    llmisvc: LLMInferenceService,
    prompt: str,
    token: str,
    workload_pods: list[Pod],
    baseline_counts: dict[str, int],
) -> str | None:
    """Send inference request and track which pod handled it."""
    inference_config = {
        "default_query_model": {
            "query_input": prompt,
            "query_output": r".*",
            "use_regex": True,
        },
        "chat_completions": TINYLLAMA_INFERENCE_CONFIG["chat_completions"],
    }
    verify_inference_response_llmd(
        llm_service=llmisvc,
        inference_config=inference_config,
        inference_type="chat_completions",
        protocol=Protocols.HTTPS,
        use_default_query=True,
        insecure=False,
        model_name=llmisvc.instance.spec.model.name,
        token=token,
        authorized_user=True,
    )
    return get_pod_that_handled_request(workload_pods, baseline_counts)
tests/model_serving/model_server/llmd/conftest.py (2)

372-390: Complex environment variable string is fragile.

The VLLM_ADDITIONAL_ARGS value contains embedded JSON with escaped quotes and Go templates. While functional, this is difficult to maintain and debug.

Consider constructing the JSON programmatically and using json.dumps() for reliability:

import json

kv_transfer_config = json.dumps({"kv_connector": "NixlConnector", "kv_role": "kv_both"})
kv_events_config_template = json.dumps({
    "enable_kv_cache_events": True,
    "publisher": "zmq",
    "endpoint": "tcp://{{ ChildName .ObjectMeta.Name `-epp-service` }}:5557",
    "topic": "kv@${POD_IP}@${MODEL_NAME}",
})

vllm_args = (
    f"--prefix-caching-hash-algo {PREFIX_CACHE_HASH_ALGO} "
    f"--block-size {PREFIX_CACHE_BLOCK_SIZE} "
    f"--kv_transfer_config '{kv_transfer_config}' "
    f"--kv-events-config '{kv_events_config_template}'"
)

161-167: Use LLMD_LIVENESS_PROBE constant consistently.

The llmd_inference_service_gpu fixture (lines 161-167) defines the liveness probe inline with identical values to LLMD_LIVENESS_PROBE. For consistency and DRY, use the constant in both places.

+from tests.model_serving.model_server.llmd.constants import LLMD_LIVENESS_PROBE
 # ... in llmd_inference_service_gpu fixture ...
-    liveness_probe = {
-        "httpGet": {"path": "/health", "port": 8000, "scheme": "HTTPS"},
-        "initialDelaySeconds": 120,
-        "periodSeconds": 30,
-        "timeoutSeconds": 30,
-        "failureThreshold": 5,
-    }
+    liveness_probe = LLMD_LIVENESS_PROBE

Also applies to: 391-391

tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (2)

14-25: Move module docstring to the top of the file.

The module docstring is placed after imports (lines 14-25), which is unconventional. Per PEP 257, module docstrings should be the first statement in the file.

+"""
+Test Single-Node Precise Prefix Caching.
+
+This test verifies that the LLM-D router correctly routes inference requests
+based on cache state, maximizing prefix cache hits.
+
+Test configuration:
+- LLMInferenceService with 2 replicas and router enabled
+- Authentication enabled
+- Verify router pod and vLLM pods are running
+- Send multiple requests with shared prefixes and size greater than PREFIX_CACHE_BLOCK_SIZE
+"""
+
 import pytest
 from kubernetes.dynamic import DynamicClient
 from ocp_resources.llm_inference_service import LLMInferenceService

 from tests.model_serving.model_server.llmd.utils import (
     get_llmd_router_scheduler_pod,
     get_llmd_workload_pods,
     verify_gateway_status,
     verify_llm_service_status,
     verify_singlenode_prefix_cache_routing,
 )
 from simple_logger.logger import get_logger

-"""
-Test Single-Node Precise Prefix Caching.
-...
-"""
-
 LOGGER = get_logger(name=__name__)

41-61: Add comment explaining llmd_gateway dependency and consider instance attribute.

  1. The llmd_gateway parameter ensures the gateway fixture runs first. Add a comment for clarity.

  2. Storing the token as a class attribute (TestSingleNodePrecisePrefixCache.auth_token) could cause issues if tests run in parallel with pytest-xdist. Consider using a class-scoped fixture that yields the token instead.

     @pytest.fixture(scope="class", autouse=True)
     def setup_auth(
         self,
-        llmd_gateway,
+        llmd_gateway,  # Required: ensures gateway is ready before auth setup
         singlenode_precise_prefix_cache,
         llmd_s3_service_account,
         llmisvc_auth_token,
         llmisvc_auth_view_role,
         llmisvc_auth_role_binding,
     ):
         """Set up authentication for single-node prefix cache test."""
         # Create token with RBAC resources using factory fixtures
         token = llmisvc_auth_token(
             service_account=llmd_s3_service_account,
             llmisvc=singlenode_precise_prefix_cache,
             view_role_factory=llmisvc_auth_view_role,
             role_binding_factory=llmisvc_auth_role_binding,
         )

-        # Store token as class attribute for use in tests
-        TestSingleNodePrecisePrefixCache.auth_token = token
+        # Store on instance - safer if tests run in parallel
+        self.__class__.auth_token = token
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23e2790 and 28f7e15.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • tests/model_serving/model_server/llmd/conftest.py (3 hunks)
  • tests/model_serving/model_server/llmd/constants.py (1 hunks)
  • tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1 hunks)
  • tests/model_serving/model_server/llmd/utils.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/model_serving/model_server/llmd/utils.py (3)
utilities/constants.py (1)
  • Protocols (97-104)
utilities/exceptions.py (1)
  • PodContainersRestartError (109-110)
utilities/llmd_utils.py (1)
  • verify_inference_response_llmd (393-454)
tests/model_serving/model_server/llmd/conftest.py (2)
utilities/llmd_utils.py (1)
  • create_llmisvc (149-341)
utilities/constants.py (2)
  • ResourceLimits (241-263)
  • GPU (252-263)
tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (2)
tests/model_serving/model_server/llmd/utils.py (5)
  • get_llmd_router_scheduler_pod (226-251)
  • get_llmd_workload_pods (197-223)
  • verify_gateway_status (28-49)
  • verify_llm_service_status (52-73)
  • verify_singlenode_prefix_cache_routing (326-443)
tests/conftest.py (1)
  • gpu_count_on_cluster (793-814)
🪛 Ruff (0.14.7)
tests/model_serving/model_server/llmd/utils.py

285-285: Do not catch blind exception: Exception

(BLE001)


292-292: Unused function argument: query

(ARG001)


293-293: Unused function argument: timestamp_before

(ARG001)


362-362: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

tests/model_serving/model_server/llmd/conftest.py

347-347: Unused function argument: llmd_s3_secret

(ARG001)


349-349: Unused function argument: llmd_gateway

(ARG001)

tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py

44-44: Unused method argument: llmd_gateway

(ARG002)

🔇 Additional comments (3)
tests/model_serving/model_server/llmd/conftest.py (1)

343-350: Fixture dependencies are intentional for ordering.

The llmd_s3_secret and llmd_gateway parameters appear unused but are required to ensure proper fixture ordering — llmd_s3_secret must exist before llmd_s3_service_account is used, and llmd_gateway must be created before the service.

Consider adding brief comments to clarify this intent for future maintainers.

 @pytest.fixture(scope="class")
 def singlenode_precise_prefix_cache(
     admin_client: DynamicClient,
     unprivileged_model_namespace: Namespace,
-    llmd_s3_secret: Secret,
+    llmd_s3_secret: Secret,  # Required: creates S3 secret before service account
     llmd_s3_service_account: ServiceAccount,
-    llmd_gateway,
+    llmd_gateway,  # Required: ensures gateway is ready before service creation
 ) -> Generator[LLMInferenceService, None, None]:
tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1)

63-92: LGTM! Well-structured test with clear preconditions.

The test properly:

  • Skips when insufficient GPUs are available
  • Validates infrastructure readiness before routing tests
  • Uses descriptive assertion messages
  • Delegates complex routing validation to a helper function

One minor suggestion: the llmd_gateway parameter appears unused in the method signature but is likely needed for fixture ordering. Add a brief comment or use _ prefix convention.

     def test_singlenode_precise_prefix_cache(
         self,
         unprivileged_client: DynamicClient,
-        llmd_gateway,
+        llmd_gateway,  # noqa: ARG002 - Required for fixture ordering
         singlenode_precise_prefix_cache: LLMInferenceService,
         gpu_count_on_cluster: int,
     ):
tests/model_serving/model_server/llmd/constants.py (1)

1-44: LGTM! Well-organized constants.

The configuration constants are clearly defined and appropriately grouped. The liveness probe settings and scheduler configuration follow the expected API patterns. The PREFIX_CACHE_HASH_SEED is correctly defined as a string "42", which aligns with vLLM's expectation for hash seed values (read from the PYTHONHASHSEED environment variable).
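The constants the review signs off on can be pictured with a short sketch. Only the probe values (shown in the diffs above) and the hash seed being the string "42" are stated in this PR; the block-size and hash-algorithm values below are illustrative assumptions, not the actual contents of constants.py:

```python
# Hypothetical sketch of tests/model_serving/model_server/llmd/constants.py.
# Assumed values: PREFIX_CACHE_BLOCK_SIZE and PREFIX_CACHE_HASH_ALGO are
# placeholders; the probe dict and the string seed "42" come from the review.
PREFIX_CACHE_BLOCK_SIZE = 64          # assumed vLLM --block-size value
PREFIX_CACHE_HASH_ALGO = "sha256"     # assumed --prefix-caching-hash-algo value
PREFIX_CACHE_HASH_SEED = "42"         # a string, not an int: vLLM reads it via PYTHONHASHSEED

LLMD_LIVENESS_PROBE = {
    "httpGet": {"path": "/health", "port": 8000, "scheme": "HTTPS"},
    "initialDelaySeconds": 120,
    "periodSeconds": 30,
    "timeoutSeconds": 30,
    "failureThreshold": 5,
}
```

Keeping the seed as a string avoids an implicit int-to-str conversion when it is injected into the pod environment.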

mwaykole previously approved these changes Dec 4, 2025

@mwaykole mwaykole left a comment


/lgtm


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
tests/model_serving/model_server/llmd/conftest.py (1)

343-441: singlenode_precise_prefix_cache fixture: clean overall; drop redundant secret arg and make gateway dependency explicit

The fixture wiring into create_llmisvc looks solid and matches the existing LLMD patterns. Two small cleanups:

  • llmd_s3_secret is unused here and redundant because llmd_s3_service_account already depends on it; you can safely drop it from the signature.
  • llmd_gateway is intentionally unused but forces the gateway fixture to run; it’s worth touching it explicitly so this is clear and to satisfy ARG001.

For example:

 @pytest.fixture(scope="class")
 def singlenode_precise_prefix_cache(
     admin_client: DynamicClient,
     unprivileged_model_namespace: Namespace,
-    llmd_s3_secret: Secret,
     llmd_s3_service_account: ServiceAccount,
     llmd_gateway,
 ) -> Generator[LLMInferenceService, None, None]:
     """LLMInferenceService fixture for single-node precise prefix cache test."""
 
+    # Ensure llmd_gateway fixture is instantiated even though we don't use it directly here
+    _ = llmd_gateway
+
     with create_llmisvc(
         client=admin_client,
tests/model_serving/model_server/llmd/utils.py (3)

254-287: Broad exception handling in count_chat_completions_requests_in_pod

Catching a blanket Exception here hides all failures (including programming errors) and just returns 0, which could make debugging routing issues harder.

Given this is test-only code and log scraping is inherently flaky, two options:

  • Narrow the exception to the known pod/log retrieval errors, or
  • Keep the broad catch but explicitly mark it as intentional for linting:
-    except Exception as e:
-        LOGGER.info(f"Failed to count requests for pod {pod.name}: {e}")
-        return 0
+    except Exception as e:  # noqa: BLE001 - tolerate log scraping failures in tests
+        LOGGER.info(f"Failed to count requests for pod {pod.name}: {e}")
+        return 0

361-363: Rename unused loop variable in Phase 1 for clarity

The loop index isn’t used; renaming it to _ makes that explicit and satisfies linting:

-    phase1_pods = []
-    for i in range(3):
+    phase1_pods = []
+    for _ in range(3):

197-223: Pod selection logic duplicated with verify_llmd_no_failed_pods helper

get_llmd_workload_pods uses the same label selector and filtering as the inner get_llmd_pods inside verify_llmd_no_failed_pods above. To keep things DRY and avoid future divergence, consider refactoring verify_llmd_no_failed_pods to call this new helper instead of maintaining two copies of the same logic.

No behavior change is required for this PR; this is a good follow-up cleanup target.

tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1)

41-61: Clarify intentional use of llmd_gateway in setup_auth to satisfy lint

llmd_gateway in setup_auth isn’t referenced, but it’s clearly there to ensure the gateway fixture is instantiated before creating the auth token and service. To make that intent explicit and silence ARG002, you can touch it:

     def setup_auth(
         self,
         llmd_gateway,
         singlenode_precise_prefix_cache,
@@
     ):
         """Set up authentication for single-node prefix cache test."""
-        # Create token with RBAC resources using factory fixtures
+        # Force llmd_gateway fixture instantiation even though we don't use it directly
+        _ = llmd_gateway
+
+        # Create token with RBAC resources using factory fixtures
         token = llmisvc_auth_token(
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 28f7e15 and 159fd07.

📒 Files selected for processing (4)
  • tests/model_serving/model_server/llmd/conftest.py (3 hunks)
  • tests/model_serving/model_server/llmd/constants.py (1 hunks)
  • tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1 hunks)
  • tests/model_serving/model_server/llmd/utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/model_serving/model_server/llmd/constants.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/model_serving/model_server/llmd/utils.py (2)
utilities/constants.py (1)
  • Protocols (97-104)
utilities/llmd_utils.py (1)
  • verify_inference_response_llmd (393-454)
tests/model_serving/model_server/llmd/conftest.py (3)
utilities/llmd_utils.py (1)
  • create_llmisvc (149-341)
utilities/constants.py (4)
  • ModelStorage (289-306)
  • ResourceLimits (241-263)
  • GPU (252-263)
  • Timeout (227-238)
utilities/llmd_constants.py (2)
  • ModelStorage (30-37)
  • ModelNames (46-50)
🪛 Ruff (0.14.7)
tests/model_serving/model_server/llmd/utils.py

285-285: Do not catch blind exception: Exception

(BLE001)


292-292: Unused function argument: query

(ARG001)


293-293: Unused function argument: timestamp_before

(ARG001)


362-362: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

tests/model_serving/model_server/llmd/conftest.py

347-347: Unused function argument: llmd_s3_secret

(ARG001)


349-349: Unused function argument: llmd_gateway

(ARG001)

tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py

44-44: Unused method argument: llmd_gateway

(ARG002)

@threcc threcc disabled auto-merge December 4, 2025 13:19

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1)

41-61: Consider yielding the token from the fixture instead of using a class attribute.

Storing the token as a class attribute (TestSingleNodePrecisePrefixCache.auth_token) is a less idiomatic pytest pattern. Consider yielding the token from a separate fixture so tests can access it directly via fixture injection. This improves test isolation and follows pytest conventions.

Example approach:

@pytest.fixture(scope="class")
def prefix_cache_auth_token(
    llmd_gateway,
    singlenode_precise_prefix_cache,
    llmd_s3_service_account,
    llmisvc_auth_token,
    llmisvc_auth_view_role,
    llmisvc_auth_role_binding,
):
    """Set up authentication and yield token for single-node prefix cache test."""
    token = llmisvc_auth_token(
        service_account=llmd_s3_service_account,
        llmisvc=singlenode_precise_prefix_cache,
        view_role_factory=llmisvc_auth_view_role,
        role_binding_factory=llmisvc_auth_role_binding,
    )
    yield token

Then inject prefix_cache_auth_token directly into the test method.

tests/model_serving/model_server/llmd/utils.py (1)

301-301: Fix time.sleep keyword argument - will cause TypeError.

time.sleep() does not accept keyword arguments. This will raise TypeError: sleep() takes no keyword arguments at runtime.

-    time.sleep(secs=5)
+    time.sleep(5)  # Allow logs to flush before counting requests
🧹 Nitpick comments (4)
tests/model_serving/model_server/llmd/conftest.py (1)

161-167: Consider reusing LLMD_LIVENESS_PROBE constant for consistency.

This inline liveness probe definition is identical to the new LLMD_LIVENESS_PROBE constant in constants.py. Consider using the constant here to reduce duplication and ensure consistency.

+from tests.model_serving.model_server.llmd.constants import LLMD_LIVENESS_PROBE
 ...
-    liveness_probe = {
-        "httpGet": {"path": "/health", "port": 8000, "scheme": "HTTPS"},
-        "initialDelaySeconds": 120,
-        "periodSeconds": 30,
-        "timeoutSeconds": 30,
-        "failureThreshold": 5,
-    }
+    liveness_probe = LLMD_LIVENESS_PROBE
tests/model_serving/model_server/llmd/utils.py (3)

285-287: Consider catching more specific exceptions.

While the broad exception handling provides resilience for this log-counting helper, consider catching more specific exceptions (e.g., kubernetes.client.exceptions.ApiException) to avoid masking unexpected errors during debugging.


355-355: Rename unused loop variable.

The loop variable i is not used. Rename to _ to indicate intentional non-use.

-    for i in range(3):
+    for _ in range(3):

112-126: Refactor to use get_llmd_workload_pods utility.

The internal get_llmd_pods() function duplicates the logic of the new get_llmd_workload_pods() utility (lines 197-223). Consider refactoring to use the shared utility.

     def get_llmd_pods():
         """Get LLMD workload pods for this LLMInferenceService."""
-        pods = []
-        for pod in Pod.get(
-            dyn_client=client,
-            namespace=llm_service.namespace,
-            label_selector=(
-                f"{Pod.ApiGroup.APP_KUBERNETES_IO}/part-of=llminferenceservice,"
-                f"{Pod.ApiGroup.APP_KUBERNETES_IO}/name={llm_service.name}"
-            ),
-        ):
-            labels = pod.instance.metadata.get("labels", {})
-            if labels.get("kserve.io/component") == "workload":
-                pods.append(pod)
-        return pods
+        return get_llmd_workload_pods(client=client, llmisvc=llm_service)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 159fd07 and 12c2b2f.

📒 Files selected for processing (4)
  • tests/model_serving/model_server/llmd/conftest.py (3 hunks)
  • tests/model_serving/model_server/llmd/constants.py (1 hunks)
  • tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1 hunks)
  • tests/model_serving/model_server/llmd/utils.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/model_serving/model_server/llmd/utils.py (4)
utilities/constants.py (1)
  • Protocols (97-104)
utilities/exceptions.py (1)
  • PodContainersRestartError (109-110)
utilities/llmd_utils.py (1)
  • verify_inference_response_llmd (393-454)
tests/model_serving/conftest.py (1)
  • protocol (26-36)
tests/model_serving/model_server/llmd/conftest.py (1)
utilities/llmd_utils.py (1)
  • create_llmisvc (149-341)
🪛 Ruff (0.14.7)
tests/model_serving/model_server/llmd/utils.py

285-285: Do not catch blind exception: Exception

(BLE001)


355-355: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

tests/model_serving/model_server/llmd/conftest.py

347-347: Unused function argument: llmd_s3_secret

(ARG001)


349-349: Unused function argument: llmd_gateway

(ARG001)

tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py

44-44: Unused method argument: llmd_gateway

(ARG002)

🔇 Additional comments (4)
tests/model_serving/model_server/llmd/conftest.py (1)

343-350: Fixture dependency parameters are correctly used for ordering.

The llmd_s3_secret and llmd_gateway parameters are intentionally declared to ensure proper fixture ordering/dependency resolution even though they're not directly referenced in the function body. This is a valid pytest pattern. Consider adding a brief comment to clarify this intent for future maintainers.
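The pattern can be sketched roughly like this (fixture names match the review, but the bodies are illustrative stand-ins, not the repo's real fixtures):

```python
# Hedged sketch of the fixture-ordering pattern; the bodies are hypothetical.
import pytest


@pytest.fixture
def llmd_s3_secret():
    # Stand-in: would create the S3 secret the model is pulled from.
    yield "llmd-s3-secret"


@pytest.fixture
def llmd_gateway():
    # Stand-in: would provision the inference gateway.
    yield "llmd-gateway"


@pytest.fixture
def single_node_llm_service(llmd_s3_secret, llmd_gateway):
    # llmd_s3_secret and llmd_gateway are requested purely for ordering:
    # pytest guarantees both are set up before this fixture runs and torn
    # down after it, even though their values are never read here.
    yield "single-node-llm-service"
```

A one-line comment like the ones above is usually enough to stop a future cleanup from "fixing" the apparently unused parameters.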

tests/model_serving/model_server/llmd/constants.py (1)

1-44: Well-organized constants module.

The constants are clearly documented and properly structured. The scheduler configuration correctly references the prefix cache parameters for consistency.

tests/model_serving/model_server/llmd/test_singlenode_precise_prefix_cache.py (1)

63-94: Integration test structure is acceptable, though could be modularized.

While a previous reviewer suggested splitting into smaller tests, the current structure is reasonable for an end-to-end integration test where phases are sequential and interdependent. The assertions at each step provide clear failure points.

tests/model_serving/model_server/llmd/utils.py (1)

319-439: Well-structured routing verification with clear test phases.

The function properly validates prefix cache routing behavior:

  • Phase 1 verifies repeated identical prompts route to the same pod (full cache hit)
  • Phase 2 verifies prompts with shared prefixes route together (partial cache hit)

The baseline counting approach for request attribution is reasonable given the log-based detection method.
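The baseline-delta attribution described above can be sketched as follows (pod names and counts are made up for illustration):

```python
# Hypothetical sketch of the two-phase check; not the repo's real helper.

def requests_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Attribute new requests to pods by subtracting a baseline snapshot."""
    return {pod: after.get(pod, 0) - before.get(pod, 0) for pod in after}


# Phase 1: three identical prompts should all land on one pod (full cache hit).
baseline = {"pod-a": 10, "pod-b": 7}
after_phase1 = {"pod-a": 13, "pod-b": 7}
delta = requests_delta(baseline, after_phase1)
assert [pod for pod, n in delta.items() if n == 3] == ["pod-a"]

# Phase 2: prompts sharing a long prefix should follow the same pod (partial hit).
after_phase2 = {"pod-a": 16, "pod-b": 7}
delta2 = requests_delta(after_phase1, after_phase2)
assert max(delta2, key=delta2.get) == "pod-a"
```

Snapshotting before each phase keeps earlier traffic from polluting the attribution of the current phase's requests.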

@threcc threcc force-pushed the singlenode-precise-prefix branch from 12c2b2f to 3c86b86 Compare December 4, 2025 15:21
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

401-402: Replace time.sleep() with TimeoutSampler and remove invalid noqa directive.

A past reviewer requested replacing the fixed sleep with TimeoutSampler to poll for metrics availability. The noqa: FCN001 directive is also invalid and should be removed.

Based on past review comments, use TimeoutSampler to poll Prometheus until metrics are available rather than blindly sleeping:

-    # Wait for Prometheus to scrape metrics
-    LOGGER.info("Waiting 30 seconds for Prometheus to scrape metrics")
-    time.sleep(30)  # noqa: FCN001
+    # Poll Prometheus for metrics availability
+    LOGGER.info("Waiting for Prometheus to scrape metrics")
+    for sample in TimeoutSampler(
+        wait_timeout=60,
+        sleep=5,
+        func=lambda: get_metrics_request_count_per_pod(prometheus, llmisvc, workload_pods),
+    ):
+        # Stop once non-zero request counts are visible, using the yielded
+        # sample instead of querying Prometheus a second time
+        if sample and sum(sample.values()) > 0:
+            break
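For reference, the poll-until-truthy idea behind TimeoutSampler can be shown with a stdlib-only stand-in (the scrape simulation and helper names below are illustrative, not the repo's utilities):

```python
import time


def poll_until(condition, timeout: float = 60.0, interval: float = 5.0):
    """Minimal stdlib stand-in for TimeoutSampler: call condition() every
    `interval` seconds until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")


# Simulated scrapes: per-pod request counts only appear on the third poll.
samples = iter([{}, {"pod-a": 0}, {"pod-a": 3}])


def nonzero_counts():
    counts = next(samples)
    return counts if sum(counts.values()) > 0 else None


result = poll_until(nonzero_counts, timeout=1.0, interval=0.0)
```

Unlike a fixed sleep, this returns as soon as the metric is visible and fails loudly with a timeout instead of silently proceeding on stale data.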
🧹 Nitpick comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

419-419: Prefer next() over list comprehension with index.

Using next() is more idiomatic, avoids materializing an intermediate list, and lets you supply a default (or raise a descriptive error) when no pod matches, rather than an opaque IndexError.

Apply this diff:

-    active_pod = [name for name, count in pods_request_counts.items() if count == expected_requests][0]
+    active_pod = next(name for name, count in pods_request_counts.items() if count == expected_requests)
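A minimal illustration of the suggested pattern (the dict contents are made up):

```python
# Illustrative data, not real pod metrics.
pods_request_counts = {"pod-a": 3, "pod-b": 0, "pod-c": 0}
expected_requests = 3

# next() stops at the first match; the optional default makes "no match"
# explicit instead of surfacing as an opaque IndexError from [...][0].
active_pod = next(
    (name for name, count in pods_request_counts.items() if count == expected_requests),
    None,
)
assert active_pod == "pod-a"
```

Without the default, a missing match raises StopIteration, so pass a default (or wrap the call) when absence is a possible, assertable outcome.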
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cbcc5ae and 4d58233.

📒 Files selected for processing (1)
  • tests/model_serving/model_server/llmd/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/model_serving/model_server/llmd/utils.py (6)
tests/conftest.py (1)
  • prometheus (665-673)
utilities/constants.py (1)
  • Protocols (97-104)
utilities/exceptions.py (1)
  • PodContainersRestartError (109-110)
utilities/llmd_utils.py (1)
  • verify_inference_response_llmd (393-454)
utilities/monitoring.py (1)
  • get_metrics_value (10-24)
tests/model_serving/conftest.py (1)
  • protocol (26-36)
🪛 Flake8 (7.3.0)
tests/model_serving/model_server/llmd/utils.py

[error] 312-312: undefined name 'i'

(F821)

🪛 Ruff (0.14.8)
tests/model_serving/model_server/llmd/utils.py

311-311: Do not catch blind exception: Exception

(BLE001)


312-312: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


312-312: Undefined name i

(F821)


402-402: Unused noqa directive (unknown: FCN001)

Remove unused noqa directive

(RUF100)


419-419: Prefer next(...) over single element slice

Replace with next(...)

(RUF015)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

254-317: Tighten exception handling and log with traceback in prefix‑cache request loop

The loop does the right thing in terms of driving repeated requests, but the except Exception as e + LOGGER.error(...) pattern is both noisy for static analysis and hides the traceback that would be useful when debugging failures.

Given verify_inference_response_llmd documents raising InferenceResponseError and ValueError, you can narrow the catch and use LOGGER.exception:

-from utilities.exceptions import PodContainersRestartError
+from utilities.exceptions import PodContainersRestartError, InferenceResponseError
-        try:
-            verify_inference_response_llmd(
-                llm_service=llmisvc,
-                inference_config=inference_config,
-                inference_type="chat_completions",
-                protocol=Protocols.HTTPS,
-                use_default_query=True,
-                insecure=False,
-                model_name=llmisvc.instance.spec.model.name,
-                token=token,
-                authorized_user=True,
-            )
-            successful_requests += 1
-        except Exception as e:
-            LOGGER.error(f"Request {index + 1} failed: {e}")
-            failed_requests += 1
+        try:
+            verify_inference_response_llmd(
+                llm_service=llmisvc,
+                inference_config=inference_config,
+                inference_type="chat_completions",
+                protocol=Protocols.HTTPS,
+                use_default_query=True,
+                insecure=False,
+                model_name=llmisvc.instance.spec.model.name,
+                token=token,
+                authorized_user=True,
+            )
+            successful_requests += 1
+        except (InferenceResponseError, ValueError) as exc:
+            LOGGER.exception(f"Request {index + 1} failed validation: {exc}")
+            failed_requests += 1

This keeps the “keep going and count failures” behavior, but avoids a blind catch‑all and gives you stack traces in logs, which should also satisfy the BLE001/TRY400 lints.
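The "narrow catch, log with traceback, keep counting" shape can be sketched in isolation (the exception class and request helper below are stand-ins for the repo's real ones):

```python
import logging

logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger("prefix-cache-test")


class InferenceResponseError(Exception):
    """Stand-in for utilities.exceptions.InferenceResponseError."""


def send_request(index: int) -> None:
    # Hypothetical: fail on the second request to exercise the error path.
    if index == 1:
        raise InferenceResponseError("unexpected completion payload")


successful_requests = failed_requests = 0
for index in range(3):
    try:
        send_request(index)
        successful_requests += 1
    except (InferenceResponseError, ValueError):
        # LOGGER.exception records the full traceback alongside the message.
        LOGGER.exception(f"Request {index + 1} failed validation")
        failed_requests += 1
```

An unrelated bug (say, a TypeError in the helper) now propagates immediately instead of being folded into the failure count.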

🧹 Nitpick comments (1)
tests/model_serving/model_server/llmd/utils.py (1)

197-251: Workload/router pod discovery helpers look correct; consider de‑duplicating label selector logic

The new get_llmd_workload_pods and get_llmd_router_scheduler_pod helpers use selectors consistent with the inner get_llmd_pods in verify_llmd_no_failed_pods, and the label keys/values look right for workload vs router-scheduler pods.

To avoid drift if labels change later, consider reusing these helpers (or a shared selector builder) inside verify_llmd_no_failed_pods instead of duplicating the selector and workload filter logic.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d58233 and cab8064.

📒 Files selected for processing (1)
  • tests/model_serving/model_server/llmd/utils.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/model_serving/model_server/llmd/utils.py (4)
utilities/constants.py (1)
  • Protocols (97-104)
utilities/exceptions.py (1)
  • PodContainersRestartError (109-110)
utilities/llmd_utils.py (1)
  • verify_inference_response_llmd (393-454)
utilities/monitoring.py (1)
  • get_metrics_value (10-24)
🪛 Ruff (0.14.8)
tests/model_serving/model_server/llmd/utils.py

310-310: Do not catch blind exception: Exception

(BLE001)


311-311: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


417-417: Prefer next(...) over single element slice

Replace with next(...)

(RUF015)

Member

@pierDipi pierDipi left a comment


/lgtm
/approve

@pierDipi pierDipi requested a review from dbasunag December 9, 2025 17:43
@threcc threcc enabled auto-merge (squash) December 10, 2025 09:36
@threcc threcc merged commit 2472b3a into opendatahub-io:main Dec 10, 2025
8 checks passed
@github-actions

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.

@threcc threcc deleted the singlenode-precise-prefix branch December 11, 2025 14:24
mwaykole pushed a commit to mwaykole/opendatahub-tests that referenced this pull request Jan 23, 2026