Commit c59380c
feat: Add garak benchmark tests for EvalHub with KFP provider (#1361)
* feat: add smoke and t1 markers

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* test(evalhub): add garak benchmark tests with KFP provider

Add end-to-end tests for running garak security evaluations via EvalHub with the garak-kfp provider. The tests verify EvalHub health, provider availability, job submission, and job completion using an LLM-d inference simulator.

- Add test_garak.py with the TestGarakBenchmark test class
- Add EvalHub and MLflow custom resource wrappers
- Add conftest fixtures for EvalHub CR, MLflow, tenant namespace, DSPA, RBAC, and inference simulator setup/teardown
- Add utility functions for EvalHub API interactions (health, providers, job submission, polling)
- Add patched_dsc_garak_kfp fixture for DSC configuration
- Add evalhub constants for API paths, provider IDs, and tenant labels

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix(evalhub): fix garak-kfp scan timeout and SKIP results with llm-d simulator

Two bugs were found and fixed while investigating garak scan failures against the llm-d inference simulator in the KFP execution path:

**Fix 1: Scan timeout due to wrong service port in model URL**

The `garak_sim_isvc_url` fixture built the model URL with the container port (8032) appended directly. KServe creates a headless Service that maps port 80 to 8032, but with headless services that port mapping is not applied via kube-proxy; connections go straight to the pod IP. When the KFP garak pod (in the tenant namespace) tried to reach the model via the Service DNS name on port 8032, it got "Connection refused", causing garak to hang in its backoff-retry loop until the 600s scan timeout was hit.

Fix: remove the explicit port from the URL so it defaults to port 80 (the Service's exposed port), which resolves correctly via DNS to the pod IP and port 8032 for direct pod-to-pod traffic.
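The URL change described in Fix 1 can be sketched roughly as below; the helper name and exact URL shape are illustrative assumptions, not the fixture's actual code.

```python
# Illustrative sketch of the Fix 1 change. build_model_url and the
# URL shape are assumptions, not the real garak_sim_isvc_url fixture.
def build_model_url(isvc_name: str, namespace: str) -> str:
    # Before (broken): the container port was appended directly, e.g.
    #   f"http://{isvc_name}-predictor.{namespace}.svc.cluster.local:8032/v1"
    # After: omit the port so HTTP defaults to 80, the Service's exposed port.
    return f"http://{isvc_name}-predictor.{namespace}.svc.cluster.local/v1"
```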
**Fix 2: Garak scan completes but all probes report SKIP ok on 0/0**

After fixing the timeout, scans completed but every detector result was SKIP with 0 attempts evaluated. Root cause: the llm-d inference simulator defaults to a 1024-token context window, but the DAN probe prompt is ~1067 tokens. With max_tokens=150 for the completion, the total (1217) exceeded the limit and the simulator returned HTTP 400. Garak catches BadRequestError and returns None for that generation, which propagates to the detector as a None result, and the evaluator reports SKIP when passes + fails == 0.

Fix: pass `--max-model-len 8192` to the simulator via a new `max_model_len` field on `LLMdInferenceSimConfig`, giving the simulator enough context to handle all garak probe prompts.

**Fix 3: Race condition: model pod not ready before scan submission**

The `create_isvc` call uses `wait_for_predictor_pods=False` because the standard KServe wait helper does not support RawDeployment mode. This could allow the garak job to be submitted before the simulator pod was serving, causing immediate connection errors.

Fix: add a `garak_sim_isvc_ready` fixture that explicitly waits for the predictor Deployment to have ready replicas before the test proceeds, and wire it into `test_submit_garak_job`.

All 4 tests now pass end-to-end (health, providers, submit, completion).
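The context-window overflow behind Fix 2 is simple token arithmetic; the figures below are taken from the commit description above.

```python
# Token budget behind Fix 2 (figures from the commit description).
DEFAULT_CONTEXT = 1024    # llm-d simulator's default context window
DAN_PROMPT_TOKENS = 1067  # approximate length of the DAN probe prompt
MAX_TOKENS = 150          # completion budget garak requests

total = DAN_PROMPT_TOKENS + MAX_TOKENS
# 1217 > 1024, so the simulator answers HTTP 400 and garak records None,
# which the evaluator reports as SKIP (passes + fails == 0).
assert total == 1217 and total > DEFAULT_CONTEXT

RAISED_LIMIT = 8192       # --max-model-len passed via LLMdInferenceSimConfig
assert total <= RAISED_LIMIT  # probes now get real generations, not SKIP
```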
Made-with: Cursor

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix: test for Garak fix

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix: resolve pre-commit issues after rebase

- Remove duplicate test_lmeval_huggingface_model_tier2 function
- Fix bare Exception catch in wait_for_service_account
- Auto-format fixes from ruff

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix: remove duplicate database param in EvalHub and fix except syntax

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* refactor: deduplicate EVALHUB_USER_ROLE_RULES and reuse wait_for_evalhub_job

Move EVALHUB_USER_ROLE_RULES to the shared constants.py and import it in both the garak and multi-tenancy conftest files. Refactor wait_for_job_completion to delegate to wait_for_evalhub_job instead of reimplementing the polling logic. Fix the except syntax for ValueError/AttributeError.

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix: move base64 import to top of file

Address a PR review comment by moving the inline import to module level.

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* feat: update EvalHub and MLflow to match generated classes from openshift-python-wrapper#2691

Update the resource classes to match the upstream generated code:
- EvalHub: add collections and otel fields, change providers type to list[Any]
- MLflow: change from NamespacedResource to Resource (cluster-scoped), add all new spec fields, remove the namespace param from the fixture

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix: add readiness waits for EvalHub route and operator RBAC

The providers MT tests failed with 503 when run in the full suite because the OpenShift router hadn't fully configured the backend even though the deployment reported ready replicas. Add an evalhub_mt_ready fixture that polls the health endpoint before tests execute.

The garak KFP job failed because the evalhub-service SA lacked configmap permissions in the tenant namespace. The operator provisions these via RoleBindings, but the test didn't wait for that. Add a garak_tenant_rbac_ready fixture that waits for the job-config and jobs-writer RoleBindings.

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

* fix: move requests import to top of file in multitenancy conftest

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>

---------

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Co-authored-by: saichandrapandraju <saichandrapandraju@gmail.com>
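A minimal sketch of what an evalhub_mt_ready-style readiness wait does, using only the standard library; the real fixture presumably uses the project's own request helpers, and the function name here is illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout: float = 120.0, interval: float = 5.0) -> None:
    """Poll a health endpoint until it returns HTTP 200.

    Ready replicas on the Deployment do not guarantee the OpenShift router
    has configured the backend yet, so early requests can still see 503.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # route not configured yet; retry until the deadline
        time.sleep(interval)
    raise TimeoutError(f"health endpoint never became ready: {url}")
```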
1 parent 64e4e96 commit c59380c

File tree

11 files changed (+1064, -29 lines)


tests/fixtures/inference.py

Lines changed: 38 additions & 1 deletion
@@ -21,6 +21,7 @@
     KServeDeploymentType,
     LLMdInferenceSimConfig,
     RuntimeTemplates,
+    Timeout,
     VLLMGPUConfig,
 )
 from utilities.inference_utils import create_isvc
@@ -197,7 +198,7 @@ def llm_d_inference_sim_isvc(
         deployment_mode=KServeDeploymentType.RAW_DEPLOYMENT,
         model_format=LLMdInferenceSimConfig.name,
         runtime=llm_d_inference_sim_serving_runtime.name,
-        wait_for_predictor_pods=True,
+        wait_for_predictor_pods=False,
         min_replicas=1,
         max_replicas=1,
         resources={
@@ -206,6 +207,12 @@
         },
         teardown=teardown_resources,
     ) as isvc:
+        deployment = Deployment(
+            client=admin_client,
+            name=f"{isvc.name}-predictor",
+            namespace=model_namespace.name,
+        )
+        deployment.wait_for_replicas(timeout=Timeout.TIMEOUT_2MIN)
         yield isvc
@@ -326,3 +333,33 @@ def get_vllm_chat_config(namespace: str) -> dict[str, Any]:
         "port": VLLMGPUConfig.port,
     }
 }
+
+
+@pytest.fixture(scope="class")
+def patched_dsc_garak_kfp(admin_client) -> Generator[DataScienceCluster]:
+    """Configure the DataScienceCluster for Garak and KFP (Kubeflow Pipelines) testing.
+
+    This fixture patches the DataScienceCluster to enable:
+    - KServe in Headed mode (using Service port instead of Pod port)
+    - AI Pipelines component in Managed state
+    - MLflow operator in Managed state
+
+    Waits for the DSC to be ready before yielding.
+    """
+
+    dsc = get_data_science_cluster(client=admin_client)
+    with ResourceEditor(
+        patches={
+            dsc: {
+                "spec": {
+                    "components": {
+                        "kserve": {"rawDeploymentServiceConfig": "Headed"},
+                        "aipipelines": {"managementState": "Managed"},
+                        "mlflowoperator": {"managementState": "Managed"},
+                    }
+                }
+            }
+        }
+    ):
+        wait_for_dsc_status_ready(dsc_resource=dsc)
+        yield dsc
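The `deployment.wait_for_replicas` call in the diff comes from the wrapper's `Deployment` class; the pattern it implements can be sketched with the standard library alone, where `get_ready_replicas` is a stand-in for the actual cluster query.

```python
# Stdlib-only sketch of a wait_for_replicas-style readiness loop.
# get_ready_replicas is an assumed callable standing in for the real
# Kubernetes API query; names here are illustrative, not the wrapper's API.
import time
from typing import Callable


def wait_for_ready_replicas(
    get_ready_replicas: Callable[[], int],
    expected: int = 1,
    timeout: float = 120.0,
    interval: float = 2.0,
) -> None:
    """Block until the predictor Deployment reports enough ready replicas."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_ready_replicas() >= expected:
            return
        time.sleep(interval)
    raise TimeoutError("predictor Deployment never became ready")
```

Waiting on ready replicas here is what closes the Fix 3 race: the garak job is only submitted once the simulator pod is actually serving.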
