Commit c59380c
feat: Add garak benchmark tests for EvalHub with KFP provider (#1361)
* feat: add smoke and t1 markers
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* test(evalhub): add garak benchmark tests with KFP provider
Add end-to-end tests for running garak security evaluations via EvalHub
with the garak-kfp provider. Tests verify EvalHub health, provider
availability, job submission, and job completion using an LLM-d inference
simulator.
- Add test_garak.py with TestGarakBenchmark test class
- Add EvalHub and MLflow custom resource wrappers
- Add conftest fixtures for EvalHub CR, MLflow, tenant namespace, DSPA,
RBAC, and inference simulator setup/teardown
- Add utility functions for EvalHub API interactions (health, providers,
job submission, polling)
- Add patched_dsc_garak_kfp fixture for DSC configuration
- Add evalhub constants for API paths, provider IDs, and tenant labels
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix(evalhub): fix garak-kfp scan timeout and SKIP results with llm-d simulator
Three issues were found and fixed while investigating garak scan failures
against the llm-d inference simulator in the KFP execution path:
**Fix 1: Scan timeout due to wrong service port in model URL**
The `garak_sim_isvc_url` fixture was building the model URL with the
container port (8032) directly. KServe creates a headless Service that
maps port 80 → 8032, but with headless services the port mapping is not
applied via kube-proxy — connections go directly to the pod IP. When the
KFP garak pod (in the tenant namespace) tried to reach the model via the
Service DNS name on port 80, it got "Connection refused", causing garak
to hang in its backoff-retry loop until the 600s scan timeout was hit.
Fix: remove the explicit port from the URL so it defaults to port 80
(the Service's exposed port), which resolves correctly via DNS to the
pod IP and port 8032 for direct pod-to-pod traffic.
**Fix 2: Garak scan completes but all probes report SKIP ok on 0/0**
After fixing the timeout, scans completed but every detector result was
SKIP with 0 attempts evaluated. Root cause: the llm-d inference
simulator defaults to a 1024-token context window, but the DAN probe
prompt is ~1067 tokens. With max_tokens=150 for the completion, the
total (1217) exceeded the limit and the simulator returned HTTP 400.
Garak catches BadRequestError and returns None for that generation,
which propagates to the detector as a None result, and the evaluator
reports SKIP when passes + fails == 0.
Fix: pass `--max-model-len 8192` to the simulator via a new
`max_model_len` field on `LLMdInferenceSimConfig`, giving the simulator
enough context to handle all garak probe prompts.
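The arithmetic behind this fix, using the numbers from the message above (the helper is illustrative, not code from the change):

```python
# Token-budget check illustrating why the default context window failed.
PROMPT_TOKENS = 1067            # approximate DAN probe prompt size
MAX_COMPLETION_TOKENS = 150     # garak's max_tokens for the completion
DEFAULT_MAX_MODEL_LEN = 1024    # llm-d simulator default context window
NEW_MAX_MODEL_LEN = 8192        # value passed via --max-model-len

def fits_context(prompt_tokens: int, completion_tokens: int, max_model_len: int) -> bool:
    # The request is rejected (HTTP 400) when prompt + completion
    # exceeds the model's context window.
    return prompt_tokens + completion_tokens <= max_model_len

print(fits_context(PROMPT_TOKENS, MAX_COMPLETION_TOKENS, DEFAULT_MAX_MODEL_LEN))  # False
print(fits_context(PROMPT_TOKENS, MAX_COMPLETION_TOKENS, NEW_MAX_MODEL_LEN))      # True
```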
**Fix 3: Race condition — model pod not ready before scan submission**
The `create_isvc` call uses `wait_for_predictor_pods=False` because the
standard KServe wait helper does not support RawDeployment mode. This
could allow the garak job to be submitted before the simulator pod was
serving, causing immediate connection errors.
Fix: add a `garak_sim_isvc_ready` fixture that explicitly waits for the
predictor Deployment to have ready replicas before the test proceeds,
and wire it into `test_submit_garak_job`.
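A sketch of the readiness wait the fixture performs, with the replica lookup injected as a callable for illustration (the real fixture queries the predictor Deployment through the cluster client):

```python
import time

# Hypothetical sketch of a garak_sim_isvc_ready-style wait: poll until the
# predictor Deployment reports at least the expected number of ready
# replicas, or raise after a timeout.
def wait_for_ready_replicas(get_ready_replicas, expected: int = 1,
                            timeout: float = 300.0, interval: float = 1.0) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_ready_replicas() >= expected:
            return
        time.sleep(interval)
    raise TimeoutError("predictor Deployment never reported ready replicas")
```

Gating the test on this wait removes the window in which a garak job could be submitted before the simulator pod is serving.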
All 4 tests now pass end-to-end (health, providers, submit, completion).
Made-with: Cursor
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix: test for Garak fix
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix: resolve pre-commit issues after rebase
- Remove duplicate test_lmeval_huggingface_model_tier2 function
- Fix bare Exception catch in wait_for_service_account
- Auto-format fixes from ruff
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix: remove duplicate database param in EvalHub and fix except syntax
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* refactor: deduplicate EVALHUB_USER_ROLE_RULES and reuse wait_for_evalhub_job
Move EVALHUB_USER_ROLE_RULES to shared constants.py and import it in both
garak and multi-tenancy conftest files. Refactor wait_for_job_completion to
delegate to wait_for_evalhub_job instead of reimplementing polling logic.
Fix except syntax for ValueError/AttributeError.
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix: move base64 import to top of file
Address PR review comment to move the inline import to module level.
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* feat: update EvalHub and MLflow to match generated classes from openshift-python-wrapper#2691
Update resource classes to match the upstream generated code:
- EvalHub: add collections and otel fields, change providers type to list[Any]
- MLflow: change from NamespacedResource to Resource (cluster-scoped),
add all new spec fields, remove namespace param from fixture
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix: add readiness waits for EvalHub route and operator RBAC
The providers MT tests failed with 503 when run in the full suite because
the OpenShift router hadn't fully configured the backend despite the
deployment reporting ready replicas. Add evalhub_mt_ready fixture that
polls the health endpoint before tests execute.
The garak KFP job failed because the evalhub-service SA lacked configmap
permissions in the tenant namespace. The operator provisions these via
RoleBindings but the test didn't wait for that. Add garak_tenant_rbac_ready
fixture that waits for job-config and jobs-writer RoleBindings.
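The route-readiness gate can be sketched as follows (the probe is injected for illustration; the real fixture issues HTTP requests to the EvalHub health endpoint):

```python
import time

# Hypothetical sketch of an evalhub_mt_ready-style gate: keep probing the
# health endpoint until it returns 200, tolerating the 503s the OpenShift
# router can return while its backend configuration catches up with the
# Deployment's ready replicas.
def wait_for_health(probe, timeout: float = 120.0, interval: float = 2.0) -> None:
    # probe() returns an HTTP status code.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe() == 200:
            return
        time.sleep(interval)
    raise TimeoutError("EvalHub health endpoint never returned 200")
```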
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
* fix: move requests import to top of file in multitenancy conftest
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
---------
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Co-authored-by: saichandrapandraju <saichandrapandraju@gmail.com>
11 files changed, +1064 −29 lines changed