feat: Add garak benchmark tests for EvalHub with KFP provider #1361

dbasunag merged 12 commits into opendatahub-io:main
Conversation
The following are automatically added/executed:
Available user actions:
Supported labels: `/lgtm`, `/cherry-pick`, `/hold`, `/verified`, `/build-push-pr-image`, `/wip`
Force-pushed from 4e1a83a to b0ac8bf
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Add end-to-end tests for running garak security evaluations via EvalHub with the garak-kfp provider. Tests verify EvalHub health, provider availability, job submission, and job completion using an LLM-d inference simulator.

- Add test_garak.py with TestGarakBenchmark test class
- Add EvalHub and MLflow custom resource wrappers
- Add conftest fixtures for EvalHub CR, MLflow, tenant namespace, DSPA, RBAC, and inference simulator setup/teardown
- Add utility functions for EvalHub API interactions (health, providers, job submission, polling)
- Add patched_dsc_garak_kfp fixture for DSC configuration
- Add evalhub constants for API paths, provider IDs, and tenant labels

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
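The job-polling utility mentioned above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name, the job-status strings, and the injectable `fetch_status` callable are all assumptions made for the sketch.

```python
import time
from typing import Callable

# Assumed terminal states for an EvalHub evaluation job; the real API may
# use different status strings.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout: float = 600.0,
    interval: float = 10.0,
) -> str:
    """Poll a job-status callable until the job reaches a terminal state.

    `fetch_status` would typically wrap a GET against the EvalHub jobs API;
    it is injected here so the helper stays transport-agnostic and testable.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout}s")
        time.sleep(interval)
```

Keeping the HTTP call behind a callable means the completion test only asserts on the polling logic, while the conftest utilities own the actual request and TLS details.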
…simulator

Three bugs were found and fixed while investigating garak scan failures against the llm-d inference simulator in the KFP execution path:

**Fix 1: Scan timeout due to wrong service port in model URL**

The `garak_sim_isvc_url` fixture was building the model URL with the container port (8032) directly. KServe creates a headless Service that maps port 80 → 8032, but with headless services the port mapping is not applied via kube-proxy — connections go directly to the pod IP. When the KFP garak pod (in the tenant namespace) tried to reach the model via the Service DNS name on port 80, it got "Connection refused", causing garak to hang in its backoff-retry loop until the 600s scan timeout was hit.

Fix: remove the explicit port from the URL so it defaults to port 80 (the Service's exposed port), which resolves correctly via DNS to the pod IP and port 8032 for direct pod-to-pod traffic.

**Fix 2: Garak scan completes but all probes report SKIP ok on 0/0**

After fixing the timeout, scans completed but every detector result was SKIP with 0 attempts evaluated. Root cause: the llm-d inference simulator defaults to a 1024-token context window, but the DAN probe prompt is ~1067 tokens. With max_tokens=150 for the completion, the total (1217) exceeded the limit and the simulator returned HTTP 400. Garak catches BadRequestError and returns None for that generation, which propagates to the detector as a None result, and the evaluator reports SKIP when passes + fails == 0.

Fix: pass `--max-model-len 8192` to the simulator via a new `max_model_len` field on `LLMdInferenceSimConfig`, giving the simulator enough context to handle all garak probe prompts.

**Fix 3: Race condition — model pod not ready before scan submission**

The `create_isvc` call uses `wait_for_predictor_pods=False` because the standard KServe wait helper does not support RawDeployment mode. This could allow the garak job to be submitted before the simulator pod was serving, causing immediate connection errors.

Fix: add a `garak_sim_isvc_ready` fixture that explicitly waits for the predictor Deployment to have ready replicas before the test proceeds, and wire it into `test_submit_garak_job`.

All 4 tests now pass end-to-end (health, providers, submit, completion).

Made-with: Cursor
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
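The readiness wait in Fix 3 amounts to polling the predictor Deployment's `status.readyReplicas` until it reaches the expected count. A minimal stdlib sketch with the replica getter injected (the real fixture uses the suite's own Kubernetes wrappers; the names below are assumptions):

```python
import time
from typing import Callable, Optional


def wait_for_ready_replicas(
    get_ready_replicas: Callable[[], Optional[int]],
    expected: int = 1,
    timeout: float = 300.0,
    interval: float = 5.0,
) -> int:
    """Block until the Deployment reports at least `expected` ready replicas.

    `get_ready_replicas` would read `status.readyReplicas` from the predictor
    Deployment. Kubernetes omits that field entirely while it is zero, hence
    the Optional return and the `or 0` below.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ready = get_ready_replicas() or 0
        if ready >= expected:
            return ready
        time.sleep(interval)
    raise TimeoutError(
        f"deployment never reached {expected} ready replica(s) in {timeout}s"
    )
```

Gating the job-submission test on this wait closes the race: the scan can only be submitted once the simulator is actually serving.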
- Remove duplicate test_lmeval_huggingface_model_tier2 function
- Fix bare Exception catch in wait_for_service_account
- Auto-format fixes from ruff

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
dbasunag left a comment
Since this is a big PR, won't hold it for the import comment, but please address it in a follow up PR.
🧹 Nitpick comments (1)
tests/model_explainability/evalhub/multitenancy/conftest.py (1)
102-126: Consider narrowing the exception catch for better diagnostics.

Line 119: `exceptions_dict={Exception: []}` suppresses all exceptions during retry, which could mask configuration issues (e.g., persistent SSL certificate errors, DNS resolution failures). Using `requests.exceptions.RequestException` would still allow retries on transient network failures while surfacing unexpected errors.

♻️ Suggested refinement

```diff
+from requests.exceptions import RequestException
+
 @pytest.fixture(scope="class")
 def evalhub_mt_ready(
     evalhub_mt_route: Route,
     evalhub_mt_ca_bundle_file: str,
 ) -> None:
     ...
     try:
         for sample in TimeoutSampler(
             wait_timeout=120,
             sleep=5,
             func=lambda: requests.get(url, verify=evalhub_mt_ca_bundle_file, timeout=10),
-            exceptions_dict={Exception: []},
+            exceptions_dict={RequestException: []},
         ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_explainability/evalhub/multitenancy/conftest.py` around lines 102 - 126, The TimeoutSampler in the evalhub_mt_ready fixture currently swallows all exceptions via exceptions_dict={Exception: []}; narrow this to only network/request errors by replacing Exception with requests.exceptions.RequestException in the TimeoutSampler call (and add the requests.exceptions import if missing) so transient network/SSL/DNS errors are retried but unexpected exceptions are not suppressed; keep the rest of the polling logic (url, verify, timeout) unchanged.
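The distinction the review draws can be shown with a small stdlib retry loop that mimics `TimeoutSampler`'s `exceptions_dict` behaviour: only the listed exception types are retried, so programming errors surface immediately. This is an illustrative sketch, not the actual TimeoutSampler implementation.

```python
import time


def sample_until(func, retry_on: tuple, timeout: float = 10.0, interval: float = 0.5):
    """Call `func` until it succeeds, retrying only exceptions in `retry_on`.

    Any exception type not listed in `retry_on` (e.g. a TypeError from a
    coding bug) propagates immediately instead of being silently retried
    until the timeout expires.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            return func()
        except retry_on as exc:  # transient, expected failure: retry
            last_exc = exc
            time.sleep(interval)
    raise TimeoutError("sampling timed out") from last_exc
```

With `retry_on=(requests.exceptions.RequestException,)`, transient connection, DNS, or SSL errors are retried, while an `AttributeError` in the sampled function fails the fixture immediately with a useful traceback.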
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0563c3a4-d3ff-4f61-9c29-89a849823393
📒 Files selected for processing (1)
tests/model_explainability/evalhub/multitenancy/conftest.py
@coderabbitai review

✅ Actions performed: Review triggered.
🧹 Nitpick comments (1)
tests/model_explainability/evalhub/multitenancy/conftest.py (1)
114-125: Overly broad exception suppression masks programming errors.

`exceptions_dict={Exception: []}` catches all exceptions including `AttributeError`, `TypeError`, and other programming errors that indicate bugs rather than transient network issues. This can hide legitimate failures and make debugging difficult. Narrow the scope to network-related exceptions:

♻️ Proposed fix

```diff
+from requests.exceptions import RequestException
+
 ...
     try:
         for sample in TimeoutSampler(
             wait_timeout=120,
             sleep=5,
             func=lambda: requests.get(url, verify=evalhub_mt_ca_bundle_file, timeout=10),
-            exceptions_dict={Exception: []},
+            exceptions_dict={RequestException: []},
         ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_explainability/evalhub/multitenancy/conftest.py` around lines 114 - 125, The TimeoutSampler call currently uses exceptions_dict={Exception: []}, which is too broad and hides programming errors; update the TimeoutSampler invocation (the call that passes exceptions_dict) to only catch network-related exceptions (e.g., requests.exceptions.RequestException, requests.exceptions.ConnectionError, requests.exceptions.Timeout and any urllib3 connection exceptions your TimeoutSampler expects) instead of Exception, leaving TimeoutExpiredError handling intact; ensure the change is applied where TimeoutSampler is used with func=lambda: requests.get(...) so failures still surface for AttributeError/TypeError while transient network errors are retried and the RuntimeError raising that references evalhub_mt_route.host remains unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 560dec88-3cee-4b79-b452-66cdc25736aa
📒 Files selected for processing (1)
tests/model_explainability/evalhub/multitenancy/conftest.py
/lgtm

Status of building tag latest: success.