feat: Add garak benchmark tests for EvalHub with KFP provider #1361

dbasunag merged 12 commits into opendatahub-io:main
Conversation
The following are automatically added/executed:
Available user actions:
Supported labels: `/lgtm`, `/cherry-pick`, `/hold`, `/verified`, `/build-push-pr-image`, `/wip`
Force-pushed from 4e1a83a to b0ac8bf
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Add end-to-end tests for running garak security evaluations via EvalHub with the garak-kfp provider. Tests verify EvalHub health, provider availability, job submission, and job completion using an LLM-d inference simulator.

- Add test_garak.py with TestGarakBenchmark test class
- Add EvalHub and MLflow custom resource wrappers
- Add conftest fixtures for EvalHub CR, MLflow, tenant namespace, DSPA, RBAC, and inference simulator setup/teardown
- Add utility functions for EvalHub API interactions (health, providers, job submission, polling)
- Add patched_dsc_garak_kfp fixture for DSC configuration
- Add evalhub constants for API paths, provider IDs, and tenant labels

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
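The job-polling utility mentioned above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the function name, the job-status strings, and the injectable `fetch_status` callable are all assumptions made for the sketch.

```python
import time
from typing import Callable

# Assumed terminal states for an EvalHub evaluation job; the real API may
# use different status strings.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout: float = 600.0,
    interval: float = 10.0,
) -> str:
    """Poll a job-status callable until the job reaches a terminal state.

    `fetch_status` would typically wrap a GET against the EvalHub jobs API;
    it is injected here so the helper stays transport-agnostic and testable.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout}s")
        time.sleep(interval)
```

Keeping the HTTP call behind a callable means the completion test only asserts on the polling logic, while the conftest utilities own the actual request and TLS details.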
…simulator

Three bugs were found and fixed while investigating garak scan failures against the llm-d inference simulator in the KFP execution path:

**Fix 1: Scan timeout due to wrong service port in model URL**

The `garak_sim_isvc_url` fixture was building the model URL with the container port (8032) directly. KServe creates a headless Service that maps port 80 → 8032, but with headless services the port mapping is not applied via kube-proxy — connections go directly to the pod IP. When the KFP garak pod (in the tenant namespace) tried to reach the model via the Service DNS name on port 80, it got "Connection refused", causing garak to hang in its backoff-retry loop until the 600s scan timeout was hit.

Fix: remove the explicit port from the URL so it defaults to port 80 (the Service's exposed port), which resolves correctly via DNS to the pod IP and port 8032 for direct pod-to-pod traffic.

**Fix 2: Garak scan completes but all probes report SKIP ok on 0/0**

After fixing the timeout, scans completed but every detector result was SKIP with 0 attempts evaluated. Root cause: the llm-d inference simulator defaults to a 1024-token context window, but the DAN probe prompt is ~1067 tokens. With max_tokens=150 for the completion, the total (1217) exceeded the limit and the simulator returned HTTP 400. Garak catches BadRequestError and returns None for that generation, which propagates to the detector as a None result, and the evaluator reports SKIP when passes + fails == 0.

Fix: pass `--max-model-len 8192` to the simulator via a new `max_model_len` field on `LLMdInferenceSimConfig`, giving the simulator enough context to handle all garak probe prompts.

**Fix 3: Race condition — model pod not ready before scan submission**

The `create_isvc` call uses `wait_for_predictor_pods=False` because the standard KServe wait helper does not support RawDeployment mode. This could allow the garak job to be submitted before the simulator pod was serving, causing immediate connection errors.

Fix: add a `garak_sim_isvc_ready` fixture that explicitly waits for the predictor Deployment to have ready replicas before the test proceeds, and wire it into `test_submit_garak_job`.

All 4 tests now pass end-to-end (health, providers, submit, completion).

Made-with: Cursor
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
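The readiness wait in Fix 3 amounts to polling the predictor Deployment's `status.readyReplicas` until it reaches the expected count. A minimal stdlib sketch with the replica getter injected (the real fixture uses the suite's own Kubernetes wrappers; the names below are assumptions):

```python
import time
from typing import Callable, Optional


def wait_for_ready_replicas(
    get_ready_replicas: Callable[[], Optional[int]],
    expected: int = 1,
    timeout: float = 300.0,
    interval: float = 5.0,
) -> int:
    """Block until the Deployment reports at least `expected` ready replicas.

    `get_ready_replicas` would read `status.readyReplicas` from the predictor
    Deployment. Kubernetes omits that field entirely while it is zero, hence
    the Optional return and the `or 0` below.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        ready = get_ready_replicas() or 0
        if ready >= expected:
            return ready
        time.sleep(interval)
    raise TimeoutError(
        f"deployment never reached {expected} ready replica(s) in {timeout}s"
    )
```

Gating the job-submission test on this wait closes the race: the scan can only be submitted once the simulator is actually serving.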
- Remove duplicate test_lmeval_huggingface_model_tier2 function
- Fix bare Exception catch in wait_for_service_account
- Auto-format fixes from ruff

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
dbasunag left a comment
Since this is a big PR, won't hold it for the import comment, but please address it in a follow up PR.
🧹 Nitpick comments (1)
tests/model_explainability/evalhub/multitenancy/conftest.py (1)
102-126: Consider narrowing the exception catch for better diagnostics.

Line 119: `exceptions_dict={Exception: []}` suppresses all exceptions during retry, which could mask configuration issues (e.g., persistent SSL certificate errors, DNS resolution failures). Using `requests.exceptions.RequestException` would still allow retries on transient network failures while surfacing unexpected errors.

♻️ Suggested refinement

```diff
+from requests.exceptions import RequestException
+
 @pytest.fixture(scope="class")
 def evalhub_mt_ready(
     evalhub_mt_route: Route,
     evalhub_mt_ca_bundle_file: str,
 ) -> None:
     ...
     try:
         for sample in TimeoutSampler(
             wait_timeout=120,
             sleep=5,
             func=lambda: requests.get(url, verify=evalhub_mt_ca_bundle_file, timeout=10),
-            exceptions_dict={Exception: []},
+            exceptions_dict={RequestException: []},
         ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_explainability/evalhub/multitenancy/conftest.py` around lines 102 - 126, The TimeoutSampler in the evalhub_mt_ready fixture currently swallows all exceptions via exceptions_dict={Exception: []}; narrow this to only network/request errors by replacing Exception with requests.exceptions.RequestException in the TimeoutSampler call (and add the requests.exceptions import if missing) so transient network/SSL/DNS errors are retried but unexpected exceptions are not suppressed; keep the rest of the polling logic (url, verify, timeout) unchanged.
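The distinction the review draws can be shown with a small stdlib retry loop that mimics `TimeoutSampler`'s `exceptions_dict` behaviour: only the listed exception types are retried, so programming errors surface immediately. This is an illustrative sketch, not the actual TimeoutSampler implementation.

```python
import time


def sample_until(func, retry_on: tuple, timeout: float = 10.0, interval: float = 0.5):
    """Call `func` until it succeeds, retrying only exceptions in `retry_on`.

    Any exception type not listed in `retry_on` (e.g. a TypeError from a
    coding bug) propagates immediately instead of being silently retried
    until the timeout expires.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            return func()
        except retry_on as exc:  # transient, expected failure: retry
            last_exc = exc
            time.sleep(interval)
    raise TimeoutError("sampling timed out") from last_exc
```

With `retry_on=(requests.exceptions.RequestException,)`, transient connection, DNS, or SSL errors are retried, while an `AttributeError` in the sampled function fails the fixture immediately with a useful traceback.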
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 0563c3a4-d3ff-4f61-9c29-89a849823393
📒 Files selected for processing (1)
tests/model_explainability/evalhub/multitenancy/conftest.py
@coderabbitai review

✅ Actions performed: Review triggered.
🧹 Nitpick comments (1)
tests/model_explainability/evalhub/multitenancy/conftest.py (1)
114-125: Overly broad exception suppression masks programming errors.

`exceptions_dict={Exception: []}` catches all exceptions including `AttributeError`, `TypeError`, and other programming errors that indicate bugs rather than transient network issues. This can hide legitimate failures and make debugging difficult. Narrow the scope to network-related exceptions:

♻️ Proposed fix

```diff
+from requests.exceptions import RequestException
+
 ...
     try:
         for sample in TimeoutSampler(
             wait_timeout=120,
             sleep=5,
             func=lambda: requests.get(url, verify=evalhub_mt_ca_bundle_file, timeout=10),
-            exceptions_dict={Exception: []},
+            exceptions_dict={RequestException: []},
         ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/model_explainability/evalhub/multitenancy/conftest.py` around lines 114 - 125, The TimeoutSampler call currently uses exceptions_dict={Exception: []}, which is too broad and hides programming errors; update the TimeoutSampler invocation (the call that passes exceptions_dict) to only catch network-related exceptions (e.g., requests.exceptions.RequestException, requests.exceptions.ConnectionError, requests.exceptions.Timeout and any urllib3 connection exceptions your TimeoutSampler expects) instead of Exception, leaving TimeoutExpiredError handling intact; ensure the change is applied where TimeoutSampler is used with func=lambda: requests.get(...) so failures still surface for AttributeError/TypeError while transient network errors are retried and the RuntimeError raising that references evalhub_mt_route.host remains unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 560dec88-3cee-4b79-b452-66cdc25736aa
📒 Files selected for processing (1)
tests/model_explainability/evalhub/multitenancy/conftest.py
/lgtm

Status of building tag latest: success.