
feat: Add garak benchmark tests for EvalHub with KFP provider #1361

Merged
dbasunag merged 12 commits into opendatahub-io:main from sheltoncyril:garak-kfp
Apr 7, 2026

Conversation

@sheltoncyril
Contributor

@sheltoncyril sheltoncyril commented Apr 6, 2026

Pull Request

Summary

  • Add end-to-end tests for running garak security evaluations via EvalHub with the garak-kfp provider
  • Tests use an LLM-d inference simulator instead of a real model to avoid GPU requirements
  • Add EvalHub and MLflow custom resource wrappers in utilities/resources/

Related Issues

  • Fixes:
  • JIRA:

Please review and indicate how it has been tested

  • Locally
  • Jenkins

Additional Requirements

  • If this PR introduces a new test image, did you create a PR to mirror it in the disconnected environment?
  • If this PR introduces new marker(s) or adds a new component, was a ticket created to update the relevant Jenkins job?

Summary by CodeRabbit

  • New Features

    • MLflow resource support, EvalHub provider/config options, and a new max-model-length setting.
  • Tests

    • New end-to-end Garak EvalHub benchmark tests (submission, polling, completion).
    • Expanded multi-tenant EvalHub fixtures: tenant RBAC/user flows, DSPA/MinIO setup, MLflow test instance, simulated inference-service helpers, job/ServiceAccount utilities, and improved provider validation.
  • Chores

    • Pinned MinIO container image and added garak benchmarking constants.

@github-actions

github-actions bot commented Apr 6, 2026

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, comment /wip. To remove the WIP state, comment /wip cancel.
  • To block merging of a PR, comment /hold. To unblock merging, comment /hold cancel.
  • To mark a PR as approved, comment /lgtm. To remove approval, comment /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified. To un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name>. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR is created and linked to the current PR.
  • To build and push an image to quay, comment /build-push-pr-image. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/lgtm', '/cherry-pick', '/hold', '/verified', '/build-push-pr-image', '/wip'}

@sheltoncyril sheltoncyril marked this pull request as ready for review April 6, 2026 18:08
@sheltoncyril sheltoncyril requested review from a team and kpunwatk as code owners April 6, 2026 18:08
@sheltoncyril sheltoncyril force-pushed the garak-kfp branch 3 times, most recently from 4e1a83a to b0ac8bf on April 6, 2026 18:37
@sheltoncyril sheltoncyril changed the title [WIP] feat: Add garak benchmark tests for EvalHub with KFP provider feat: Add garak benchmark tests for EvalHub with KFP provider Apr 6, 2026
Contributor

@saichandrapandraju saichandrapandraju left a comment


/lgtm

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
sheltoncyril and others added 4 commits April 6, 2026 20:04
Add end-to-end tests for running garak security evaluations via EvalHub
with the garak-kfp provider. Tests verify EvalHub health, provider
availability, job submission, and job completion using an LLM-d inference
simulator.

- Add test_garak.py with TestGarakBenchmark test class
- Add EvalHub and MLflow custom resource wrappers
- Add conftest fixtures for EvalHub CR, MLflow, tenant namespace, DSPA,
  RBAC, and inference simulator setup/teardown
- Add utility functions for EvalHub API interactions (health, providers,
  job submission, polling)
- Add patched_dsc_garak_kfp fixture for DSC configuration
- Add evalhub constants for API paths, provider IDs, and tenant labels

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
…simulator

Two bugs were found and fixed while investigating garak scan failures
against the llm-d inference simulator in the KFP execution path:

**Fix 1: Scan timeout due to wrong service port in model URL**

The `garak_sim_isvc_url` fixture was building the model URL with the
container port (8032) directly. KServe creates a headless Service that
maps port 80 → 8032, but with headless services the port mapping is not
applied via kube-proxy — connections go directly to the pod IP. When the
KFP garak pod (in the tenant namespace) tried to reach the model via the
Service DNS name on port 80, it got "Connection refused", causing garak
to hang in its backoff-retry loop until the 600s scan timeout was hit.

Fix: remove the explicit port from the URL so it defaults to port 80
(the Service's exposed port), which resolves correctly via DNS to the
pod IP and port 8032 for direct pod-to-pod traffic.

**Fix 2: Garak scan completes but all probes report SKIP ok on 0/0**

After fixing the timeout, scans completed but every detector result was
SKIP with 0 attempts evaluated. Root cause: the llm-d inference
simulator defaults to a 1024-token context window, but the DAN probe
prompt is ~1067 tokens. With max_tokens=150 for the completion, the
total (1217) exceeded the limit and the simulator returned HTTP 400.
Garak catches BadRequestError and returns None for that generation,
which propagates to the detector as a None result, and the evaluator
reports SKIP when passes + fails == 0.

Fix: pass `--max-model-len 8192` to the simulator via a new
`max_model_len` field on `LLMdInferenceSimConfig`, giving the simulator
enough context to handle all garak probe prompts.
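The token-budget failure in Fix 2 reduces to simple arithmetic. The sketch below uses a hypothetical helper (not code from this PR) to show why the simulator's default 1024-token window rejects the DAN probe request while 8192 accepts it:

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """Return True if prompt plus requested completion fits the context window."""
    return prompt_tokens + max_tokens <= max_model_len

# The failing case from the PR: a ~1067-token DAN prompt plus max_tokens=150
# totals 1217, exceeding the simulator's 1024-token default (HTTP 400, which
# garak turns into a None generation and the evaluator reports as SKIP).
assert not fits_context(1067, 150, 1024)
# After passing --max-model-len 8192, the same request fits.
assert fits_context(1067, 150, 8192)
```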

**Fix 3: Race condition — model pod not ready before scan submission**

The `create_isvc` call uses `wait_for_predictor_pods=False` because the
standard KServe wait helper does not support RawDeployment mode. This
could allow the garak job to be submitted before the simulator pod was
serving, causing immediate connection errors.

Fix: add a `garak_sim_isvc_ready` fixture that explicitly waits for the
predictor Deployment to have ready replicas before the test proceeds,
and wire it into `test_submit_garak_job`.
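The readiness wait in Fix 3 can be sketched as a plain polling loop. This illustrates the approach only; the repo's fixture uses its own wait helpers, and `get_ready_replicas` here is a hypothetical callable returning the Deployment's `status.readyReplicas`:

```python
import time


def wait_for_ready_replicas(
    get_ready_replicas,
    expected: int = 1,
    timeout: float = 300,
    interval: float = 5,
) -> None:
    """Poll until the predictor Deployment reports enough ready replicas.

    Raises TimeoutError if the Deployment never becomes ready, so the test
    fails fast instead of submitting a scan against a non-serving pod.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_ready_replicas() >= expected:
            return
        time.sleep(interval)
    raise TimeoutError(f"Deployment not ready after {timeout}s")
```

Gating the scan submission on this wait removes the race: the garak job only runs once the simulator pod is actually serving.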

All 4 tests now pass end-to-end (health, providers, submit, completion).

Made-with: Cursor
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
- Remove duplicate test_lmeval_huggingface_model_tier2 function
- Fix bare Exception catch in wait_for_service_account
- Auto-format fixes from ruff

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
dbasunag
dbasunag previously approved these changes Apr 6, 2026
Collaborator

@dbasunag dbasunag left a comment


Since this is a big PR, I won't hold it for the import comment, but please address it in a follow-up PR.

Signed-off-by: Shelton Cyril <sheltoncyril@gmail.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/model_explainability/evalhub/multitenancy/conftest.py (1)

102-126: Consider narrowing the exception catch for better diagnostics.

Line 119: exceptions_dict={Exception: []} suppresses all exceptions during retry, which could mask configuration issues (e.g., persistent SSL certificate errors, DNS resolution failures). Using requests.exceptions.RequestException would still allow retries on transient network failures while surfacing unexpected errors.

♻️ Suggested refinement
+from requests.exceptions import RequestException
+
 @pytest.fixture(scope="class")
 def evalhub_mt_ready(
     evalhub_mt_route: Route,
     evalhub_mt_ca_bundle_file: str,
 ) -> None:
     ...
     try:
         for sample in TimeoutSampler(
             wait_timeout=120,
             sleep=5,
             func=lambda: requests.get(url, verify=evalhub_mt_ca_bundle_file, timeout=10),
-            exceptions_dict={Exception: []},
+            exceptions_dict={RequestException: []},
         ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_explainability/evalhub/multitenancy/conftest.py` around lines 102
- 126, The TimeoutSampler in the evalhub_mt_ready fixture currently swallows all
exceptions via exceptions_dict={Exception: []}; narrow this to only
network/request errors by replacing Exception with
requests.exceptions.RequestException in the TimeoutSampler call (and add the
requests.exceptions import if missing) so transient network/SSL/DNS errors are
retried but unexpected exceptions are not suppressed; keep the rest of the
polling logic (url, verify, timeout) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0563c3a4-d3ff-4f61-9c29-89a849823393

📥 Commits

Reviewing files that changed from the base of the PR and between 1eb73d2 and b8487fa.

📒 Files selected for processing (1)
  • tests/model_explainability/evalhub/multitenancy/conftest.py

@sheltoncyril sheltoncyril requested review from SB159 and dbasunag April 7, 2026 12:28
@sheltoncyril
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/model_explainability/evalhub/multitenancy/conftest.py (1)

114-125: Overly broad exception suppression masks programming errors.

exceptions_dict={Exception: []} catches all exceptions including AttributeError, TypeError, and other programming errors that indicate bugs rather than transient network issues. This can hide legitimate failures and make debugging difficult.

Narrow the scope to network-related exceptions:

♻️ Proposed fix
+from requests.exceptions import RequestException
+
 ...
 
     try:
         for sample in TimeoutSampler(
             wait_timeout=120,
             sleep=5,
             func=lambda: requests.get(url, verify=evalhub_mt_ca_bundle_file, timeout=10),
-            exceptions_dict={Exception: []},
+            exceptions_dict={RequestException: []},
         ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/model_explainability/evalhub/multitenancy/conftest.py` around lines 114
- 125, The TimeoutSampler call currently uses exceptions_dict={Exception: []},
which is too broad and hides programming errors; update the TimeoutSampler
invocation (the call that passes exceptions_dict) to only catch network-related
exceptions (e.g., requests.exceptions.RequestException,
requests.exceptions.ConnectionError, requests.exceptions.Timeout and any urllib3
connection exceptions your TimeoutSampler expects) instead of Exception, leaving
TimeoutExpiredError handling intact; ensure the change is applied where
TimeoutSampler is used with func=lambda: requests.get(...) so failures still
surface for AttributeError/TypeError while transient network errors are retried
and the RuntimeError raising that references evalhub_mt_route.host remains
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 560dec88-3cee-4b79-b452-66cdc25736aa

📥 Commits

Reviewing files that changed from the base of the PR and between 1eb73d2 and aac8b2f.

📒 Files selected for processing (1)
  • tests/model_explainability/evalhub/multitenancy/conftest.py

Contributor

@SB159 SB159 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@mwaykole
Member

mwaykole commented Apr 7, 2026

/lgtm

@sheltoncyril sheltoncyril enabled auto-merge (squash) April 7, 2026 12:50
@dbasunag dbasunag disabled auto-merge April 7, 2026 12:51
@dbasunag dbasunag merged commit c59380c into opendatahub-io:main Apr 7, 2026
14 checks passed
@github-actions

github-actions bot commented Apr 7, 2026

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.


6 participants