
fix(lmeval): auto-inject cluster CA bundle for HTTPS endpoints; allow re-run after spec change (#729)

Merged
SudipSinha merged 7 commits into trustyai-explainability:main from SudipSinha:fix/lmeval-https-ssl-rerun
May 14, 2026
Conversation

@SudipSinha SudipSinha commented May 11, 2026

Jira: RHOAIENG-60487

Summary

Fixes two issues in the LMEval operator, plus a pre-existing metrics bug surfaced by the new tests.

SSL failure on HTTPS endpoints (external routes and cluster-internal services)

When a model is deployed with "Make model deployment available through an external route" enabled, OpenShift creates an HTTPS ingress signed by the cluster's self-signed CA. Similarly, cluster-internal services (*.svc.cluster.local) use certificates signed by the OpenShift service-serving CA. The lm-eval pod knows about neither CA, so Python's requests library rejects the connection:

ssl.SSLCertVerificationError: certificate verify failed: self-signed certificate in certificate chain

Because REQUESTS_CA_BUNDLE replaces Python's default trust store entirely, mounting a single CA source would drop trust in every other CA.

Fix: in handleNewCR and handleResume, if any modelArg named base_url uses an https:// scheme and verify_certificate is not already set, the operator collects CA certificates from multiple well-known cluster sources:

  • odh-trusted-ca-bundle: ca-bundle.crt and odh-ca-bundle.crt keys (public/system CAs, injected by the RHOAI operator)
  • openshift-service-ca.crt: service-ca.crt key (service-serving CA for *.svc.cluster.local)

All found PEM data is concatenated into a per-job merged ConfigMap (<jobName>-ca-bundle) with an owner reference to the LMEvalJob for automatic GC. The merged ConfigMap is mounted at /etc/ssl/certs/odh-ca-bundle.crt and REQUESTS_CA_BUNDLE is set to that path. Each source is best-effort — missing ConfigMaps are skipped, but real API errors (Create/Update failures on the merged ConfigMap) are propagated so the reconciler retries.
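The collect-and-merge step can be condensed into a pure function. The ConfigMap and key names below come from the list above, but the helper itself (its name, parameters, and the map-based stand-in for the k8s API) is an illustrative sketch, not the operator's actual findAndMergeCABundle:

```go
package main

import (
	"fmt"
	"strings"
)

// caSource names a ConfigMap and the keys that may hold PEM data.
type caSource struct {
	configMap string
	keys      []string
}

// mergeCABundles concatenates every non-empty PEM payload found across
// the sources. Missing ConfigMaps and empty keys are skipped, matching
// the best-effort behavior described above.
func mergeCABundles(cluster map[string]map[string]string, sources []caSource) string {
	var parts []string
	for _, src := range sources {
		data, ok := cluster[src.configMap]
		if !ok {
			continue // ConfigMap absent in the namespace: skip, don't fail
		}
		for _, key := range src.keys {
			if pem := strings.TrimSpace(data[key]); pem != "" {
				parts = append(parts, pem)
			}
		}
	}
	return strings.Join(parts, "\n")
}

func main() {
	cluster := map[string]map[string]string{
		"odh-trusted-ca-bundle":    {"ca-bundle.crt": "PEM-A", "odh-ca-bundle.crt": ""},
		"openshift-service-ca.crt": {"service-ca.crt": "PEM-B"},
	}
	sources := []caSource{
		{"odh-trusted-ca-bundle", []string{"ca-bundle.crt", "odh-ca-bundle.crt"}},
		{"openshift-service-ca.crt", []string{"service-ca.crt"}},
	}
	fmt.Println(mergeCABundles(cluster, sources)) // PEM-A and PEM-B concatenated
}
```

In the real controller the loop reads ConfigMaps through the client and distinguishes NotFound (skip) from other API errors (propagate), which the map lookup above only approximates.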

Completed job cannot be re-run after spec edit

Once a job reaches CompleteJobState (including the Failed reason), editing the spec has no effect. The operator does not watch for spec changes on completed jobs, and because status is a separate subresource, there is no supported way to reset it with oc patch.

Fix: handleNewCR and handleResume both write the current metadata.Generation into an annotation (trustyai.opendatahub.io/last-scheduled-generation) each time a pod is created. In Reconcile, a new guard checks completed jobs: if the current generation exceeds the stored value, the completed pod is deleted and the status is fully reset to New (clearing lastScheduleTime, completeTime, podName, reason, message, progressBars, results); the next reconcile cycle re-runs the job. The lastGen > 0 guard means jobs created before this change are never accidentally reset. Recording the annotation in handleResume also prevents a spurious extra re-run when the spec is edited while a job is suspended and then resumed.
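The guard can be sketched as a pure function; the function name and the "Complete" state literal are assumptions for illustration (the real check lives inside Reconcile and works on the typed job object):

```go
package main

import (
	"fmt"
	"strconv"
)

const lastGenAnnotation = "trustyai.opendatahub.io/last-scheduled-generation"

// shouldResetCompletedJob resets a completed job only when an annotation
// was recorded (lastGen > 0) and the spec has changed since scheduling
// (generation > lastGen).
func shouldResetCompletedJob(state string, annotations map[string]string, generation int64) bool {
	if state != "Complete" {
		return false
	}
	lastGen, err := strconv.ParseInt(annotations[lastGenAnnotation], 10, 64)
	if err != nil || lastGen <= 0 {
		return false // jobs created before this change carry no annotation and are never reset
	}
	return generation > lastGen
}

func main() {
	ann := map[string]string{lastGenAnnotation: "2"}
	fmt.Println(shouldResetCompletedJob("Complete", ann, 3)) // true: spec edited after scheduling
	fmt.Println(shouldResetCompletedJob("Complete", ann, 2)) // false: unchanged spec
	fmt.Println(shouldResetCompletedJob("Complete", nil, 5)) // false: pre-existing job, no annotation
}
```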

Pre-existing bug fix: metrics label cardinality panic

The createJobCreationMetrics function in the controller builds a Prometheus CounterVec with label names derived from the first LMEvalJob it processes. If a subsequent job has different modelArgs (e.g. one has base_url and another doesn't), the label count mismatches and Prometheus panics with inconsistent label cardinality. This bug exists on main but is not surfaced because there is only one LMES test case. Our new CA bundle tests create jobs with varying modelArgs, exposing the panic.

Fix: always initialize both model_name and base_url labels to empty strings before populating from modelArgs, ensuring every job produces the same 6-label set. This is committed separately from the main fixes.
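The fix reduces to seeding the label map before reading modelArgs, so two jobs with different args still emit the same label set and a CounterVec registered once never sees a label-count mismatch. A sketch (the mapping of the "model" arg onto the model_name label is an assumption of this example):

```go
package main

import "fmt"

type modelArg struct{ Name, Value string }

// jobLabels always initializes model_name and base_url to empty strings
// before populating from modelArgs, so the label set is identical for
// every job regardless of which args it carries.
func jobLabels(args []modelArg) map[string]string {
	labels := map[string]string{"model_name": "", "base_url": ""}
	for _, a := range args {
		switch a.Name {
		case "model":
			labels["model_name"] = a.Value
		case "base_url":
			labels["base_url"] = a.Value
		}
	}
	return labels
}

func main() {
	withURL := jobLabels([]modelArg{{"model", "tinyllama"}, {"base_url", "https://m/v1"}})
	withoutURL := jobLabels([]modelArg{{"model", "tinyllama"}})
	fmt.Println(len(withURL) == len(withoutURL)) // true: same label count either way
}
```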

Steps to reproduce

Setup

  1. Provision an OpenShift cluster with GPUs
  2. Install RHOAI 3.4ea2 (or later)
  3. Deploy an LLM (e.g. TinyLlama) with "Make model deployment available through an external route" enabled — this creates an HTTPS ingress signed by the cluster's self-signed CA
  4. Enable evaluations in the UI by setting disableLMEval to false in the OdhDashboardConfig CR

Bug 1: SSL failure on HTTPS endpoints

  1. Create an evaluation run from the RHOAI Dashboard UI targeting the deployed model. The UI generates an LMEvalJob CR with base_url: https://<model>.<namespace>.apps.<cluster-domain>/v1/completions.

  2. The pod starts, downloads the dataset, but fails at the first API request:

    ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
      self-signed certificate in certificate chain (_ssl.c:1004)
    
  3. The job enters Complete state with reason: Failed.

The RHOAI Dashboard UI provides no mechanism to set verify_certificate or mount custom CA bundles, so there is no workaround from the UI.

Minimal reproducer (CLI):

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: test-ssl
spec:
  allowOnline: true
  model: local-completions
  modelArgs:
  - name: model
    value: tinyllama
  - name: base_url
    value: https://<model>.<namespace>.apps.<cluster-domain>/v1/completions
  taskList:
    taskNames:
    - arc_easy

Any LMEvalJob with an https:// value in base_url triggers the issue. The operator does not inject cluster CA certificates, so the pod has no way to verify the server's certificate.

Note: Our envtest integration tests use model: hf with a synthetic base_url because they only verify the operator's pod-spec injection (volumes, mounts, env vars) — lm-eval never actually runs in envtest.
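The trigger condition can be sketched as follows; the modelArg type and the function signature are assumptions, and the PR's hasHTTPSBaseURL may differ in detail:

```go
package main

import (
	"fmt"
	"net/url"
)

type modelArg struct{ Name, Value string }

// hasHTTPSBaseURL reports whether any modelArg named base_url parses
// with an https scheme, which is the condition that triggers CA
// injection.
func hasHTTPSBaseURL(args []modelArg) bool {
	for _, a := range args {
		if a.Name != "base_url" {
			continue
		}
		if u, err := url.Parse(a.Value); err == nil && u.Scheme == "https" {
			return true
		}
	}
	return false
}

func main() {
	args := []modelArg{
		{"model", "tinyllama"},
		{"base_url", "https://model.ns.apps.example.com/v1/completions"},
	}
	fmt.Println(hasHTTPSBaseURL(args))                                        // true
	fmt.Println(hasHTTPSBaseURL([]modelArg{{"base_url", "http://model/v1"}})) // false
}
```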

Bug 2: Completed job cannot be re-run after spec edit

Continuing from Bug 1, the user tries to fix the failed job by editing the spec:

  1. The job is in Complete state with reason: Failed after the SSL error.

  2. Edit the LMEvalJob spec (e.g. change a modelArg):

    oc edit lmevaljob test-ssl
    # Change the pretrained model value, or add verify_certificate
  3. Nothing happens — the job stays Complete and no new pod is created:

    $ oc get lmevaljob test-ssl
    NAME       STATE
    test-ssl   Complete
  4. Attempting to reset the status via oc patch has no effect (the status subresource is separate):

    $ oc patch lmevaljob test-ssl --type=merge -p '{"status": {"state": "New"}}'
    lmevaljob.trustyai.opendatahub.io/test-ssl patched (no change)
  5. Deleting the failed pod does not trigger a new pod creation either.

The only workaround is to delete the entire LMEvalJob and create a new one.

Note: RHOAI Dashboard UI does not expose TLS options (not addressed here)

The RHOAI Dashboard UI creates LMEvalJob CRs with no mechanism for users to set verify_certificate, mount custom CA bundles, or reference cert secrets. This means jobs created from the UI will always fail against external HTTPS endpoints unless the operator handles CA injection automatically.

This PR addresses the common case: when odh-trusted-ca-bundle exists in the namespace (injected by the RHOAI operator), the cluster CA is mounted transparently and UI-created jobs with HTTPS endpoints work without any user intervention.

However, edge cases remain where automatic injection is not sufficient:

  • Clusters where the RHOAI operator does not inject odh-trusted-ca-bundle into the namespace
  • Models behind custom (non-cluster) CAs that require a user-supplied certificate path
  • Users who need to explicitly disable verification (verify_certificate: "False") for testing

These cases still require CLI intervention (oc edit / oc patch on the LMEvalJob spec). Exposing TLS options in the Dashboard UI is a separate concern tracked outside this PR.

Changed files

File | Change
controllers/lmes/constants.go | New constants for CA bundle ConfigMap names, volume name, mount path, merged key, and annotation key
controllers/lmes/lmevaljob_controller.go | CreatePod signature extended with CA params; findAndMergeCABundle merges multiple CA sources into a per-job ConfigMap; resolveCABundle helper with proper error handling (sentinel for no-data vs. API errors); recordScheduledGeneration helper; re-run detection with pod cleanup in Reconcile; metrics label cardinality fix; RBAC marker updated to create;update for ConfigMaps
controllers/lmes/lmevaljob_controller_test.go | All direct CreatePod calls updated to pass nil, "" for the new CA bundle params; unit tests for hasHTTPSBaseURL, hasExplicitVerifyCertificate, getLastScheduledGeneration, CreatePodWithCABundle, CreatePodWithoutCABundle
controllers/lmes/lmevaljob_controller_suite_test.go | Integration tests for CA bundle injection (HTTPS, HTTP negative, verify_certificate negative, merged multi-source), re-run logic (spec change resets, unchanged stays Complete), and re-run + CA integration
controllers/job_mgr/job_mgr_controller.go | CreatePod call updated for new CA bundle parameters (passes nil, ""; tracked in #730)
config/components/lmes/rbac/manager-rbac.yaml | Generated RBAC adds create;update verbs for ConfigMaps

Testing

Unit / integration tests (envtest)

Tests in controllers/lmes/lmevaljob_controller_suite_test.go run against a controller-runtime envtest environment with real CRDs:

Test | What it verifies
CA injection — HTTPS | Pod created for a job with base_url: https://... gets a merged ConfigMap containing the CA data, with the volume, mount at /etc/ssl/certs/odh-ca-bundle.crt, and REQUESTS_CA_BUNDLE env var
CA injection — merged multi-source | When both odh-trusted-ca-bundle and openshift-service-ca.crt exist, the merged ConfigMap contains data from both sources
CA injection — HTTP (negative) | Pod created for a job with base_url: http://... has no CA volume or env var
CA injection — verify_certificate set (negative) | Pod created for a job with base_url: https://... AND verify_certificate: false has no CA volume or env var
CA injection — missing ConfigMap (negative) | The "Simple LMEvalJob" test creates a job without any CA ConfigMaps in the namespace; the job proceeds normally without CA injection
Re-run — spec change resets job | A completed job whose spec is edited (bumping metadata.Generation) is reset to New and a new pod is created; the last-scheduled-generation annotation is set
Re-run — unchanged spec stays Complete | A completed job whose spec is not edited remains Complete (no spurious re-run)
Re-run + CA integration | An HTTPS job that completes and is re-run after spec change retains its merged CA ConfigMap with correct data

Manual integration tests (live OpenShift cluster)

Operator run locally (go run ./cmd/main.go --enable-services LMES) against a live OpenShift cluster. Both bugs were first reproduced on main, then verified fixed on the PR branch. Each test created an LMEvalJob CR and inspected the resulting pod spec and job status.

An odh-trusted-ca-bundle ConfigMap was pre-created in the test namespace to simulate the RHOAI-managed CA bundle.

Test | From main | From PR branch
HTTPS job pod has CA volume, mount, and REQUESTS_CA_BUNDLE | Not present | Present
HTTP job pod has no CA injection | N/A | Not present
HTTPS + explicit verify_certificate has no CA injection | N/A | Not present
Completed job resets to New after spec edit | Stays Complete | Resets → new pod
Completed job stays Complete when spec is unchanged | Stays Complete | Stays Complete
last-scheduled-generation annotation set on schedule | N/A | Set to 1

All tests passed on both envtest and live cluster.

CI status

Check Status Notes
build (unit + integration tests) ✅ Pass
deploy ✅ Pass
ci/prow/images ✅ Pass
lintAllTheThings ✅ Pass
Gosec Security Scanner ✅ Pass
ci/prow/trustyai-service-operator-e2e ❌ Fail Pre-existing infrastructure failure. Failing on all recent PRs including merged #731 and #726. Not related to this PR's changes.
Trivy Security Scan ❌ Fail Pre-existing dependency vulnerabilities. go.opentelemetry.io/otel (CVE-2026-29181) and urllib3 in tests/Pipfile.lock (CVE-2026-44431, CVE-2026-44432). Failing on all PRs including main. Not related to this PR's changes.

🤖 Generated with Claude Code

…n on spec change

Bug 1 (RHOAIENG-60487): LMEval jobs created against models with external
HTTPS routes fail with SSLCertVerificationError because the lm-eval pod
has no cluster CA in its trust store. Fix: when any modelArg base_url uses
https:// and verify_certificate is not already set, look up the standard
RHOAI ConfigMap odh-trusted-ca-bundle in the job namespace, mount it into
the pod, and set REQUESTS_CA_BUNDLE so Python's requests library picks it
up automatically. The lookup is best-effort; missing ConfigMap is logged
and the job proceeds unchanged.

Bug 2 (RHOAIENG-60487): Once a job reaches CompleteJobState (including
Failed reason) it is impossible to re-run it by editing the spec, because
the operator ignores spec changes on completed jobs. Fix: record
metadata.Generation in an annotation (trustyai.opendatahub.io/
last-scheduled-generation) whenever a pod is created. On each reconcile,
if a completed job's current generation exceeds the stored value the status
is reset to New and the job re-runs with the updated configuration. The
lastGen > 0 guard means jobs that predate this change are never
accidentally reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds cluster CA bundle auto-injection into lm-eval pods for HTTPS model endpoints (when verify_certificate is unset) and enables re-execution of completed LMEvalJobs when their spec changes by recording and comparing scheduled generation annotations.

Changes

CA Bundle Injection and Job Rescheduling

Layer / File(s) Summary
Constants for CA Bundle and Generation Tracking
controllers/lmes/constants.go
Adds exported constants for trusted/service CA ConfigMap names, merged CA key/suffix, CABundleVolumeName, CABundleMountPath, and LastScheduledGenerationAnnotation.
Imports for ConfigMap errors
controllers/lmes/lmevaljob_controller.go
Adds k8s.io/apimachinery/pkg/api/errors import used for ConfigMap-not-found detection in CA merge logic.
Helper functions (URL, verification, generation, CA merge)
controllers/lmes/lmevaljob_controller.go
Implements hasHTTPSBaseURL, hasExplicitVerifyCertificate, getLastScheduledGeneration, and CA merge/find helpers that create-or-update a per-job merged ConfigMap owned by the job.
Reconcile: completion reset on spec edit
controllers/lmes/lmevaljob_controller.go
When a Complete job's Generation exceeds the recorded annotation, deletes the completed pod and resets completion-related job.Status fields to allow re-run.
Job creation metrics label initialization
controllers/lmes/lmevaljob_controller.go
Ensures model_name and base_url labels are always initialized to empty strings before extracting model args.
Initial scheduling: CA bundle lookup and annotate generation
controllers/lmes/lmevaljob_controller.go
handleNewCR detects HTTPS base_url without explicit verify_certificate, merges CA data into a per-job ConfigMap, records job.Generation to LastScheduledGenerationAnnotation, and passes CA inputs to CreatePod.
Resume scheduling: CA bundle lookup and annotate generation
controllers/lmes/lmevaljob_controller.go
handleResume applies the same HTTPS detection and CA merge, updates the annotation if needed, and passes CA inputs to CreatePod.
CreatePod signature update
controllers/lmes/lmevaljob_controller.go
Updates CreatePod signature to accept caBundle *corev1.ConfigMap and caBundleKey string.
CreatePod CA mount and env
controllers/lmes/lmevaljob_controller.go
When CA bundle inputs are provided, mounts the ConfigMap (single-file subPath) at CABundleMountPath and sets REQUESTS_CA_BUNDLE.
Job manager callsite update
controllers/job_mgr/job_mgr_controller.go
PodSets() now calls lmes.CreatePod with nil for the new CA bundle parameter slot.
Test updates and new CA/generation tests
controllers/lmes/lmevaljob_controller_test.go, controllers/lmes/lmevaljob_controller_suite_test.go
Updates all CreatePod test call sites to the new signature and adds unit/integration tests for HTTPS detection, explicit verify_certificate, last-scheduled-generation parsing, CA bundle mount/env behavior, CA merge contents, and re-run after spec change.

Sequence Diagram(s)

sequenceDiagram
  participant Controller
  participant ConfigMapAPI as "k8s API / ConfigMap"
  participant CreatePod
  participant PodAPI as "k8s API / Pod"

  Controller->>ConfigMapAPI: read odh-trusted/service-ca ConfigMaps
  ConfigMapAPI-->>Controller: PEM data (if present)
  Controller->>ConfigMapAPI: create-or-update merged per-job ConfigMap (merged PEM)
  ConfigMapAPI-->>Controller: merged ConfigMap
  Controller->>CreatePod: CreatePod(..., caBundle, caBundleKey)
  CreatePod->>PodAPI: create Pod with ConfigMap volume (subPath) and env REQUESTS_CA_BUNDLE
  PodAPI-->>CreatePod: Pod created

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • #730 — PodSets() still calls CreatePod with nil/"" causing Kueue pod spec to omit injected CA bundle; this PR adds CA bundle parameters and updates PodSets() callsite, addressing that mismatch.

Suggested Labels

lgtm

Suggested Reviewers

  • mariusdanciu
  • ruivieira

Poem

🐰 I stitched PEMs into a single map,

mounted it snug within the app,
When specs jump the job will wake,
HTTPS calls now pass the handshake,
Hooray — the pods all hop and clap.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely describes the two main changes: auto-injecting cluster CA bundles for HTTPS endpoints and allowing job re-runs after spec changes. Both changes are directly supported by the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@SudipSinha SudipSinha self-assigned this May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller_test.go (1)

191-191: ⚡ Quick win

Add tests for the new CA-injection and generation-tracking paths.

These call-site updates keep the suite compiling, but there is still no coverage for the new behavior: HTTPS base_url + implicit CA injection, explicit verify_certificate bypass, missing ConfigMap fallback, and resume-after-spec-change generation tracking. The resume bug above would slip through as-is.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/lmes/lmevaljob_controller_test.go` at line 191, Add unit tests in
lmevaljob_controller_test.go around the CreatePod call to cover the new
CA-injection and generation-tracking code paths: add cases where
job.Spec.BaseURL is HTTPS and VerifyCertificate is omitted to assert implicit CA
injection occurs, a case with VerifyCertificate explicitly set to false to
assert certificate verification is bypassed, a case simulating a missing
ConfigMap to exercise the fallback logic, and a resume scenario where job
generation changes to assert the controller tracks and resumes correctly after
spec changes; reference the CreatePod(...) call and NewDefaultPermissionConfig()
usage to locate where to insert these table-driven subtests and assert the
expected Pod spec/annotations and controller state transitions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 192-213: In the block that detects a spec change for a completed
job (the if where job.Status.State == lmesv1alpha1.CompleteJobState and lastGen
:= getLastScheduledGeneration(job) > 0 && job.Generation > lastGen), delete the
completed Pod (lookup by job.Status.PodName and namespace job.Namespace) before
resetting the job status and calling r.Status().Update; ignore NotFound errors
and handle AlreadyExists/other errors appropriately (return error on unexpected
failures), then clear job.Status.PodName and proceed with resetting fields and
the status update so a new Pod can be created without an AlreadyExists conflict.
- Around line 825-835: When resuming runs in handleResume the controller
recreates pods via CreatePod(Options, job, permConfig, caBundle, caBundleKey,
log) but does not update the LastScheduledGenerationAnnotation, causing a stale
scheduled-generation and triggering an unnecessary second run; modify
handleResume to record the current Job.Generation into the job's annotations
under LastScheduledGenerationAnnotation (the same annotation name used during
initial scheduling) and persist the update (e.g., via the existing client
update/patch helper) immediately after recreating the pod so the resumed run and
the annotation stay in sync.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c20923c5-7e91-4728-bb4b-9713b8d82b0b

📥 Commits

Reviewing files that changed from the base of the PR and between d4d151a and 02759ab.

📒 Files selected for processing (3)
  • controllers/lmes/constants.go
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_test.go

SudipSinha and others added 2 commits May 11, 2026 12:18
When a job is suspended then resumed, handleResume now records the
current spec generation in the LastScheduledGenerationAnnotation before
creating the pod. Without this, editing the spec while a job is
suspended and then resuming it would leave a stale annotation, causing
Reconcile to detect a false generation change and trigger an extra
re-run after the resumed job completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and re-run logic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller_suite_test.go (1)

449-459: ⚡ Quick win

Assert the exact reset contract, not just “not Complete”.

Line 455 only verifies the state changed away from Complete; this can pass even if reset behavior regresses. Assert the expected post-reset state (New) and key cleared fields from the contract.

Suggested tightening
-        if job.Status.State == lmesv1alpha1.CompleteJobState {
-            return fmt.Errorf("job is still Complete, waiting for re-run reset")
-        }
-        return nil
+        if job.Status.State != lmesv1alpha1.NewJobState {
+            return fmt.Errorf("expected New after reset, got %s", job.Status.State)
+        }
+        if job.Status.PodName != "" {
+            return fmt.Errorf("expected PodName to be cleared, got %q", job.Status.PodName)
+        }
+        return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/lmes/lmevaljob_controller_suite_test.go` around lines 449 - 459,
The test currently only checks that job.Status.State is not
lmesv1alpha1.CompleteJobState after the spec change; instead assert the exact
reset contract by waiting for job.Status.State to equal the expected reset state
(e.g., lmesv1alpha1.NewJobState) using the same WaitFor helper, and additionally
assert key cleared fields on the job Status (for example ensure job.Status.RunID
is empty/nil and job.Status.StartTime/FinishTime are zero or unset) to guarantee
the reset cleared runtime metadata; locate the check around WaitFor, job and
job.Status.State to implement these assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 506-511: The closure passed to Consistently is ignoring errors
from k8sClient.Get which can hide transient failures; change it to check the Get
result (e.g., call k8sClient.Get and assert Expect(err).Should(Succeed()) inside
the closure or return a non-matching sentinel state on error) so failures don't
get masked—update the closure around k8sClient.Get(...) that returns
job.Status.State (used with Consistently) to either fail the test on Get error
or convert the error into a distinct JobState that will not equal
lmesv1alpha1.CompleteJobState.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dcd337eb-5b0c-4f3e-8d99-9d5ca0257b03

📥 Commits

Reviewing files that changed from the base of the PR and between 02759ab and bb1e0a6.

📒 Files selected for processing (3)
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
  • controllers/lmes/lmevaljob_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/lmes/lmevaljob_controller.go

The Consistently assertion was silently discarding k8sClient.Get errors,
which could mask test infrastructure failures as false passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
The CreatePod signature was extended with caBundle and caBundleKey
parameters but this call site was not updated, causing a build failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/job_mgr/job_mgr_controller.go`:
- Line 123: The Pod spec used for Kueue scheduling omits CA bundle volumes
because CreatePod is being called with nil/"" for caBundle and caBundleKey in
job_mgr_controller.go; update the call to lmes.CreatePod(lmes.Options,
&job.LMEvalJob, permConfig, ...) to pass the proper ConfigMap reference and key
for the CA bundle (non-nil caBundle and non-empty caBundleKey) so PodSets()
includes the volumes/mounts and REQUESTS_CA_BUNDLE env var, or alternatively
adjust the PodSets() construction path to inject the same CA bundle
volumes/mounts/env when CA will be added at actual pod creation; locate the call
site of CreatePod and supply the same ConfigMap name/key used elsewhere (or
refactor PodSets/CreatePod to accept a shared caBundle value) so scheduling spec
matches the real pod.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28a67dff-aba6-40ca-9535-90913674190e

📥 Commits

Reviewing files that changed from the base of the PR and between bb1e0a6 and 5c843c1.

📒 Files selected for processing (2)
  • controllers/job_mgr/job_mgr_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/lmes/lmevaljob_controller_suite_test.go

Comment thread controllers/job_mgr/job_mgr_controller.go
The metrics code registers a CounterVec with label names from the first
LMEvalJob it sees. If a subsequent job has a different set of modelArgs
(e.g. one has base_url and another doesn't), the label count mismatches
and Prometheus panics with "inconsistent label cardinality".

Fix by always initializing both model_name and base_url labels to empty
strings before populating from modelArgs, ensuring every job produces
the same 6-label set.

This is a pre-existing bug on main that was not surfaced because there
was only one LMES test case. The new CA bundle tests in this PR create
jobs with varying modelArgs, exposing the panic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
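The normalization this commit describes can be sketched with a stdlib-only helper (`labelValues` is a hypothetical name standing in for the operator's actual metrics code; the real code registers a Prometheus CounterVec, which this sketch deliberately omits):

```go
package main

import "fmt"

// labelValues builds a fixed-cardinality label set from a job's modelArgs.
// model_name and base_url are always present (defaulting to ""), so every
// job emits the same label names and Prometheus never observes an
// "inconsistent label cardinality".
func labelValues(modelArgs map[string]string) map[string]string {
	labels := map[string]string{
		"model_name": "",
		"base_url":   "",
	}
	for k, v := range modelArgs {
		if _, known := labels[k]; known {
			labels[k] = v
		}
	}
	return labels
}

func main() {
	// A job that sets base_url but not model_name still yields both labels.
	fmt.Println(labelValues(map[string]string{"base_url": "https://llm.example"}))
}
```

Initializing every label up front, rather than deriving the set from whichever modelArgs a job happens to carry, is what keeps the label count identical across jobs.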
@saichandrapandraju
Contributor

Hi @SudipSinha ,

I tested on my cluster with quay.io/rh-ee-spandraj/tas-operator:pr-729.

  • The CA injection mechanism works — the volume, volume mount, and REQUESTS_CA_BUNDLE env var are all correctly added to the pod. However, the job still fails with SSLCertVerificationError when the model endpoint is a cluster-internal service using OpenShift service-serving certificates (e.g. https://llm-predictor.evals.svc.cluster.local:8443).

  • Root cause: The service cert is signed by the OpenShift service-serving CA, which lives in the openshift-service-ca.crt ConfigMap (key: service-ca.crt), not in odh-trusted-ca-bundle. In my namespace, odh-trusted-ca-bundle has:
    - ca-bundle.crt (224KB) — public/system CAs only, no service-serving CA
    - odh-ca-bundle.crt — empty (0 bytes)

So findCABundle picks ca-bundle.crt, which doesn't contain the CA that signed the service cert. Additionally, REQUESTS_CA_BUNDLE replaces Python's default trust store rather than appending to it. If we fix the CA lookup to use the service-serving CA, we'd lose trust in public CAs.

Suggestion: for cluster-internal svc.cluster.local HTTPS endpoints, also check the openshift-service-ca.crt or service-signing-ca ConfigMaps, and combine the service CA with the system bundle into a merged file so that both public and cluster-internal CAs are trusted.

@ruivieira ruivieira self-requested a review May 13, 2026 02:36
@openshift-ci

openshift-ci Bot commented May 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ruivieira

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SudipSinha SudipSinha added the kind/bug Something isn't working label May 13, 2026
@openshift-ci openshift-ci Bot removed the lgtm label May 13, 2026
@openshift-ci

openshift-ci Bot commented May 13, 2026

New changes are detected. LGTM label has been removed.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller.go (1)

1875-1888: ⚡ Quick win

Use controllerutil.SetControllerReference() for the merged ConfigMap.

This manually builds the owner reference for the new ConfigMap instead of using the controller-runtime helper the repo standardizes on. Switching to controllerutil.SetControllerReference() keeps ownership wiring consistent and avoids drifting fields across controller-created resources.

Suggested refactor
 		mergedCM = &corev1.ConfigMap{
 			ObjectMeta: v1.ObjectMeta{
 				Name:      mergedCMName,
 				Namespace: job.Namespace,
-				OwnerReferences: []v1.OwnerReference{
-					{
-						APIVersion: job.APIVersion,
-						Kind:       job.Kind,
-						Name:       job.Name,
-						Controller: &ownerRefController,
-						UID:        job.UID,
-					},
-				},
 			},
 			Data: map[string]string{
 				MergedCABundleKey: merged,
 			},
 		}
+		if err := controllerutil.SetControllerReference(job, mergedCM, r.Scheme); err != nil {
+			return nil, "", fmt.Errorf("failed to set owner reference on merged CA ConfigMap: %w", err)
+		}
 		if err := r.Create(ctx, mergedCM); err != nil {
 			return nil, "", fmt.Errorf("failed to create merged CA ConfigMap: %w", err)
 		}

As per coding guidelines, "Use owner references via controllerutil.SetControllerReference() for resource cleanup and garbage collection".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/lmes/lmevaljob_controller.go` around lines 1875 - 1888, The code
manually constructs OwnerReferences for the merged ConfigMap (mergedCM) using
ownerRefController and job fields; replace this manual wiring by creating the
ConfigMap without OwnerReferences and then call
controllerutil.SetControllerReference(job, mergedCM, r.Scheme) (using the
reconciler's scheme) to set ownership; ensure you import controllerutil and
handle the error returned by SetControllerReference before creating/updating the
ConfigMap so ownership is set consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 567-579: The current reconcile branch around
hasHTTPSBaseURL/hasExplicitVerifyCertificate calls findAndMergeCABundle and
treats any error as "no CA bundle", which hides real Create/Update failures;
change the logic in the block that calls r.findAndMergeCABundle (and the
analogous code in handleResume) to inspect the returned error: if the error is
the specific "no source CA data" sentinel (or error type/value that indicates no
data) then log and proceed as now, but for any other error return that error (or
requeue by returning ctrl.Result{}, err) so reconcile will retry; keep assigning
caBundle/caBundleKey only on nil error. Reference findAndMergeCABundle,
hasHTTPSBaseURL, hasExplicitVerifyCertificate, and handleResume when locating
the code to change.
- Around line 1894-1904: The controller now creates and updates a per-job
ConfigMap (see mergedCM and the code paths that call r.Create and r.Update to
write MergedCABundleKey), but the kubebuilder RBAC markers only grant
get/watch/list for ConfigMaps; add a kubebuilder RBAC marker to permit create,
update (and patch/delete if you want full lifecycle) on ConfigMaps so the
generated ClusterRole allows the controller to write the merged CA bundle
in-cluster (e.g. add a line like
//+kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
next to the existing RBAC markers in the lmevaljob controller file).

---

Nitpick comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 1875-1888: The code manually constructs OwnerReferences for the
merged ConfigMap (mergedCM) using ownerRefController and job fields; replace
this manual wiring by creating the ConfigMap without OwnerReferences and then
call controllerutil.SetControllerReference(job, mergedCM, r.Scheme) (using the
reconciler's scheme) to set ownership; ensure you import controllerutil and
handle the error returned by SetControllerReference before creating/updating the
ConfigMap so ownership is set consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: efa2e1cf-2c90-4571-a7d3-1b9b8c2a8f9f

📥 Commits

Reviewing files that changed from the base of the PR and between be2e76c and 554d56a.

📒 Files selected for processing (3)
  • controllers/lmes/constants.go
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
✅ Files skipped from review due to trivial changes (1)
  • controllers/lmes/constants.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/lmes/lmevaljob_controller_suite_test.go

@SudipSinha SudipSinha force-pushed the fix/lmeval-https-ssl-rerun branch 2 times, most recently from 89f7544 to c2b8257 on May 13, 2026 10:41
The previous CA injection mounted a single key from odh-trusted-ca-bundle,
which contains only public/system CAs. Cluster-internal services using
OpenShift service-serving certificates (*.svc.cluster.local) are signed by
a different CA in the openshift-service-ca.crt ConfigMap, so the pod still
got SSLCertVerificationError.

Additionally, REQUESTS_CA_BUNDLE replaces Python's default trust store
rather than appending, so mounting only one CA source loses trust in all
others.

Fix: replace findCABundle with findAndMergeCABundle, which collects PEM
data from both odh-trusted-ca-bundle and openshift-service-ca.crt
(best-effort, each skipped if absent), concatenates them, and creates a
per-job merged ConfigMap (<jobName>-ca-bundle) with an owner reference
for automatic GC. The pod mounts this merged ConfigMap so
REQUESTS_CA_BUNDLE contains all relevant CAs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
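The collection step this commit describes can be sketched stdlib-only (`collectCABundle` and the map standing in for the API reader are illustrative names, not the operator's actual code; the real reconciler reads ConfigMaps through the Kubernetes client and then writes the merged data into the per-job ConfigMap):

```go
package main

import (
	"fmt"
	"strings"
)

// caSource names a ConfigMap key to read PEM data from.
type caSource struct{ configMap, key string }

// collectCABundle concatenates PEM data from the well-known sources,
// best-effort: absent ConfigMaps and empty keys are skipped. The
// cluster map ("namespace/name" -> data) stands in for the API reader.
func collectCABundle(cluster map[string]map[string]string, namespace string) string {
	sources := []caSource{
		{"odh-trusted-ca-bundle", "ca-bundle.crt"},
		{"odh-trusted-ca-bundle", "odh-ca-bundle.crt"},
		{"openshift-service-ca.crt", "service-ca.crt"},
	}
	var merged strings.Builder
	for _, s := range sources {
		cm, ok := cluster[namespace+"/"+s.configMap]
		if !ok {
			continue // ConfigMap absent: skip, don't fail
		}
		if pem := strings.TrimSpace(cm[s.key]); pem != "" {
			merged.WriteString(pem)
			merged.WriteString("\n") // keep adjacent certificates from fusing
		}
	}
	return merged.String()
}

func main() {
	cluster := map[string]map[string]string{
		"evals/odh-trusted-ca-bundle":    {"ca-bundle.crt": "PUBLIC-CAS", "odh-ca-bundle.crt": ""},
		"evals/openshift-service-ca.crt": {"service-ca.crt": "SERVICE-CA"},
	}
	// Both public and service-serving CA data end up in one bundle.
	fmt.Println(collectCABundle(cluster, "evals"))
}
```

Because REQUESTS_CA_BUNDLE replaces Python's default trust store rather than appending to it, concatenating all sources into one file, instead of picking a single key, is what keeps both public and cluster-internal endpoints trusted at once.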
@SudipSinha SudipSinha force-pushed the fix/lmeval-https-ssl-rerun branch from c2b8257 to 28f145c on May 13, 2026 10:50
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 337-340: The Gomega assertion uses a printf-style format string
which Gomega won't process; change the Expect call that checks
apierrors.IsAlreadyExists(err) after creating svcCAConfigMap to build the
message with fmt.Sprintf (e.g. fmt.Sprintf("unexpected error creating service CA
ConfigMap: %v", err)) and pass that single string as the failure message, or
alternatively assert on err directly (e.g.
Expect(err).ToNot(HaveOccurred()))—update the
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), ...) invocation accordingly.
- Around line 615-619: The Gomega assertion passes a printf-style format and
args to To(), which won't process format verbs; change the assertion so the
message is a single formatted string — e.g., when checking
apierrors.IsAlreadyExists(err) after k8sClient.Create(ctx, caConfigMap), wrap
the message with fmt.Sprintf (or build the string) and pass that single string
to Expect(...).To(BeTrue(), <formatted message>), referencing the existing call
sites: k8sClient.Create, caConfigMap, apierrors.IsAlreadyExists, Expect and
BeTrue.
- Around line 230-233: The Gomega assertion uses a format verb in the message
which won't be expanded; update the Expect call that wraps k8sClient.Create and
apierrors.IsAlreadyExists to pass a fully formatted string (e.g. use
fmt.Sprintf("unexpected error creating CA ConfigMap: %v", err)) as the second
argument to To, and add an import for fmt if it's not already present; target
the Expect(...) invocation that calls apierrors.IsAlreadyExists(err) and the
surrounding k8sClient.Create error branch.
- Around line 323-326: The Gomega assertion passes a format verb to To() which
doesn't process fmt verbs; change the call that checks the result of
k8sClient.Create for odhCAConfigMap from
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), "unexpected error creating
ODH CA ConfigMap: %v", err) to build the message beforehand (e.g. use
fmt.Sprintf) and pass the final string:
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), fmt.Sprintf("unexpected
error creating ODH CA ConfigMap: %v", err)); remember to add the fmt import to
the test file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 207e1fa1-39ea-4f0a-a500-53d9bf5778cd

📥 Commits

Reviewing files that changed from the base of the PR and between be2e76c and 28f145c.

📒 Files selected for processing (5)
  • config/components/lmes/rbac/manager-rbac.yaml
  • controllers/lmes/constants.go
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
  • controllers/lmes/lmevaljob_controller_test.go
✅ Files skipped from review due to trivial changes (2)
  • config/components/lmes/rbac/manager-rbac.yaml
  • controllers/lmes/constants.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_test.go

@openshift-ci

openshift-ci Bot commented May 13, 2026

@SudipSinha: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/trustyai-service-operator-e2e
Commit: 28f145c
Details: link
Required: true
Rerun command: /test trustyai-service-operator-e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@SudipSinha SudipSinha merged commit b50a816 into trustyai-explainability:main May 14, 2026
8 of 10 checks passed
@SudipSinha SudipSinha deleted the fix/lmeval-https-ssl-rerun branch May 14, 2026 07:54