
fix(lmeval): auto-inject cluster CA bundle for HTTPS endpoints; allow re-run after spec change (#729)

Merged
SudipSinha merged 7 commits into trustyai-explainability:main from SudipSinha:fix/lmeval-https-ssl-rerun
May 14, 2026
Conversation

@SudipSinha SudipSinha commented May 11, 2026

Jira: RHOAIENG-60487

Summary

Fixes two issues in the LMEval operator, plus a pre-existing metrics bug surfaced by the new tests.

SSL failure on HTTPS endpoints (external routes and cluster-internal services)

When a model is deployed with "Make model deployment available through an external route" enabled, OpenShift creates an HTTPS ingress signed by the cluster's self-signed CA. Similarly, cluster-internal services (*.svc.cluster.local) use certificates signed by the OpenShift service-serving CA. The lm-eval pod knows about neither CA, so Python's requests library rejects the connection:

ssl.SSLCertVerificationError: certificate verify failed: self-signed certificate in certificate chain

Because REQUESTS_CA_BUNDLE replaces Python's default trust store entirely, mounting a single CA source would drop trust in every other CA.

Fix: in handleNewCR and handleResume, if any modelArg named base_url uses an https:// scheme and verify_certificate is not already set, the operator collects CA certificates from multiple well-known cluster sources:

  • odh-trusted-ca-bundle: ca-bundle.crt and odh-ca-bundle.crt keys (public/system CAs, injected by the RHOAI operator)
  • openshift-service-ca.crt: service-ca.crt key (service-serving CA for *.svc.cluster.local)

All found PEM data is concatenated into a per-job merged ConfigMap (<jobName>-ca-bundle) with an owner reference to the LMEvalJob for automatic GC. The merged ConfigMap is mounted at /etc/ssl/certs/odh-ca-bundle.crt and REQUESTS_CA_BUNDLE is set to that path. Each source is best-effort — missing ConfigMaps are skipped, but real API errors (Create/Update failures on the merged ConfigMap) are propagated so the reconciler retries.
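The collect-and-merge step can be condensed into a pure function. The ConfigMap and key names below come from the list above, but the helper itself (its name, parameters, and the map-based stand-in for the k8s API) is an illustrative sketch, not the operator's actual findAndMergeCABundle:

```go
package main

import (
	"fmt"
	"strings"
)

// caSource names a ConfigMap and the keys that may hold PEM data.
type caSource struct {
	configMap string
	keys      []string
}

// mergeCABundles concatenates every non-empty PEM payload found across
// the sources. Missing ConfigMaps and empty keys are skipped, matching
// the best-effort behavior described above.
func mergeCABundles(cluster map[string]map[string]string, sources []caSource) string {
	var parts []string
	for _, src := range sources {
		data, ok := cluster[src.configMap]
		if !ok {
			continue // ConfigMap absent in the namespace: skip, don't fail
		}
		for _, key := range src.keys {
			if pem := strings.TrimSpace(data[key]); pem != "" {
				parts = append(parts, pem)
			}
		}
	}
	return strings.Join(parts, "\n")
}

func main() {
	cluster := map[string]map[string]string{
		"odh-trusted-ca-bundle":    {"ca-bundle.crt": "PEM-A", "odh-ca-bundle.crt": ""},
		"openshift-service-ca.crt": {"service-ca.crt": "PEM-B"},
	}
	sources := []caSource{
		{"odh-trusted-ca-bundle", []string{"ca-bundle.crt", "odh-ca-bundle.crt"}},
		{"openshift-service-ca.crt", []string{"service-ca.crt"}},
	}
	fmt.Println(mergeCABundles(cluster, sources)) // PEM-A and PEM-B concatenated
}
```

In the real controller the loop reads ConfigMaps through the client and distinguishes NotFound (skip) from other API errors (propagate), which the map lookup above only approximates.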

Completed job cannot be re-run after spec edit

Once a job reaches CompleteJobState (including the Failed reason), editing the spec has no effect. The operator does not watch for spec changes on completed jobs, and because status is a separate subresource, there is no supported way to reset it with oc patch.

Fix: handleNewCR and handleResume both write the current metadata.Generation into an annotation (trustyai.opendatahub.io/last-scheduled-generation) each time a pod is created. In Reconcile, a new guard checks completed jobs: if the current generation exceeds the stored value, the completed pod is deleted and the status is fully reset to New (clearing lastScheduleTime, completeTime, podName, reason, message, progressBars, results); the next reconcile cycle re-runs the job. The lastGen > 0 guard means jobs created before this change are never accidentally reset. Recording the annotation in handleResume also prevents a spurious extra re-run when the spec is edited while a job is suspended and then resumed.
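The guard can be sketched as a pure function; the function name and the "Complete" state literal are assumptions for illustration (the real check lives inside Reconcile and works on the typed job object):

```go
package main

import (
	"fmt"
	"strconv"
)

const lastGenAnnotation = "trustyai.opendatahub.io/last-scheduled-generation"

// shouldResetCompletedJob resets a completed job only when an annotation
// was recorded (lastGen > 0) and the spec has changed since scheduling
// (generation > lastGen).
func shouldResetCompletedJob(state string, annotations map[string]string, generation int64) bool {
	if state != "Complete" {
		return false
	}
	lastGen, err := strconv.ParseInt(annotations[lastGenAnnotation], 10, 64)
	if err != nil || lastGen <= 0 {
		return false // jobs created before this change carry no annotation and are never reset
	}
	return generation > lastGen
}

func main() {
	ann := map[string]string{lastGenAnnotation: "2"}
	fmt.Println(shouldResetCompletedJob("Complete", ann, 3)) // true: spec edited after scheduling
	fmt.Println(shouldResetCompletedJob("Complete", ann, 2)) // false: unchanged spec
	fmt.Println(shouldResetCompletedJob("Complete", nil, 5)) // false: pre-existing job, no annotation
}
```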

Pre-existing bug fix: metrics label cardinality panic

The createJobCreationMetrics function in the controller builds a Prometheus CounterVec with label names derived from the first LMEvalJob it processes. If a subsequent job has different modelArgs (e.g. one has base_url and another doesn't), the label count mismatches and Prometheus panics with inconsistent label cardinality. This bug exists on main but is not surfaced because there is only one LMES test case. Our new CA bundle tests create jobs with varying modelArgs, exposing the panic.

Fix: always initialize both model_name and base_url labels to empty strings before populating from modelArgs, ensuring every job produces the same 6-label set. This is committed separately from the main fixes.
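The fix reduces to seeding the label map before reading modelArgs, so two jobs with different args still emit the same label set and a CounterVec registered once never sees a label-count mismatch. A sketch (the mapping of the "model" arg onto the model_name label is an assumption of this example):

```go
package main

import "fmt"

type modelArg struct{ Name, Value string }

// jobLabels always initializes model_name and base_url to empty strings
// before populating from modelArgs, so the label set is identical for
// every job regardless of which args it carries.
func jobLabels(args []modelArg) map[string]string {
	labels := map[string]string{"model_name": "", "base_url": ""}
	for _, a := range args {
		switch a.Name {
		case "model":
			labels["model_name"] = a.Value
		case "base_url":
			labels["base_url"] = a.Value
		}
	}
	return labels
}

func main() {
	withURL := jobLabels([]modelArg{{"model", "tinyllama"}, {"base_url", "https://m/v1"}})
	withoutURL := jobLabels([]modelArg{{"model", "tinyllama"}})
	fmt.Println(len(withURL) == len(withoutURL)) // true: same label count either way
}
```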

Steps to reproduce

Setup

  1. Provision an OpenShift cluster with GPUs
  2. Install RHOAI 3.4ea2 (or later)
  3. Deploy an LLM (e.g. TinyLlama) with "Make model deployment available through an external route" enabled — this creates an HTTPS ingress signed by the cluster's self-signed CA
  4. Enable evaluations in the UI by setting disableLMEval to false in the OdhDashboardConfig CR

Bug 1: SSL failure on HTTPS endpoints

  1. Create an evaluation run from the RHOAI Dashboard UI targeting the deployed model. The UI generates an LMEvalJob CR with base_url: https://<model>.<namespace>.apps.<cluster-domain>/v1/completions.

  2. The pod starts, downloads the dataset, but fails at the first API request:

    ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
      self-signed certificate in certificate chain (_ssl.c:1004)
    
  3. The job enters Complete state with reason: Failed.

The RHOAI Dashboard UI provides no mechanism to set verify_certificate or mount custom CA bundles, so there is no workaround from the UI.

Minimal reproducer (CLI):

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: test-ssl
spec:
  allowOnline: true
  model: local-completions
  modelArgs:
  - name: model
    value: tinyllama
  - name: base_url
    value: https://<model>.<namespace>.apps.<cluster-domain>/v1/completions
  taskList:
    taskNames:
    - arc_easy

Any LMEvalJob with an https:// value in base_url triggers the issue. The operator does not inject cluster CA certificates, so the pod has no way to verify the server's certificate.

Note: Our envtest integration tests use model: hf with a synthetic base_url because they only verify the operator's pod-spec injection (volumes, mounts, env vars) — lm-eval never actually runs in envtest.
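The trigger condition can be sketched as follows; the modelArg type and the function signature are assumptions, and the PR's hasHTTPSBaseURL may differ in detail:

```go
package main

import (
	"fmt"
	"net/url"
)

type modelArg struct{ Name, Value string }

// hasHTTPSBaseURL reports whether any modelArg named base_url parses
// with an https scheme, which is the condition that triggers CA
// injection.
func hasHTTPSBaseURL(args []modelArg) bool {
	for _, a := range args {
		if a.Name != "base_url" {
			continue
		}
		if u, err := url.Parse(a.Value); err == nil && u.Scheme == "https" {
			return true
		}
	}
	return false
}

func main() {
	args := []modelArg{
		{"model", "tinyllama"},
		{"base_url", "https://model.ns.apps.example.com/v1/completions"},
	}
	fmt.Println(hasHTTPSBaseURL(args))                                        // true
	fmt.Println(hasHTTPSBaseURL([]modelArg{{"base_url", "http://model/v1"}})) // false
}
```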

Bug 2: Completed job cannot be re-run after spec edit

Continuing from Bug 1, the user tries to fix the failed job by editing the spec:

  1. The job is in Complete state with reason: Failed after the SSL error.

  2. Edit the LMEvalJob spec (e.g. change a modelArg):

    oc edit lmevaljob test-ssl
    # Change the pretrained model value, or add verify_certificate
  3. Nothing happens — the job stays Complete and no new pod is created:

    $ oc get lmevaljob test-ssl
    NAME       STATE
    test-ssl   Complete
  4. Attempting to reset the status via oc patch has no effect (the status subresource is separate):

    $ oc patch lmevaljob test-ssl --type=merge -p '{"status": {"state": "New"}}'
    lmevaljob.trustyai.opendatahub.io/test-ssl patched (no change)
  5. Deleting the failed pod does not trigger a new pod creation either.

The only workaround is to delete the entire LMEvalJob and create a new one.

Note: RHOAI Dashboard UI does not expose TLS options (not addressed here)

The RHOAI Dashboard UI creates LMEvalJob CRs with no mechanism for users to set verify_certificate, mount custom CA bundles, or reference cert secrets. This means jobs created from the UI will always fail against external HTTPS endpoints unless the operator handles CA injection automatically.

This PR addresses the common case: when odh-trusted-ca-bundle exists in the namespace (injected by the RHOAI operator), the cluster CA is mounted transparently and UI-created jobs with HTTPS endpoints work without any user intervention.

However, edge cases remain where automatic injection is not sufficient:

  • Clusters where the RHOAI operator does not inject odh-trusted-ca-bundle into the namespace
  • Models behind custom (non-cluster) CAs that require a user-supplied certificate path
  • Users who need to explicitly disable verification (verify_certificate: "False") for testing

These cases still require CLI intervention (oc edit / oc patch on the LMEvalJob spec). Exposing TLS options in the Dashboard UI is a separate concern tracked outside this PR.

Changed files

File | Change
controllers/lmes/constants.go | New constants for CA bundle ConfigMap names, volume name, mount path, merged key, and annotation key
controllers/lmes/lmevaljob_controller.go | CreatePod signature extended with CA params; findAndMergeCABundle merges multiple CA sources into a per-job ConfigMap; resolveCABundle helper with proper error handling (sentinel for no-data vs. API errors); recordScheduledGeneration helper; re-run detection with pod cleanup in Reconcile; metrics label cardinality fix; RBAC marker updated to create;update for ConfigMaps
controllers/lmes/lmevaljob_controller_test.go | All direct CreatePod calls updated to pass nil, "" for the new CA bundle params; unit tests for hasHTTPSBaseURL, hasExplicitVerifyCertificate, getLastScheduledGeneration, CreatePodWithCABundle, CreatePodWithoutCABundle
controllers/lmes/lmevaljob_controller_suite_test.go | Integration tests for CA bundle injection (HTTPS, HTTP negative, verify_certificate negative, merged multi-source), re-run logic (spec change resets, unchanged stays Complete), and re-run + CA integration
controllers/job_mgr/job_mgr_controller.go | CreatePod call updated for new CA bundle parameters (passes nil, ""; tracked in #730)
config/components/lmes/rbac/manager-rbac.yaml | Generated RBAC adds create;update verbs for ConfigMaps

Testing

Unit / integration tests (envtest)

Tests in controllers/lmes/lmevaljob_controller_suite_test.go run against a controller-runtime envtest environment with real CRDs:

Test | What it verifies
CA injection — HTTPS | Pod created for a job with base_url: https://... gets a merged ConfigMap containing the CA data, with the volume, mount at /etc/ssl/certs/odh-ca-bundle.crt, and REQUESTS_CA_BUNDLE env var
CA injection — merged multi-source | When both odh-trusted-ca-bundle and openshift-service-ca.crt exist, the merged ConfigMap contains data from both sources
CA injection — HTTP (negative) | Pod created for a job with base_url: http://... has no CA volume or env var
CA injection — verify_certificate set (negative) | Pod created for a job with base_url: https://... AND verify_certificate: false has no CA volume or env var
CA injection — missing ConfigMap (negative) | The "Simple LMEvalJob" test creates a job without any CA ConfigMaps in the namespace; the job proceeds normally without CA injection
Re-run — spec change resets job | A completed job whose spec is edited (bumping metadata.Generation) is reset to New and a new pod is created; the last-scheduled-generation annotation is set
Re-run — unchanged spec stays Complete | A completed job whose spec is not edited remains Complete (no spurious re-run)
Re-run + CA integration | An HTTPS job that completes and is re-run after spec change retains its merged CA ConfigMap with correct data

Manual integration tests (live OpenShift cluster)

Operator run locally (go run ./cmd/main.go --enable-services LMES) against a live OpenShift cluster. Both bugs were first reproduced on main, then verified fixed on the PR branch. Each test created an LMEvalJob CR and inspected the resulting pod spec and job status.

An odh-trusted-ca-bundle ConfigMap was pre-created in the test namespace to simulate the RHOAI-managed CA bundle.

Test | From main | From PR branch
HTTPS job pod has CA volume, mount, and REQUESTS_CA_BUNDLE | Not present | Present
HTTP job pod has no CA injection | N/A | Not present
HTTPS + explicit verify_certificate has no CA injection | N/A | Not present
Completed job resets to New after spec edit | Stays Complete | Resets → new pod
Completed job stays Complete when spec is unchanged | Stays Complete | Stays Complete
last-scheduled-generation annotation set on schedule | N/A | Set to 1

All tests passed on both envtest and live cluster.

CI status

Check Status Notes
build (unit + integration tests) ✅ Pass
deploy ✅ Pass
ci/prow/images ✅ Pass
lintAllTheThings ✅ Pass
Gosec Security Scanner ✅ Pass
ci/prow/trustyai-service-operator-e2e ❌ Fail Pre-existing infrastructure failure. Failing on all recent PRs including merged #731 and #726. Not related to this PR's changes.
Trivy Security Scan ❌ Fail Pre-existing dependency vulnerabilities. go.opentelemetry.io/otel (CVE-2026-29181) and urllib3 in tests/Pipfile.lock (CVE-2026-44431, CVE-2026-44432). Failing on all PRs including main. Not related to this PR's changes.

🤖 Generated with Claude Code

…n on spec change

Bug 1 (RHOAIENG-60487): LMEval jobs created against models with external
HTTPS routes fail with SSLCertVerificationError because the lm-eval pod
has no cluster CA in its trust store. Fix: when any modelArg base_url uses
https:// and verify_certificate is not already set, look up the standard
RHOAI ConfigMap odh-trusted-ca-bundle in the job namespace, mount it into
the pod, and set REQUESTS_CA_BUNDLE so Python's requests library picks it
up automatically. The lookup is best-effort; missing ConfigMap is logged
and the job proceeds unchanged.

Bug 2 (RHOAIENG-60487): Once a job reaches CompleteJobState (including
Failed reason) it is impossible to re-run it by editing the spec, because
the operator ignores spec changes on completed jobs. Fix: record
metadata.Generation in an annotation (trustyai.opendatahub.io/
last-scheduled-generation) whenever a pod is created. On each reconcile,
if a completed job's current generation exceeds the stored value the status
is reset to New and the job re-runs with the updated configuration. The
lastGen > 0 guard means jobs that predate this change are never
accidentally reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds cluster CA bundle auto-injection into lm-eval pods for HTTPS model endpoints (when verify_certificate is unset) and enables re-execution of completed LMEvalJobs when their spec changes by recording and comparing scheduled generation annotations.

Changes

CA Bundle Injection and Job Rescheduling

Layer / File(s) Summary
Constants for CA Bundle and Generation Tracking
controllers/lmes/constants.go
Adds exported constants for trusted/service CA ConfigMap names, merged CA key/suffix, CABundleVolumeName, CABundleMountPath, and LastScheduledGenerationAnnotation.
Imports for ConfigMap errors
controllers/lmes/lmevaljob_controller.go
Adds k8s.io/apimachinery/pkg/api/errors import used for ConfigMap-not-found detection in CA merge logic.
Helper functions (URL, verification, generation, CA merge)
controllers/lmes/lmevaljob_controller.go
Implements hasHTTPSBaseURL, hasExplicitVerifyCertificate, getLastScheduledGeneration, and CA merge/find helpers that create-or-update a per-job merged ConfigMap owned by the job.
Reconcile: completion reset on spec edit
controllers/lmes/lmevaljob_controller.go
When a Complete job's Generation exceeds the recorded annotation, deletes the completed pod and resets completion-related job.Status fields to allow re-run.
Job creation metrics label initialization
controllers/lmes/lmevaljob_controller.go
Ensures model_name and base_url labels are always initialized to empty strings before extracting model args.
Initial scheduling: CA bundle lookup and annotate generation
controllers/lmes/lmevaljob_controller.go
handleNewCR detects HTTPS base_url without explicit verify_certificate, merges CA data into a per-job ConfigMap, records job.Generation to LastScheduledGenerationAnnotation, and passes CA inputs to CreatePod.
Resume scheduling: CA bundle lookup and annotate generation
controllers/lmes/lmevaljob_controller.go
handleResume applies the same HTTPS detection and CA merge, updates the annotation if needed, and passes CA inputs to CreatePod.
CreatePod signature update
controllers/lmes/lmevaljob_controller.go
Updates CreatePod signature to accept caBundle *corev1.ConfigMap and caBundleKey string.
CreatePod CA mount and env
controllers/lmes/lmevaljob_controller.go
When CA bundle inputs are provided, mounts the ConfigMap (single-file subPath) at CABundleMountPath and sets REQUESTS_CA_BUNDLE.
Job manager callsite update
controllers/job_mgr/job_mgr_controller.go
PodSets() now calls lmes.CreatePod with nil for the new CA bundle parameter slot.
Test updates and new CA/generation tests
controllers/lmes/lmevaljob_controller_test.go, controllers/lmes/lmevaljob_controller_suite_test.go
Updates all CreatePod test call sites to the new signature and adds unit/integration tests for HTTPS detection, explicit verify_certificate, last-scheduled-generation parsing, CA bundle mount/env behavior, CA merge contents, and re-run after spec change.

Sequence Diagram(s)

sequenceDiagram
  participant Controller
  participant ConfigMapAPI as "k8s API / ConfigMap"
  participant CreatePod
  participant PodAPI as "k8s API / Pod"

  Controller->>ConfigMapAPI: read odh-trusted/service-ca ConfigMaps
  ConfigMapAPI-->>Controller: PEM data (if present)
  Controller->>ConfigMapAPI: create-or-update merged per-job ConfigMap (merged PEM)
  ConfigMapAPI-->>Controller: merged ConfigMap
  Controller->>CreatePod: CreatePod(..., caBundle, caBundleKey)
  CreatePod->>PodAPI: create Pod with ConfigMap volume (subPath) and env REQUESTS_CA_BUNDLE
  PodAPI-->>CreatePod: Pod created

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

  • #730 — PodSets() still calls CreatePod with nil/"" causing Kueue pod spec to omit injected CA bundle; this PR adds CA bundle parameters and updates PodSets() callsite, addressing that mismatch.

Suggested Labels

lgtm

Suggested Reviewers

  • mariusdanciu
  • ruivieira

Poem

🐰 I stitched PEMs into a single map,

mounted it snug within the app,
When specs jump the job will wake,
HTTPS calls now pass the handshake,
Hooray — the pods all hop and clap.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely describes the two main changes: auto-injecting cluster CA bundles for HTTPS endpoints and allowing job re-runs after spec changes. Both changes are directly supported by the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@SudipSinha SudipSinha self-assigned this May 11, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller_test.go (1)

191-191: ⚡ Quick win

Add tests for the new CA-injection and generation-tracking paths.

These call-site updates keep the suite compiling, but there is still no coverage for the new behavior: HTTPS base_url + implicit CA injection, explicit verify_certificate bypass, missing ConfigMap fallback, and resume-after-spec-change generation tracking. The resume bug above would slip through as-is.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/lmes/lmevaljob_controller_test.go` at line 191, Add unit tests in
lmevaljob_controller_test.go around the CreatePod call to cover the new
CA-injection and generation-tracking code paths: add cases where
job.Spec.BaseURL is HTTPS and VerifyCertificate is omitted to assert implicit CA
injection occurs, a case with VerifyCertificate explicitly set to false to
assert certificate verification is bypassed, a case simulating a missing
ConfigMap to exercise the fallback logic, and a resume scenario where job
generation changes to assert the controller tracks and resumes correctly after
spec changes; reference the CreatePod(...) call and NewDefaultPermissionConfig()
usage to locate where to insert these table-driven subtests and assert the
expected Pod spec/annotations and controller state transitions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 192-213: In the block that detects a spec change for a completed
job (the if where job.Status.State == lmesv1alpha1.CompleteJobState and lastGen
:= getLastScheduledGeneration(job) > 0 && job.Generation > lastGen), delete the
completed Pod (lookup by job.Status.PodName and namespace job.Namespace) before
resetting the job status and calling r.Status().Update; ignore NotFound errors
and handle AlreadyExists/other errors appropriately (return error on unexpected
failures), then clear job.Status.PodName and proceed with resetting fields and
the status update so a new Pod can be created without an AlreadyExists conflict.
- Around line 825-835: When resuming runs in handleResume the controller
recreates pods via CreatePod(Options, job, permConfig, caBundle, caBundleKey,
log) but does not update the LastScheduledGenerationAnnotation, causing a stale
scheduled-generation and triggering an unnecessary second run; modify
handleResume to record the current Job.Generation into the job's annotations
under LastScheduledGenerationAnnotation (the same annotation name used during
initial scheduling) and persist the update (e.g., via the existing client
update/patch helper) immediately after recreating the pod so the resumed run and
the annotation stay in sync.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c20923c5-7e91-4728-bb4b-9713b8d82b0b

📥 Commits

Reviewing files that changed from the base of the PR and between d4d151a and 02759ab.

📒 Files selected for processing (3)
  • controllers/lmes/constants.go
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_test.go

SudipSinha and others added 2 commits May 11, 2026 12:18
When a job is suspended then resumed, handleResume now records the
current spec generation in the LastScheduledGenerationAnnotation before
creating the pod. Without this, editing the spec while a job is
suspended and then resuming it would leave a stale annotation, causing
Reconcile to detect a false generation change and trigger an extra
re-run after the resumed job completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and re-run logic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller_suite_test.go (1)

449-459: ⚡ Quick win

Assert the exact reset contract, not just “not Complete”.

Line 455 only verifies the state changed away from Complete; this can pass even if reset behavior regresses. Assert the expected post-reset state (New) and key cleared fields from the contract.

Suggested tightening
-        if job.Status.State == lmesv1alpha1.CompleteJobState {
-            return fmt.Errorf("job is still Complete, waiting for re-run reset")
-        }
-        return nil
+        if job.Status.State != lmesv1alpha1.NewJobState {
+            return fmt.Errorf("expected New after reset, got %s", job.Status.State)
+        }
+        if job.Status.PodName != "" {
+            return fmt.Errorf("expected PodName to be cleared, got %q", job.Status.PodName)
+        }
+        return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/lmes/lmevaljob_controller_suite_test.go` around lines 449 - 459,
The test currently only checks that job.Status.State is not
lmesv1alpha1.CompleteJobState after the spec change; instead assert the exact
reset contract by waiting for job.Status.State to equal the expected reset state
(e.g., lmesv1alpha1.NewJobState) using the same WaitFor helper, and additionally
assert key cleared fields on the job Status (for example ensure job.Status.RunID
is empty/nil and job.Status.StartTime/FinishTime are zero or unset) to guarantee
the reset cleared runtime metadata; locate the check around WaitFor, job and
job.Status.State to implement these assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 506-511: The closure passed to Consistently is ignoring errors
from k8sClient.Get which can hide transient failures; change it to check the Get
result (e.g., call k8sClient.Get and assert Expect(err).Should(Succeed()) inside
the closure or return a non-matching sentinel state on error) so failures don't
get masked—update the closure around k8sClient.Get(...) that returns
job.Status.State (used with Consistently) to either fail the test on Get error
or convert the error into a distinct JobState that will not equal
lmesv1alpha1.CompleteJobState.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dcd337eb-5b0c-4f3e-8d99-9d5ca0257b03

📥 Commits

Reviewing files that changed from the base of the PR and between 02759ab and bb1e0a6.

📒 Files selected for processing (3)
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
  • controllers/lmes/lmevaljob_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/lmes/lmevaljob_controller.go

The Consistently assertion was silently discarding k8sClient.Get errors,
which could mask test infrastructure failures as false passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
The CreatePod signature was extended with caBundle and caBundleKey
parameters but this call site was not updated, causing a build failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/job_mgr/job_mgr_controller.go`:
- Line 123: The Pod spec used for Kueue scheduling omits CA bundle volumes
because CreatePod is being called with nil/"" for caBundle and caBundleKey in
job_mgr_controller.go; update the call to lmes.CreatePod(lmes.Options,
&job.LMEvalJob, permConfig, ...) to pass the proper ConfigMap reference and key
for the CA bundle (non-nil caBundle and non-empty caBundleKey) so PodSets()
includes the volumes/mounts and REQUESTS_CA_BUNDLE env var, or alternatively
adjust the PodSets() construction path to inject the same CA bundle
volumes/mounts/env when CA will be added at actual pod creation; locate the call
site of CreatePod and supply the same ConfigMap name/key used elsewhere (or
refactor PodSets/CreatePod to accept a shared caBundle value) so scheduling spec
matches the real pod.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28a67dff-aba6-40ca-9535-90913674190e

📥 Commits

Reviewing files that changed from the base of the PR and between bb1e0a6 and 5c843c1.

📒 Files selected for processing (2)
  • controllers/job_mgr/job_mgr_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/lmes/lmevaljob_controller_suite_test.go

Comment thread controllers/job_mgr/job_mgr_controller.go
The metrics code registers a CounterVec with label names from the first
LMEvalJob it sees. If a subsequent job has a different set of modelArgs
(e.g. one has base_url and another doesn't), the label count mismatches
and Prometheus panics with "inconsistent label cardinality".

Fix by always initializing both model_name and base_url labels to empty
strings before populating from modelArgs, ensuring every job produces
the same 6-label set.

This is a pre-existing bug on main that was not surfaced because there
was only one LMES test case. The new CA bundle tests in this PR create
jobs with varying modelArgs, exposing the panic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
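The normalization this commit describes can be sketched with a stdlib-only helper (`labelValues` is a hypothetical name standing in for the operator's actual metrics code; the real code registers a Prometheus CounterVec, which this sketch deliberately omits):

```go
package main

import "fmt"

// labelValues builds a fixed-cardinality label set from a job's modelArgs.
// model_name and base_url are always present (defaulting to ""), so every
// job emits the same label names and Prometheus never observes an
// "inconsistent label cardinality".
func labelValues(modelArgs map[string]string) map[string]string {
	labels := map[string]string{
		"model_name": "",
		"base_url":   "",
	}
	for k, v := range modelArgs {
		if _, known := labels[k]; known {
			labels[k] = v
		}
	}
	return labels
}

func main() {
	// A job that sets base_url but not model_name still yields both labels.
	fmt.Println(labelValues(map[string]string{"base_url": "https://llm.example"}))
}
```

Initializing every label up front, rather than deriving the set from whichever modelArgs a job happens to carry, is what keeps the label count identical across jobs.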
@saichandrapandraju
Contributor

Hi @SudipSinha ,

I tested on my cluster with quay.io/rh-ee-spandraj/tas-operator:pr-729.

  • The CA injection mechanism works — the volume, volume mount, and REQUESTS_CA_BUNDLE env var are all correctly added to the pod. However, the job still fails with SSLCertVerificationError when the model endpoint is a cluster-internal service using OpenShift service-serving certificates (e.g. https://llm-predictor.evals.svc.cluster.local:8443).

  • Root cause: The service cert is signed by the OpenShift service-serving CA, which lives in the openshift-service-ca.crt ConfigMap (key: service-ca.crt), not in odh-trusted-ca-bundle. In my namespace, odh-trusted-ca-bundle has:
    - ca-bundle.crt (224KB) — public/system CAs only, no service-serving CA
    - odh-ca-bundle.crt — empty (0 bytes)

So findCABundle picks ca-bundle.crt, which doesn't contain the CA that signed the service cert. Additionally, REQUESTS_CA_BUNDLE replaces Python's default trust store rather than appending to it. If we fix the CA lookup to use the service-serving CA, we'd lose trust in public CAs.

Suggestion: for cluster-internal svc.cluster.local HTTPS endpoints, also check the openshift-service-ca.crt or service-signing-ca ConfigMaps, and combine the service CA with the system bundle into a merged file so that both public and cluster-internal CAs are trusted.

@ruivieira ruivieira self-requested a review May 13, 2026 02:36
@openshift-ci

openshift-ci Bot commented May 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ruivieira

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SudipSinha SudipSinha added the kind/bug Something isn't working label May 13, 2026
@openshift-ci openshift-ci Bot removed the lgtm label May 13, 2026
@openshift-ci

openshift-ci Bot commented May 13, 2026

New changes are detected. LGTM label has been removed.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller.go (1)

1875-1888: ⚡ Quick win

Use controllerutil.SetControllerReference() for the merged ConfigMap.

This manually builds the owner reference for the new ConfigMap instead of using the controller-runtime helper the repo standardizes on. Switching to controllerutil.SetControllerReference() keeps ownership wiring consistent and avoids drifting fields across controller-created resources.

Suggested refactor
 		mergedCM = &corev1.ConfigMap{
 			ObjectMeta: v1.ObjectMeta{
 				Name:      mergedCMName,
 				Namespace: job.Namespace,
-				OwnerReferences: []v1.OwnerReference{
-					{
-						APIVersion: job.APIVersion,
-						Kind:       job.Kind,
-						Name:       job.Name,
-						Controller: &ownerRefController,
-						UID:        job.UID,
-					},
-				},
 			},
 			Data: map[string]string{
 				MergedCABundleKey: merged,
 			},
 		}
+		if err := controllerutil.SetControllerReference(job, mergedCM, r.Scheme); err != nil {
+			return nil, "", fmt.Errorf("failed to set owner reference on merged CA ConfigMap: %w", err)
+		}
 		if err := r.Create(ctx, mergedCM); err != nil {
 			return nil, "", fmt.Errorf("failed to create merged CA ConfigMap: %w", err)
 		}

As per coding guidelines, "Use owner references via controllerutil.SetControllerReference() for resource cleanup and garbage collection".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/lmes/lmevaljob_controller.go` around lines 1875 - 1888, The code
manually constructs OwnerReferences for the merged ConfigMap (mergedCM) using
ownerRefController and job fields; replace this manual wiring by creating the
ConfigMap without OwnerReferences and then call
controllerutil.SetControllerReference(job, mergedCM, r.Scheme) (using the
reconciler's scheme) to set ownership; ensure you import controllerutil and
handle the error returned by SetControllerReference before creating/updating the
ConfigMap so ownership is set consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 567-579: The current reconcile branch around
hasHTTPSBaseURL/hasExplicitVerifyCertificate calls findAndMergeCABundle and
treats any error as "no CA bundle", which hides real Create/Update failures;
change the logic in the block that calls r.findAndMergeCABundle (and the
analogous code in handleResume) to inspect the returned error: if the error is
the specific "no source CA data" sentinel (or error type/value that indicates no
data) then log and proceed as now, but for any other error return that error (or
requeue by returning ctrl.Result{}, err) so reconcile will retry; keep assigning
caBundle/caBundleKey only on nil error. Reference findAndMergeCABundle,
hasHTTPSBaseURL, hasExplicitVerifyCertificate, and handleResume when locating
the code to change.
- Around line 1894-1904: The controller now creates and updates a per-job
ConfigMap (see mergedCM and the code paths that call r.Create and r.Update to
write MergedCABundleKey), but the kubebuilder RBAC markers only grant
get/watch/list for ConfigMaps; add a kubebuilder RBAC marker to permit create,
update (and patch/delete if you want full lifecycle) on ConfigMaps so the
generated ClusterRole allows the controller to write the merged CA bundle
in-cluster (e.g. add a line like
//+kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
next to the existing RBAC markers in the lmevaljob controller file).

---

Nitpick comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 1875-1888: The code manually constructs OwnerReferences for the
merged ConfigMap (mergedCM) using ownerRefController and job fields; replace
this manual wiring by creating the ConfigMap without OwnerReferences and then
call controllerutil.SetControllerReference(job, mergedCM, r.Scheme) (using the
reconciler's scheme) to set ownership; ensure you import controllerutil and
handle the error returned by SetControllerReference before creating/updating the
ConfigMap so ownership is set consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: efa2e1cf-2c90-4571-a7d3-1b9b8c2a8f9f

📥 Commits

Reviewing files that changed from the base of the PR and between be2e76c and 554d56a.

📒 Files selected for processing (3)
  • controllers/lmes/constants.go
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
✅ Files skipped from review due to trivial changes (1)
  • controllers/lmes/constants.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • controllers/lmes/lmevaljob_controller_suite_test.go

@SudipSinha SudipSinha force-pushed the fix/lmeval-https-ssl-rerun branch 2 times, most recently from 89f7544 to c2b8257 on May 13, 2026 10:41
The previous CA injection mounted a single key from odh-trusted-ca-bundle,
which contains only public/system CAs. Cluster-internal services using
OpenShift service-serving certificates (*.svc.cluster.local) are signed by
a different CA in the openshift-service-ca.crt ConfigMap, so the pod still
got SSLCertVerificationError.

Additionally, REQUESTS_CA_BUNDLE replaces Python's default trust store
rather than appending, so mounting only one CA source loses trust in all
others.

Fix: replace findCABundle with findAndMergeCABundle, which collects PEM
data from both odh-trusted-ca-bundle and openshift-service-ca.crt
(best-effort, each skipped if absent), concatenates them, and creates a
per-job merged ConfigMap (<jobName>-ca-bundle) with an owner reference
for automatic GC. The pod mounts this merged ConfigMap so
REQUESTS_CA_BUNDLE contains all relevant CAs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
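The collection step this commit describes can be sketched stdlib-only (`collectCABundle` and the map standing in for the API reader are illustrative names, not the operator's actual code; the real reconciler reads ConfigMaps through the Kubernetes client and then writes the merged data into the per-job ConfigMap):

```go
package main

import (
	"fmt"
	"strings"
)

// caSource names a ConfigMap key to read PEM data from.
type caSource struct{ configMap, key string }

// collectCABundle concatenates PEM data from the well-known sources,
// best-effort: absent ConfigMaps and empty keys are skipped. The
// cluster map ("namespace/name" -> data) stands in for the API reader.
func collectCABundle(cluster map[string]map[string]string, namespace string) string {
	sources := []caSource{
		{"odh-trusted-ca-bundle", "ca-bundle.crt"},
		{"odh-trusted-ca-bundle", "odh-ca-bundle.crt"},
		{"openshift-service-ca.crt", "service-ca.crt"},
	}
	var merged strings.Builder
	for _, s := range sources {
		cm, ok := cluster[namespace+"/"+s.configMap]
		if !ok {
			continue // ConfigMap absent: skip, don't fail
		}
		if pem := strings.TrimSpace(cm[s.key]); pem != "" {
			merged.WriteString(pem)
			merged.WriteString("\n") // keep adjacent certificates from fusing
		}
	}
	return merged.String()
}

func main() {
	cluster := map[string]map[string]string{
		"evals/odh-trusted-ca-bundle":    {"ca-bundle.crt": "PUBLIC-CAS", "odh-ca-bundle.crt": ""},
		"evals/openshift-service-ca.crt": {"service-ca.crt": "SERVICE-CA"},
	}
	// Both public and service-serving CA data end up in one bundle.
	fmt.Println(collectCABundle(cluster, "evals"))
}
```

Because REQUESTS_CA_BUNDLE replaces Python's default trust store rather than appending to it, concatenating all sources into one file, instead of picking a single key, is what keeps both public and cluster-internal endpoints trusted at once.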
@SudipSinha SudipSinha force-pushed the fix/lmeval-https-ssl-rerun branch from c2b8257 to 28f145c on May 13, 2026 10:50
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 337-340: The Gomega assertion uses a printf-style format string
which Gomega won't process; change the Expect call that checks
apierrors.IsAlreadyExists(err) after creating svcCAConfigMap to build the
message with fmt.Sprintf (e.g. fmt.Sprintf("unexpected error creating service CA
ConfigMap: %v", err)) and pass that single string as the failure message, or
alternatively assert on err directly (e.g.
Expect(err).ToNot(HaveOccurred()))—update the
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), ...) invocation accordingly.
- Around line 615-619: The Gomega assertion passes a printf-style format and
args to To(), which won't process format verbs; change the assertion so the
message is a single formatted string — e.g., when checking
apierrors.IsAlreadyExists(err) after k8sClient.Create(ctx, caConfigMap), wrap
the message with fmt.Sprintf (or build the string) and pass that single string
to Expect(...).To(BeTrue(), <formatted message>), referencing the existing call
sites: k8sClient.Create, caConfigMap, apierrors.IsAlreadyExists, Expect and
BeTrue.
- Around line 230-233: The Gomega assertion uses a format verb in the message
which won't be expanded; update the Expect call that wraps k8sClient.Create and
apierrors.IsAlreadyExists to pass a fully formatted string (e.g. use
fmt.Sprintf("unexpected error creating CA ConfigMap: %v", err)) as the second
argument to To, and add an import for fmt if it's not already present; target
the Expect(...) invocation that calls apierrors.IsAlreadyExists(err) and the
surrounding k8sClient.Create error branch.
- Around line 323-326: The Gomega assertion passes a format verb to To() which
doesn't process fmt verbs; change the call that checks the result of
k8sClient.Create for odhCAConfigMap from
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), "unexpected error creating
ODH CA ConfigMap: %v", err) to build the message beforehand (e.g. use
fmt.Sprintf) and pass the final string:
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), fmt.Sprintf("unexpected
error creating ODH CA ConfigMap: %v", err)); remember to add the fmt import to
the test file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 207e1fa1-39ea-4f0a-a500-53d9bf5778cd

📥 Commits

Reviewing files that changed from the base of the PR and between be2e76c and 28f145c.

📒 Files selected for processing (5)
  • config/components/lmes/rbac/manager-rbac.yaml
  • controllers/lmes/constants.go
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_suite_test.go
  • controllers/lmes/lmevaljob_controller_test.go
✅ Files skipped from review due to trivial changes (2)
  • config/components/lmes/rbac/manager-rbac.yaml
  • controllers/lmes/constants.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • controllers/lmes/lmevaljob_controller.go
  • controllers/lmes/lmevaljob_controller_test.go

@openshift-ci

openshift-ci Bot commented May 13, 2026

@SudipSinha: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/trustyai-service-operator-e2e
Commit: 28f145c
Details: link
Required: true
Rerun command: /test trustyai-service-operator-e2e

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@SudipSinha SudipSinha merged commit b50a816 into trustyai-explainability:main May 14, 2026
8 of 10 checks passed
@SudipSinha SudipSinha deleted the fix/lmeval-https-ssl-rerun branch May 14, 2026 07:54