fix(lmeval): auto-inject cluster CA bundle for HTTPS endpoints; allow re-run after spec change#729
Conversation
…n on spec change

Bug 1 (RHOAIENG-60487): LMEval jobs created against models with external HTTPS routes fail with SSLCertVerificationError because the lm-eval pod has no cluster CA in its trust store. Fix: when any modelArg base_url uses https:// and verify_certificate is not already set, look up the standard RHOAI ConfigMap odh-trusted-ca-bundle in the job namespace, mount it into the pod, and set REQUESTS_CA_BUNDLE so Python's requests library picks it up automatically. The lookup is best-effort; missing ConfigMap is logged and the job proceeds unchanged.

Bug 3 (RHOAIENG-60487): Once a job reaches CompleteJobState (including Failed reason) it is impossible to re-run it by editing the spec, because the operator ignores spec changes on completed jobs. Fix: record metadata.Generation in an annotation (trustyai.opendatahub.io/last-scheduled-generation) whenever a pod is created. On each reconcile, if a completed job's current generation exceeds the stored value the status is reset to New and the job re-runs with the updated configuration. The lastGen > 0 guard means jobs that predate this change are never accidentally reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
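A minimal sketch of the detection step this describes; the helper names `hasHTTPSBaseURL` and `hasExplicitVerifyCertificate` come from this PR, but the argument type below is a stand-in for the real LMEvalJob API type:

```go
package lmes

import "strings"

// arg is a stand-in for the LMEvalJob modelArg name/value pair type.
type arg struct {
	Name  string
	Value string
}

// hasHTTPSBaseURL reports whether any modelArg named "base_url" points at an
// https:// endpoint, which is what triggers the CA bundle injection.
func hasHTTPSBaseURL(modelArgs []arg) bool {
	for _, a := range modelArgs {
		if a.Name == "base_url" && strings.HasPrefix(strings.ToLower(a.Value), "https://") {
			return true
		}
	}
	return false
}

// hasExplicitVerifyCertificate reports whether the user already set
// verify_certificate themselves; in that case the operator must not override
// their choice with an injected CA bundle.
func hasExplicitVerifyCertificate(modelArgs []arg) bool {
	for _, a := range modelArgs {
		if a.Name == "verify_certificate" {
			return true
		}
	}
	return false
}
```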
📝 Walkthrough
Adds cluster CA bundle auto-injection into lm-eval pods for HTTPS model endpoints (when verify_certificate is unset) and enables re-execution of completed LMEvalJobs when their spec changes by recording and comparing scheduled generation annotations.
Changes: CA Bundle Injection and Job Rescheduling
Sequence Diagram(s)
sequenceDiagram
participant Controller
participant ConfigMapAPI as "k8s API / ConfigMap"
participant CreatePod
participant PodAPI as "k8s API / Pod"
Controller->>ConfigMapAPI: read odh-trusted/service-ca ConfigMaps
ConfigMapAPI-->>Controller: PEM data (if present)
Controller->>ConfigMapAPI: create-or-update merged per-job ConfigMap (merged PEM)
ConfigMapAPI-->>Controller: merged ConfigMap
Controller->>CreatePod: CreatePod(..., caBundle, caBundleKey)
CreatePod->>PodAPI: create Pod with ConfigMap volume (subPath) and env REQUESTS_CA_BUNDLE
PodAPI-->>CreatePod: Pod created
Actionable comments posted: 2
🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller_test.go (1)
191-191: ⚡ Quick win: Add tests for the new CA-injection and generation-tracking paths.
These call-site updates keep the suite compiling, but there is still no coverage for the new behavior: HTTPS `base_url` + implicit CA injection, explicit `verify_certificate` bypass, missing ConfigMap fallback, and resume-after-spec-change generation tracking. The resume bug above would slip through as-is.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controllers/lmes/lmevaljob_controller_test.go` at line 191, Add unit tests in lmevaljob_controller_test.go around the CreatePod call to cover the new CA-injection and generation-tracking code paths: add cases where job.Spec.BaseURL is HTTPS and VerifyCertificate is omitted to assert implicit CA injection occurs, a case with VerifyCertificate explicitly set to false to assert certificate verification is bypassed, a case simulating a missing ConfigMap to exercise the fallback logic, and a resume scenario where job generation changes to assert the controller tracks and resumes correctly after spec changes; reference the CreatePod(...) call and NewDefaultPermissionConfig() usage to locate where to insert these table-driven subtests and assert the expected Pod spec/annotations and controller state transitions.
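One shape the suggested coverage could take; this is a skeleton only, since the job fixture and the Pod-spec assertions have to be wired up to the existing CreatePod call site and helpers in that file:

```go
package lmes

import "testing"

// TestCreatePodCABundleScenarios sketches the table-driven coverage suggested
// above; the scenario matrix comes from the review comment, while wiring each
// case to CreatePod and the Pod-spec assertions is left as a placeholder.
func TestCreatePodCABundleScenarios(t *testing.T) {
	cases := []struct {
		name           string
		baseURL        string
		verifyCertSet  bool
		expectCABundle bool
	}{
		{"https, verify_certificate unset", "https://model.example/v1", false, true},
		{"https, verify_certificate explicit", "https://model.example/v1", true, false},
		{"plain http endpoint", "http://model.example/v1", false, false},
	}
	for _, tc := range cases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			// Build an LMEvalJob from tc.baseURL / tc.verifyCertSet, call
			// CreatePod(...) as the existing tests do, and assert that the CA
			// volume, mount and REQUESTS_CA_BUNDLE env var are present exactly
			// when tc.expectCABundle is true.
			_ = tc
		})
	}
}
```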
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 192-213: In the block that detects a spec change for a completed
job (the if where job.Status.State == lmesv1alpha1.CompleteJobState and lastGen
:= getLastScheduledGeneration(job) > 0 && job.Generation > lastGen), delete the
completed Pod (lookup by job.Status.PodName and namespace job.Namespace) before
resetting the job status and calling r.Status().Update; ignore NotFound errors
and handle AlreadyExists/other errors appropriately (return error on unexpected
failures), then clear job.Status.PodName and proceed with resetting fields and
the status update so a new Pod can be created without an AlreadyExists conflict.
- Around line 825-835: When resuming runs in handleResume the controller
recreates pods via CreatePod(Options, job, permConfig, caBundle, caBundleKey,
log) but does not update the LastScheduledGenerationAnnotation, causing a stale
scheduled-generation and triggering an unnecessary second run; modify
handleResume to record the current Job.Generation into the job's annotations
under LastScheduledGenerationAnnotation (the same annotation name used during
initial scheduling) and persist the update (e.g., via the existing client
update/patch helper) immediately after recreating the pod so the resumed run and
the annotation stay in sync.
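A sketch of that recording step, assuming a controller-runtime client; the annotation key is the one this PR introduces, the rest is illustrative:

```go
package lmes

import (
	"context"
	"strconv"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// LastScheduledGenerationAnnotation is the annotation key used by this PR.
const LastScheduledGenerationAnnotation = "trustyai.opendatahub.io/last-scheduled-generation"

// recordScheduledGeneration stamps the job with the generation that was just
// scheduled, so a later reconcile can tell a genuine spec edit apart from the
// run it already started (both on initial scheduling and on resume).
func recordScheduledGeneration(ctx context.Context, c client.Client, job client.Object) error {
	annotations := job.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[LastScheduledGenerationAnnotation] = strconv.FormatInt(job.GetGeneration(), 10)
	job.SetAnnotations(annotations)
	return c.Update(ctx, job)
}
```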
---
Nitpick comments:
In `@controllers/lmes/lmevaljob_controller_test.go`:
- Line 191: Add unit tests in lmevaljob_controller_test.go around the CreatePod
call to cover the new CA-injection and generation-tracking code paths: add cases
where job.Spec.BaseURL is HTTPS and VerifyCertificate is omitted to assert
implicit CA injection occurs, a case with VerifyCertificate explicitly set to
false to assert certificate verification is bypassed, a case simulating a
missing ConfigMap to exercise the fallback logic, and a resume scenario where
job generation changes to assert the controller tracks and resumes correctly
after spec changes; reference the CreatePod(...) call and
NewDefaultPermissionConfig() usage to locate where to insert these table-driven
subtests and assert the expected Pod spec/annotations and controller state
transitions.
📒 Files selected for processing (3)
controllers/lmes/constants.go
controllers/lmes/lmevaljob_controller.go
controllers/lmes/lmevaljob_controller_test.go
When a job is suspended then resumed, handleResume now records the current spec generation in the LastScheduledGenerationAnnotation before creating the pod. Without this, editing the spec while a job is suspended and then resuming it would leave a stale annotation, causing Reconcile to detect a false generation change and trigger an extra re-run after the resumed job completes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and re-run logic Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Actionable comments posted: 1
🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller_suite_test.go (1)
449-459: ⚡ Quick win: Assert the exact reset contract, not just "not Complete".
Line 455 only verifies the state changed away from Complete; this can pass even if reset behavior regresses. Assert the expected post-reset state (New) and key cleared fields from the contract.
Suggested tightening:
```diff
-	if job.Status.State == lmesv1alpha1.CompleteJobState {
-		return fmt.Errorf("job is still Complete, waiting for re-run reset")
-	}
-	return nil
+	if job.Status.State != lmesv1alpha1.NewJobState {
+		return fmt.Errorf("expected New after reset, got %s", job.Status.State)
+	}
+	if job.Status.PodName != "" {
+		return fmt.Errorf("expected PodName to be cleared, got %q", job.Status.PodName)
+	}
+	return nil
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controllers/lmes/lmevaljob_controller_suite_test.go` around lines 449 - 459, The test currently only checks that job.Status.State is not lmesv1alpha1.CompleteJobState after the spec change; instead assert the exact reset contract by waiting for job.Status.State to equal the expected reset state (e.g., lmesv1alpha1.NewJobState) using the same WaitFor helper, and additionally assert key cleared fields on the job Status (for example ensure job.Status.RunID is empty/nil and job.Status.StartTime/FinishTime are zero or unset) to guarantee the reset cleared runtime metadata; locate the check around WaitFor, job and job.Status.State to implement these assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 506-511: The closure passed to Consistently is ignoring errors
from k8sClient.Get which can hide transient failures; change it to check the Get
result (e.g., call k8sClient.Get and assert Expect(err).Should(Succeed()) inside
the closure or return a non-matching sentinel state on error) so failures don't
get masked—update the closure around k8sClient.Get(...) that returns
job.Status.State (used with Consistently) to either fail the test on Get error
or convert the error into a distinct JobState that will not equal
lmesv1alpha1.CompleteJobState.
---
Nitpick comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 449-459: The test currently only checks that job.Status.State is
not lmesv1alpha1.CompleteJobState after the spec change; instead assert the
exact reset contract by waiting for job.Status.State to equal the expected reset
state (e.g., lmesv1alpha1.NewJobState) using the same WaitFor helper, and
additionally assert key cleared fields on the job Status (for example ensure
job.Status.RunID is empty/nil and job.Status.StartTime/FinishTime are zero or
unset) to guarantee the reset cleared runtime metadata; locate the check around
WaitFor, job and job.Status.State to implement these assertions.
📒 Files selected for processing (3)
controllers/lmes/lmevaljob_controller.go
controllers/lmes/lmevaljob_controller_suite_test.go
controllers/lmes/lmevaljob_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- controllers/lmes/lmevaljob_controller.go
The Consistently assertion was silently discarding k8sClient.Get errors, which could mask test infrastructure failures as false passes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
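A sketch of the tightened polling closure; it assumes the suite's existing dot-imported Gomega helpers, the package-level k8sClient/ctx, and the lmesv1alpha1 alias used elsewhere in the suite:

```go
// getJobState re-reads the job and fails the surrounding assertion if the read
// itself errors, so an infrastructure failure can no longer masquerade as a
// passing Consistently check.
func getJobState(g Gomega, key types.NamespacedName) lmesv1alpha1.JobState {
	fresh := &lmesv1alpha1.LMEvalJob{}
	g.Expect(k8sClient.Get(ctx, key, fresh)).To(Succeed())
	return fresh.Status.State
}

// Used from the existing test roughly like this:
//
//	Consistently(func(g Gomega) lmesv1alpha1.JobState {
//		return getJobState(g, jobKey)
//	}, timeout, interval).Should(Equal(lmesv1alpha1.CompleteJobState))
```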
The CreatePod signature was extended with caBundle and caBundleKey parameters but this call site was not updated, causing a build failure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/job_mgr/job_mgr_controller.go`:
- Line 123: The Pod spec used for Kueue scheduling omits CA bundle volumes
because CreatePod is being called with nil/"" for caBundle and caBundleKey in
job_mgr_controller.go; update the call to lmes.CreatePod(lmes.Options,
&job.LMEvalJob, permConfig, ...) to pass the proper ConfigMap reference and key
for the CA bundle (non-nil caBundle and non-empty caBundleKey) so PodSets()
includes the volumes/mounts and REQUESTS_CA_BUNDLE env var, or alternatively
adjust the PodSets() construction path to inject the same CA bundle
volumes/mounts/env when CA will be added at actual pod creation; locate the call
site of CreatePod and supply the same ConfigMap name/key used elsewhere (or
refactor PodSets/CreatePod to accept a shared caBundle value) so scheduling spec
matches the real pod.
📒 Files selected for processing (2)
controllers/job_mgr/job_mgr_controller.go
controllers/lmes/lmevaljob_controller_suite_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- controllers/lmes/lmevaljob_controller_suite_test.go
The metrics code registers a CounterVec with label names from the first LMEvalJob it sees. If a subsequent job has a different set of modelArgs (e.g. one has base_url and another doesn't), the label count mismatches and Prometheus panics with "inconsistent label cardinality". Fix by always initializing both model_name and base_url labels to empty strings before populating from modelArgs, ensuring every job produces the same 6-label set. This is a pre-existing bug on main that was not surfaced because there was only one LMES test case. The new CA bundle tests in this PR create jobs with varying modelArgs, exposing the panic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
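A sketch of that initialization; which modelArg names feed which label, and the other four labels of the real 6-label set, are assumptions here:

```go
package lmes

import "github.com/prometheus/client_golang/prometheus"

// modelArg is a stand-in for the LMEvalJob modelArg name/value pair type.
type modelArg struct {
	Name  string
	Value string
}

// buildJobLabels always emits the same label keys, defaulting model_name and
// base_url to "" when the corresponding modelArg is missing, so the CounterVec
// never sees a varying label set and cannot panic on cardinality.
func buildJobLabels(modelArgs []modelArg) prometheus.Labels {
	labels := prometheus.Labels{"model_name": "", "base_url": ""}
	for _, a := range modelArgs {
		switch a.Name {
		case "model":
			labels["model_name"] = a.Value
		case "base_url":
			labels["base_url"] = a.Value
		}
	}
	return labels
}
```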
Hi @SudipSinha, I tested on my cluster with
So Maybe: For cluster-internal
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ruivieira The full list of commands accepted by this bot can be found here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
New changes are detected. LGTM label has been removed.
Actionable comments posted: 2
🧹 Nitpick comments (1)
controllers/lmes/lmevaljob_controller.go (1)
1875-1888: ⚡ Quick win: Use `controllerutil.SetControllerReference()` for the merged ConfigMap.
This manually builds the owner reference for the new ConfigMap instead of using the controller-runtime helper the repo standardizes on. Switching to `controllerutil.SetControllerReference()` keeps ownership wiring consistent and avoids drifting fields across controller-created resources.
Suggested refactor:
```diff
 mergedCM = &corev1.ConfigMap{
 	ObjectMeta: v1.ObjectMeta{
 		Name:      mergedCMName,
 		Namespace: job.Namespace,
-		OwnerReferences: []v1.OwnerReference{
-			{
-				APIVersion: job.APIVersion,
-				Kind:       job.Kind,
-				Name:       job.Name,
-				Controller: &ownerRefController,
-				UID:        job.UID,
-			},
-		},
 	},
 	Data: map[string]string{
 		MergedCABundleKey: merged,
 	},
 }
+if err := controllerutil.SetControllerReference(job, mergedCM, r.Scheme); err != nil {
+	return nil, "", fmt.Errorf("failed to set owner reference on merged CA ConfigMap: %w", err)
+}
 if err := r.Create(ctx, mergedCM); err != nil {
 	return nil, "", fmt.Errorf("failed to create merged CA ConfigMap: %w", err)
 }
```
As per coding guidelines, "Use owner references via `controllerutil.SetControllerReference()` for resource cleanup and garbage collection".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controllers/lmes/lmevaljob_controller.go` around lines 1875 - 1888, The code manually constructs OwnerReferences for the merged ConfigMap (mergedCM) using ownerRefController and job fields; replace this manual wiring by creating the ConfigMap without OwnerReferences and then call controllerutil.SetControllerReference(job, mergedCM, r.Scheme) (using the reconciler's scheme) to set ownership; ensure you import controllerutil and handle the error returned by SetControllerReference before creating/updating the ConfigMap so ownership is set consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 567-579: The current reconcile branch around
hasHTTPSBaseURL/hasExplicitVerifyCertificate calls findAndMergeCABundle and
treats any error as "no CA bundle", which hides real Create/Update failures;
change the logic in the block that calls r.findAndMergeCABundle (and the
analogous code in handleResume) to inspect the returned error: if the error is
the specific "no source CA data" sentinel (or error type/value that indicates no
data) then log and proceed as now, but for any other error return that error (or
requeue by returning ctrl.Result{}, err) so reconcile will retry; keep assigning
caBundle/caBundleKey only on nil error. Reference findAndMergeCABundle,
hasHTTPSBaseURL, hasExplicitVerifyCertificate, and handleResume when locating
the code to change.
- Around line 1894-1904: The controller now creates and updates a per-job
ConfigMap (see mergedCM and the code paths that call r.Create and r.Update to
write MergedCABundleKey), but the kubebuilder RBAC markers only grant
get/watch/list for ConfigMaps; add a kubebuilder RBAC marker to permit create,
update (and patch/delete if you want full lifecycle) on ConfigMaps so the
generated ClusterRole allows the controller to write the merged CA bundle
in-cluster (e.g. add a line like
//+kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
next to the existing RBAC markers in the lmevaljob controller file).
---
Nitpick comments:
In `@controllers/lmes/lmevaljob_controller.go`:
- Around line 1875-1888: The code manually constructs OwnerReferences for the
merged ConfigMap (mergedCM) using ownerRefController and job fields; replace
this manual wiring by creating the ConfigMap without OwnerReferences and then
call controllerutil.SetControllerReference(job, mergedCM, r.Scheme) (using the
reconciler's scheme) to set ownership; ensure you import controllerutil and
handle the error returned by SetControllerReference before creating/updating the
ConfigMap so ownership is set consistently.
📒 Files selected for processing (3)
controllers/lmes/constants.go
controllers/lmes/lmevaljob_controller.go
controllers/lmes/lmevaljob_controller_suite_test.go
✅ Files skipped from review due to trivial changes (1)
- controllers/lmes/constants.go
🚧 Files skipped from review as they are similar to previous changes (1)
- controllers/lmes/lmevaljob_controller_suite_test.go
89f7544 to c2b8257
The previous CA injection mounted a single key from odh-trusted-ca-bundle, which contains only public/system CAs. Cluster-internal services using OpenShift service-serving certificates (*.svc.cluster.local) are signed by a different CA in the openshift-service-ca.crt ConfigMap, so the pod still got SSLCertVerificationError. Additionally, REQUESTS_CA_BUNDLE replaces Python's default trust store rather than appending, so mounting only one CA source loses trust in all others. Fix: replace findCABundle with findAndMergeCABundle, which collects PEM data from both odh-trusted-ca-bundle and openshift-service-ca.crt (best-effort, each skipped if absent), concatenates them, and creates a per-job merged ConfigMap (<jobName>-ca-bundle) with an owner reference for automatic GC. The pod mounts this merged ConfigMap so REQUESTS_CA_BUNDLE contains all relevant CAs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
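A rough sketch of that merge under assumed names; the sentinel error lets callers distinguish "nothing to inject" from a real API failure that should requeue the reconcile:

```go
package lmes

import (
	"context"
	"errors"
	"strings"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// errNoCASourceData signals that no source ConfigMap contributed PEM data;
// callers treat it as "skip injection", not as a reconcile failure.
var errNoCASourceData = errors.New("no source CA data found")

// mergeCASources concatenates PEM data from the well-known ConfigMap keys,
// skipping source ConfigMaps that do not exist (best-effort) and surfacing
// any other API error so the reconciler retries.
func mergeCASources(ctx context.Context, c client.Client, namespace string) (string, error) {
	sources := []struct {
		name string
		keys []string
	}{
		{"odh-trusted-ca-bundle", []string{"ca-bundle.crt", "odh-ca-bundle.crt"}},
		{"openshift-service-ca.crt", []string{"service-ca.crt"}},
	}
	var parts []string
	for _, src := range sources {
		cm := &corev1.ConfigMap{}
		err := c.Get(ctx, types.NamespacedName{Name: src.name, Namespace: namespace}, cm)
		if apierrors.IsNotFound(err) {
			continue
		}
		if err != nil {
			return "", err
		}
		for _, key := range src.keys {
			if pem := cm.Data[key]; pem != "" {
				parts = append(parts, pem)
			}
		}
	}
	if len(parts) == 0 {
		return "", errNoCASourceData
	}
	return strings.Join(parts, "\n"), nil
}
```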
c2b8257 to 28f145c
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/lmes/lmevaljob_controller_suite_test.go`:
- Around line 337-340: The Gomega assertion uses a printf-style format string
which Gomega won't process; change the Expect call that checks
apierrors.IsAlreadyExists(err) after creating svcCAConfigMap to build the
message with fmt.Sprintf (e.g. fmt.Sprintf("unexpected error creating service CA
ConfigMap: %v", err)) and pass that single string as the failure message, or
alternatively assert on err directly (e.g.
Expect(err).ToNot(HaveOccurred()))—update the
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), ...) invocation accordingly.
- Around line 615-619: The Gomega assertion passes a printf-style format and
args to To(), which won't process format verbs; change the assertion so the
message is a single formatted string — e.g., when checking
apierrors.IsAlreadyExists(err) after k8sClient.Create(ctx, caConfigMap), wrap
the message with fmt.Sprintf (or build the string) and pass that single string
to Expect(...).To(BeTrue(), <formatted message>), referencing the existing call
sites: k8sClient.Create, caConfigMap, apierrors.IsAlreadyExists, Expect and
BeTrue.
- Around line 230-233: The Gomega assertion uses a format verb in the message
which won't be expanded; update the Expect call that wraps k8sClient.Create and
apierrors.IsAlreadyExists to pass a fully formatted string (e.g. use
fmt.Sprintf("unexpected error creating CA ConfigMap: %v", err)) as the second
argument to To, and add an import for fmt if it's not already present; target
the Expect(...) invocation that calls apierrors.IsAlreadyExists(err) and the
surrounding k8sClient.Create error branch.
- Around line 323-326: The Gomega assertion passes a format verb to To() which
doesn't process fmt verbs; change the call that checks the result of
k8sClient.Create for odhCAConfigMap from
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), "unexpected error creating
ODH CA ConfigMap: %v", err) to build the message beforehand (e.g. use
fmt.Sprintf) and pass the final string:
Expect(apierrors.IsAlreadyExists(err)).To(BeTrue(), fmt.Sprintf("unexpected
error creating ODH CA ConfigMap: %v", err)); remember to add the fmt import to
the test file.
📒 Files selected for processing (5)
config/components/lmes/rbac/manager-rbac.yaml
controllers/lmes/constants.go
controllers/lmes/lmevaljob_controller.go
controllers/lmes/lmevaljob_controller_suite_test.go
controllers/lmes/lmevaljob_controller_test.go
✅ Files skipped from review due to trivial changes (2)
- config/components/lmes/rbac/manager-rbac.yaml
- controllers/lmes/constants.go
🚧 Files skipped from review as they are similar to previous changes (2)
- controllers/lmes/lmevaljob_controller.go
- controllers/lmes/lmevaljob_controller_test.go
@SudipSinha: The following test failed, say
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
b50a816 into trustyai-explainability:main
Jira: RHOAIENG-60487
Summary
Fixes two issues in the LMEval operator.
SSL failure on HTTPS endpoints (external routes and cluster-internal services)
When a model is deployed with "Make model deployment available through an external route" enabled, OpenShift creates an HTTPS ingress signed by the cluster's self-signed CA. Similarly, cluster-internal services (`*.svc.cluster.local`) use certificates signed by the OpenShift service-serving CA. The lm-eval pod has no knowledge of either CA, so Python's `requests` library rejects the connection with an `SSLCertVerificationError`. Note that `REQUESTS_CA_BUNDLE` replaces Python's default trust store entirely, so mounting a single CA source loses trust in all others.

Fix: in `handleNewCR` and `handleResume`, if any `modelArg` named `base_url` uses an `https://` scheme and `verify_certificate` is not already set, the operator collects CA certificates from multiple well-known cluster sources:
- `odh-trusted-ca-bundle`: `ca-bundle.crt` and `odh-ca-bundle.crt` keys (public/system CAs, injected by the RHOAI operator)
- `openshift-service-ca.crt`: `service-ca.crt` key (service-serving CA for `*.svc.cluster.local`)

All found PEM data is concatenated into a per-job merged ConfigMap (`<jobName>-ca-bundle`) with an owner reference to the LMEvalJob for automatic GC. The merged ConfigMap is mounted at `/etc/ssl/certs/odh-ca-bundle.crt` and `REQUESTS_CA_BUNDLE` is set to that path. Each source is best-effort — missing ConfigMaps are skipped, but real API errors (Create/Update failures on the merged ConfigMap) are propagated so the reconciler retries.
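For illustration, a sketch of the pod-spec additions this produces; the volume name and the merged-bundle key below are assumptions, while the mount path and env var match the description above:

```go
package lmes

import corev1 "k8s.io/api/core/v1"

// injectCABundle mounts the per-job merged ConfigMap at a fixed path and
// points REQUESTS_CA_BUNDLE at it, so Python's requests library trusts the
// cluster CAs without any user-supplied configuration.
func injectCABundle(pod *corev1.Pod, jobName, bundleKey string) {
	const mountPath = "/etc/ssl/certs/odh-ca-bundle.crt"
	pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
		Name: "ca-bundle", // illustrative volume name
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: jobName + "-ca-bundle"},
			},
		},
	})
	c := &pod.Spec.Containers[0]
	c.VolumeMounts = append(c.VolumeMounts, corev1.VolumeMount{
		Name:      "ca-bundle",
		MountPath: mountPath,
		SubPath:   bundleKey, // key holding the merged PEM inside the ConfigMap
		ReadOnly:  true,
	})
	c.Env = append(c.Env, corev1.EnvVar{
		Name:  "REQUESTS_CA_BUNDLE",
		Value: mountPath,
	})
}
```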
Completed job cannot be re-run after spec edit
Once a job reaches `CompleteJobState` (including `Failed` reason), editing the spec has no effect. The operator does not watch for spec changes on completed jobs, and there is no supported way to reset the status through the status subresource via `oc patch`.

Fix: `handleNewCR` and `handleResume` both write the current `metadata.Generation` into an annotation (`trustyai.opendatahub.io/last-scheduled-generation`) each time a pod is created. In `Reconcile`, a new guard checks completed jobs: if the current generation exceeds the stored value, the completed pod is deleted and the status is fully reset to `New` (clearing `lastScheduleTime`, `completeTime`, `podName`, `reason`, `message`, `progressBars`, `results`); the next reconcile cycle re-runs the job. The `lastGen > 0` guard means jobs created before this change are never accidentally reset. Recording the annotation in `handleResume` also prevents a spurious extra re-run when the spec is edited while a job is suspended and then resumed.
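A compact sketch of that guard, using plain values in place of the real LMEvalJob status and annotation accessors:

```go
package lmes

import "strconv"

// needsRerun reports whether a completed job's spec generation moved past the
// generation recorded at the last scheduling. The lastGen > 0 check keeps jobs
// that predate the annotation from ever being reset.
func needsRerun(state string, generation int64, annotations map[string]string) bool {
	if state != "Complete" {
		return false
	}
	lastGen, err := strconv.ParseInt(annotations["trustyai.opendatahub.io/last-scheduled-generation"], 10, 64)
	if err != nil || lastGen <= 0 {
		return false
	}
	return generation > lastGen
}
```

When this returns true, the controller deletes the completed pod, clears the runtime status fields listed above and sets the state back to New so the next reconcile schedules a fresh run.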
Pre-existing bug fix: metrics label cardinality panic
The `createJobCreationMetrics` function in the controller builds a Prometheus `CounterVec` with label names derived from the first `LMEvalJob` it processes. If a subsequent job has different `modelArgs` (e.g. one has `base_url` and another doesn't), the label count mismatches and Prometheus panics with `inconsistent label cardinality`. This bug exists on `main` but is not surfaced because there is only one LMES test case. Our new CA bundle tests create jobs with varying `modelArgs`, exposing the panic.

Fix: always initialize both `model_name` and `base_url` labels to empty strings before populating from `modelArgs`, ensuring every job produces the same 6-label set. This is committed separately from the main fixes.
Steps to reproduce
Setup
Set `disableLMEval` to `false` in the `OdhDashboardConfig` CR.
Bug 1: SSL failure on HTTPS endpoints
Create an evaluation run from the RHOAI Dashboard UI targeting the deployed model. The UI generates an
LMEvalJobCR withbase_url: https://<model>.<namespace>.apps.<cluster-domain>/v1/completions.The pod starts, downloads the dataset, but fails at the first API request:
The job enters
Completestate withreason: Failed.The RHOAI Dashboard UI provides no mechanism to set
verify_certificateor mount custom CA bundles, so there is no workaround from the UI.Minimal reproducer (CLI):
Any
LMEvalJobwith anhttps://value inbase_urltriggers the issue. The operator does not inject cluster CA certificates, so the pod has no way to verify the server's certificate.Note: Our envtest integration tests use
model: hfwith a syntheticbase_urlbecause they only verify the operator's pod-spec injection (volumes, mounts, env vars) — lm-eval never actually runs in envtest.Bug 2: Completed job cannot be re-run after spec edit
Continuing from Bug 1, the user tries to fix the failed job by editing the spec:
The job is in
Completestate withreason: Failedafter the SSL error.Edit the LMEvalJob spec (e.g. change a modelArg):
oc edit lmevaljob test-ssl # Change the pretrained model value, or add verify_certificateNothing happens — the job stays
Completeand no new pod is created:Attempting to reset the status via
oc patchhas no effect (the status subresource is separate):$ oc patch lmevaljob test-ssl --type=merge -p '{"status": {"state": "New"}}' lmevaljob.trustyai.opendatahub.io/test-ssl patched (no change)Deleting the failed pod does not trigger a new pod creation either.
The only workaround is to delete the entire
LMEvalJoband create a new one.Note: RHOAI Dashboard UI does not expose TLS options (not addressed here)
The RHOAI Dashboard UI creates
LMEvalJobCRs with no mechanism for users to setverify_certificate, mount custom CA bundles, or reference cert secrets. This means jobs created from the UI will always fail against external HTTPS endpoints unless the operator handles CA injection automatically.This PR addresses the common case: when
odh-trusted-ca-bundleexists in the namespace (injected by the RHOAI operator), the cluster CA is mounted transparently and UI-created jobs with HTTPS endpoints work without any user intervention.However, edge cases remain where automatic injection is not sufficient:
odh-trusted-ca-bundleinto the namespaceverify_certificate: "False") for testingThese cases still require CLI intervention (
oc edit/oc patchon theLMEvalJobspec). Exposing TLS options in the Dashboard UI is a separate concern tracked outside this PR.Changed files
controllers/lmes/constants.gocontrollers/lmes/lmevaljob_controller.goCreatePodsignature extended with CA params;findAndMergeCABundlemerges multiple CA sources into per-job ConfigMap;resolveCABundlehelper with proper error handling (sentinel for no-data vs. API errors);recordScheduledGenerationhelper; re-run detection with pod cleanup inReconcile; metrics label cardinality fix; RBAC marker updated tocreate;updatefor ConfigMapscontrollers/lmes/lmevaljob_controller_test.goCreatePodcalls updated to passnil, ""for the new CA bundle params; unit tests forhasHTTPSBaseURL,hasExplicitVerifyCertificate,getLastScheduledGeneration,CreatePodWithCABundle,CreatePodWithoutCABundlecontrollers/lmes/lmevaljob_controller_suite_test.gocontrollers/job_mgr/job_mgr_controller.goCreatePodcall updated for new CA bundle parameters (passesnil, ""— tracked in #730)config/components/lmes/rbac/manager-rbac.yamlcreate;updateverbs for ConfigMapsTesting
Unit / integration tests (envtest)
Tests in
controllers/lmes/lmevaljob_controller_suite_test.gorun against a controller-runtime envtest environment with real CRDs:base_url: https://...gets a merged ConfigMap containing the CA data, with the volume, mount at/etc/ssl/certs/odh-ca-bundle.crt, andREQUESTS_CA_BUNDLEenv varodh-trusted-ca-bundleandopenshift-service-ca.crtexist, the merged ConfigMap contains data from both sourcesbase_url: http://...has no CA volume or env varbase_url: https://...ANDverify_certificate: falsehas no CA volume or env varmetadata.Generation) is reset toNewand a new pod is created; thelast-scheduled-generationannotation is setComplete(no spurious re-run)Manual integration tests (live OpenShift cluster)
Operator run locally (
go run ./cmd/main.go --enable-services LMES) against a live OpenShift cluster. Both bugs were first reproduced onmain, then verified fixed on the PR branch. Each test created anLMEvalJobCR and inspected the resulting pod spec and job status.An
odh-trusted-ca-bundleConfigMap was pre-created in the test namespace to simulate the RHOAI-managed CA bundle.mainREQUESTS_CA_BUNDLEverify_certificatehas no CA injectionNewafter spec editCompleteCompletewhen spec is unchangedCompleteCompletelast-scheduled-generationannotation set on schedule1All tests passed on both envtest and live cluster.
CI status
build (unit + integration tests)
deploy
ci/prow/images
lintAllTheThings
Gosec Security Scanner
ci/prow/trustyai-service-operator-e2e
Trivy Security Scan: `go.opentelemetry.io/otel` (CVE-2026-29181) and `urllib3` in `tests/Pipfile.lock` (CVE-2026-44431, CVE-2026-44432). Failing on all PRs including `main`. Not related to this PR's changes.
🤖 Generated with Claude Code