Integrate NVIDIA NIM Operator into RHOAI #6175

mtalvi wants to merge 27 commits into opendatahub-io:main
Conversation
📝 Walkthrough

Adds NVIDIA NIM Operator support across backend and frontend. Backend: new DashboardConfig flag.

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks: ✅ 3 passed
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx (1)
474-493: ⚠️ Potential issue | 🔴 Critical

Stale state: the storage override for an existing PVC is never applied to the submission.

`setCreateDataInferenceService('storage', ...)` on line 476 enqueues a React state update, but the `createDataInferenceService` reference in this closure still holds the old value. When `getSubmitInferenceServiceResourceFn(createDataInferenceService, ...)` is called on line 484, it receives the pre-update state; the storage override for `use-existing` PVC mode is silently ignored.

Proposed fix: construct the data object directly instead of using the React state setter
```diff
+ let inferenceServiceData = createDataInferenceService;
  if (pvcMode === 'use-existing') {
    // For existing PVC, configure storage to use local path instead of remote URI
-   setCreateDataInferenceService('storage', {
+   inferenceServiceData = {
+     ...createDataInferenceService,
+     storage: {
-     type: InferenceServiceStorageType.EXISTING_URI,
-     path: modelPath,
-     dataConnection: '',
-     uri: modelPath,
-     awsData: EMPTY_AWS_SECRET_DATA,
-   });
+       type: InferenceServiceStorageType.EXISTING_URI,
+       path: modelPath,
+       dataConnection: '',
+       uri: modelPath,
+       awsData: EMPTY_AWS_SECRET_DATA,
+     },
+   };
  }
  const submitInferenceServiceResource = getSubmitInferenceServiceResourceFn(
-   createDataInferenceService,
+   inferenceServiceData,
    editInfo?.inferenceServiceEditInfo,
```
🤖 Fix all issues with AI agents
In `@frontend/src/api/k8s/nimServices.ts`:
- Around line 200-228: assembleNIMService currently clones an existing
nimService and only sets optional fields when new values are present, leaving
stale values in nimService.spec (e.g., spec.resources, spec.env,
spec.nodeSelector, spec.tolerations) when the user clears them; update
assembleNIMService to explicitly remove/clear these fields when their inputs are
absent (for example, if resources is falsy delete nimService.spec.resources, if
envVars.length === 0 delete nimService.spec.env, if !nodeSelector ||
Object.keys(nodeSelector).length === 0 delete nimService.spec.nodeSelector, and
if !tolerations || tolerations.length === 0 delete nimService.spec.tolerations)
so the edited NIMService no longer retains old configuration.
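A minimal, framework-free sketch of the clearing logic described above (the input shape and helper name are illustrative stand-ins, not the real `assembleNIMService` signature or the actual `NIMServiceKind` type):

```typescript
// Simplified stand-in for the relevant slice of NIMService.spec (hypothetical shape).
type NIMServiceSpecSubset = {
  resources?: { limits?: Record<string, string> };
  env?: { name: string; value: string }[];
  nodeSelector?: Record<string, string>;
  tolerations?: { key?: string; operator?: string }[];
};

// Explicitly delete optional fields when their inputs are absent, so a cloned
// NIMService does not silently keep stale configuration from a previous edit.
function clearAbsentOptionalFields(
  spec: NIMServiceSpecSubset,
  input: {
    resources?: { limits?: Record<string, string> };
    envVars: { name: string; value: string }[];
    nodeSelector?: Record<string, string>;
    tolerations?: { key?: string; operator?: string }[];
  },
): NIMServiceSpecSubset {
  if (!input.resources) {
    delete spec.resources;
  }
  if (input.envVars.length === 0) {
    delete spec.env;
  }
  if (!input.nodeSelector || Object.keys(input.nodeSelector).length === 0) {
    delete spec.nodeSelector;
  }
  if (!input.tolerations || input.tolerations.length === 0) {
    delete spec.tolerations;
  }
  return spec;
}
```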
In `@frontend/src/concepts/hardwareProfiles/const.ts`:
- Line 2: Update the import to follow project convention by removing the .ts
extension: change the import that currently reads importing
HardwareProfileFeatureVisibility and InferenceServiceKind from '#~/k8sTypes.ts'
so it imports from '#~/k8sTypes' instead; modify the import statement that
references k8sTypes accordingly so other modules remain consistent.
In `@frontend/src/pages/modelServing/screens/global/nimOperatorUtils.ts`:
- Around line 209-224: The hook useInferenceServiceDisplayName re-triggers on
unstable object references and allows stale async results to overwrite state;
change the effect to depend on stable primitive keys (e.g.,
inferenceService.metadata.name and inferenceService.metadata.namespace or
another unique id returned by getDisplayNameFromK8sResource) instead of the
entire inferenceService object, and add an abort/stale-guard inside the effect
(e.g., a cancelled flag or incrementing request token checked before calling
setDisplayName, and cleaned up in the effect's return) so that only the latest
getInferenceServiceDisplayName result updates state; keep the initial state from
getDisplayNameFromK8sResource and reference getInferenceServiceDisplayName,
useInferenceServiceDisplayName, and setDisplayName when locating the code to
modify.
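The incrementing-request-token idea above can be illustrated outside React with a plain counter (names here are illustrative, not from the codebase):

```typescript
// Each new request bumps the token; a resolved result is applied only if its
// token is still the latest at resolution time, so stale responses are dropped.
function createLatestGuard() {
  let latest = 0;
  return {
    next(): number {
      return ++latest;
    },
    isCurrent(token: number): boolean {
      return token === latest;
    },
  };
}
```

Inside the effect, you would capture `const token = guard.next()` before the async call and check `guard.isCurrent(token)` before calling `setDisplayName`.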
In
`@frontend/src/pages/modelServing/screens/projects/nim/__tests__/nimUtils.spec.ts`:
- Around line 392-410: The test uses inconsistent apiVersion values for
NIMService mocks: update the inline anotherNIMService to match
createMockNIMService()'s apiVersion ('apps.nvidia.com/v1alpha1') so all test
fixtures are consistent; locate the anotherNIMService object in nimUtils.spec.ts
and change its apiVersion to 'apps.nvidia.com/v1alpha1' (also review other mocks
returned by listNIMServices to ensure they all use the same apiVersion).
In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx`:
- Around line 535-546: The PVC description is read using the wrong annotation
key in ManageNIMServingModal.tsx causing openshift.io/description to be lost on
update; change the extraction of the description when building updatePvcData to
read pvc.metadata.annotations?.['openshift.io/description'] (or fallback to ''
only if that specific key is missing) so updatePvc(...) receives the correct
description and updatePvc (frontend/src/api/k8s/pvcs.ts) no longer clears the
openshift.io/description annotation; ensure you update the reference in the
block where updatePvcData is constructed inside ManageNIMServingModal (around
the conditional that compares pvc.spec.resources.requests.storage to pvcSize).
- Around line 326-328: The current parsing of createDataServingRuntime.imageName
using split(':') incorrectly breaks images with registry ports; update the logic
that sets imageRepository and imageTag in ManageNIMServingModal (the variables
imageRepository and imageTag) to split on the last colon only: find the
lastIndexOf(':') on createDataServingRuntime.imageName, if no colon treat the
whole string as repository and default tag to 'latest', otherwise take substring
before the last colon as imageRepository and substring after as imageTag; ensure
empty/null checks for createDataServingRuntime.imageName are preserved.
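A sketch of the suggested parsing (hypothetical helper name; note that a registry port with no tag, e.g. `registry:5000/repo`, remains ambiguous under a plain last-colon split):

```typescript
// Split an image reference on the LAST colon so the tag survives when the
// registry host includes a port (e.g. 'registry:5000/repo:1.2').
function splitImageName(imageName: string): { repository: string; tag: string } {
  const lastColon = imageName.lastIndexOf(':');
  if (lastColon === -1) {
    // No colon at all: the whole string is the repository, default the tag.
    return { repository: imageName, tag: 'latest' };
  }
  return {
    repository: imageName.substring(0, lastColon),
    tag: imageName.substring(lastColon + 1),
  };
}
```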
- Around line 378-380: The code is using unnecessary dynamic imports for
getNIMService and getInferenceService; statically import them instead alongside
the already imported createNIMService and updateNIMService from the same module.
Update the import block that currently includes createNIMService and
updateNIMService to also import getNIMService and getInferenceService, then
remove the dynamic import calls and directly call getNIMService(nimServiceName,
namespace) and getInferenceService(inferenceServiceName, namespace) where they
are used (lines around the existing dynamic imports).
- Around line 507-511: The submitNIMService path is missing the InferenceService
restart that submitLegacyMode performs: after the NIMService update (the
NIMService patch/update call in submitNIMService) add the same logic used in
submitLegacyMode to detect editInfo?.inferenceServiceEditInfo and call await
patchInferenceServiceStoppedStatus(editInfo.inferenceServiceEditInfo, 'false')
so the InferenceService is restarted on NIM Operator edits; ensure the call is
awaited and any existing error handling pattern around
patchInferenceServiceStoppedStatus is followed for consistency.

In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/NIMPVCSelector.tsx`:
- Around line 63-79: The effect that forces modelPath should only run when the
mode changes, not on every modelPath change; update the first React.useEffect to
depend only on nimServicesEnabled (remove modelPath and setModelPath from its
dependency array) and keep the conditional setModelPath(correctPath) so it runs
once on mode switch; remove the second redundant effect that auto-sets when
existingPvcName is present; if the intended behavior is to lock the input in
both modes instead, ensure the input's isDisabled prop (currently tied to
nimServicesEnabled) is updated accordingly instead of changing the effects.
In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/useNIMCompatiblePVCs.ts`:
- Around line 72-84: The function extractPVCFromNIMService contains a redundant
startsWith('nim-pvc') branch that returns the same pvcName in both branches;
simplify by removing the conditional and directly return the pvc name (or null
if missing) — update the function extractPVCFromNIMService to compute pvcName
from nimService.spec.storage?.pvc?.name, return null when falsy, otherwise
return pvcName without the startsWith check.
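The simplified function reduces to a single optional-chain lookup (the type here is a trimmed stand-in for the real `NIMServiceKind`):

```typescript
type NIMServiceLike = {
  spec?: { storage?: { pvc?: { name?: string } } };
};

// Return the PVC name referenced by a NIMService, or null when none is set.
// No startsWith('nim-pvc') special case: both branches returned the same value.
function extractPVCFromNIMService(nimService: NIMServiceLike): string | null {
  return nimService.spec?.storage?.pvc?.name ?? null;
}
```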
- Around line 106-116: The code in parseNimModelFromImage (variable parsed) uses
parsed.lastIndexOf('--') to strip a namespace prefix which can mis-handle model
names that themselves contain '--'; change the logic to use parsed.indexOf('--')
(first occurrence) to detect the namespace separator and then return
parsed.substring(firstIndex + 2) when indexOf('--') !== -1, otherwise return
parsed as-is; update the block that currently references lastIndexOf('--') to
use indexOf('--') so namespace stripping only removes the leading segment.
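The first-occurrence split can be sketched like this (hypothetical helper name, mirroring the logic described above):

```typescript
// Strip only the leading namespace segment (before the FIRST '--'), so model
// names that themselves contain '--' are preserved intact.
function stripNamespacePrefix(parsed: string): string {
  const firstIndex = parsed.indexOf('--');
  return firstIndex === -1 ? parsed : parsed.substring(firstIndex + 2);
}
```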
In `@frontend/src/pages/modelServing/screens/projects/useNIMServicesEnabled.ts`:
- Around line 30-40: Delete the contradictory block of comments that claims the
logic needs inversion then immediately contradicts itself (the lines that
mention "invert the logic" and the subsequent correction about
useIsAreaAvailable), leaving only the concise explanation that
useIsAreaAvailable already yields status: true when NIM services are enabled;
keep the existing JSDoc/comments that describe disableNIMServices semantics
(references: useIsAreaAvailable, disableNIMServices, NIM Services) so the file
no longer contains confusing leftover debugging notes.
In `@frontend/src/pages/modelServing/screens/projects/utils.ts`:
- Around line 196-206: The effect calling getInferenceServiceDisplayName in
React.useEffect (dependent on existingData and isNIMManaged) can update state
after the component has re-rendered with new props; add a local "isMounted" (or
"cancelled") boolean flag inside the effect, set it true initially and flip to
false in the cleanup, then only call setExistingName(name) when the flag
indicates the effect is still current; ensure the .catch branch also checks the
flag before any state or side-effect and leave the console.error for debugging.
🧹 Nitpick comments (17)
frontend/src/pages/modelServing/useInferenceServiceStatus.ts (2)
46-70: Stale closure: `modelPod` inside the `setInterval` callback never reflects updates.

The `modelPod` captured by the interval callback is the value from when the effect last ran. The `if (!modelPod)` check on line 50 cannot observe the result of `await refreshModelPodStatus()`; it always sees the stale closure value. The polling happens to work because `modelPod` is in the dependency array, so when it changes React re-runs the effect, but the explicit check inside the callback is misleading and is never the thing that actually stops polling.

Consider using a ref to track the latest `modelPod`, or restructuring so the stop condition is evaluated outside the interval:

♻️ Suggested approach using a ref

```diff
+ const modelPodRef = React.useRef(modelPod);
+ modelPodRef.current = modelPod;
+
  // Manual polling when isStopping is true
  React.useEffect(() => {
    if (isStopping) {
      const interval = setInterval(async () => {
        await refreshModelPodStatus();
-       if (!modelPod) {
+       if (!modelPodRef.current) {
          setIsStopping(false);
        }
      }, FAST_POLL_INTERVAL);
```
39-43: Object-reference dependency may cause unnecessary refresh calls.

`inferenceService.status?.modelStatus?.states` is an object. If the parent re-renders with a new `inferenceService` object (common with k8s watch/polling), this effect fires even when the state values haven't changed, triggering a redundant `refreshModelPodStatus()` call each time.

Consider depending on the specific primitive values (`activeModelState`, `targetModelState`) instead:

♻️ Suggested fix

```diff
- React.useEffect(() => {
-   refreshModelPodStatus();
- }, [inferenceService.status?.modelStatus?.states, refreshModelPodStatus]);
+ const activeModelState = inferenceService.status?.modelStatus?.states?.activeModelState;
+ const targetModelState = inferenceService.status?.modelStatus?.states?.targetModelState;
+
+ React.useEffect(() => {
+   refreshModelPodStatus();
+ }, [activeModelState, targetModelState, refreshModelPodStatus]);
```

backend/src/utils/constants.ts (1)
81-81: Consider rephrasing the inline comment.

The comment "Set to false for testing NIM Operator integration" reads as developer instructions rather than documenting the flag's purpose. Other `disable*` flags in this block don't have such comments. Consider either removing it or rephrasing to describe what the flag controls (e.g., `// NIM Operator integration - creates NIMService instead of InferenceService when false`), matching the style used in the manifest.

packages/kserve/src/modelFormat.ts (1)
2-6: Cross-package import with lint suppression warrants attention.

The `eslint-disable` for `@odh-dashboard/no-restricted-imports` suggests this import violates the intended package boundary between `packages/kserve` and `@odh-dashboard/internal`. Consider whether `isNIMOperatorManaged` and `getModelNameFromNIMInferenceService` should be exposed through a shared utility package or a proper public API surface of the internal package, rather than suppressing the lint rule.

If this is intentional and temporary (given the WIP nature of this PR), a `// TODO` explaining the plan to resolve this would be helpful.

frontend/src/pages/projects/screens/detail/overview/serverModels/deployedModels/DeployedModelCard.tsx (1)
39-40: Consider optimizing the `useInferenceServiceDisplayName` hook's dependency array.

The hook uses `[inferenceService]` as its `useEffect` dependency. Since this is an object reference, if the parent component creates a new reference on each render (common when fetching K8s resources), the effect will unnecessarily re-trigger and refetch the display name. Update the dependency to use a stable identifier instead, such as `inferenceService.metadata.uid` combined with `inferenceService.metadata.namespace`, to prevent unnecessary async calls.

frontend/src/concepts/hardwareProfiles/useServingHardwareProfileConfig.ts (1)
19-28: Duplicated `as any` cast for NIM predictor containers; consider a shared accessor.

The same `as any` cast pattern to access `predictor.containers[0].resources` appears here, in `getModelNameFromNIMInferenceService` (nimOperatorUtils.ts lines 149-150), and in `getInferenceServiceHardwareProfilePaths` (const.ts). A single typed helper (e.g., `getNIMPredictorResources(inferenceService)`) in `nimOperatorUtils.ts` would eliminate the repeated unsafe casts and keep the NIM-specific predictor shape in one place.

frontend/src/k8sTypes.ts (1)
1631-1996: Well-documented CRD type definition; consider extracting the repeated probe shape.

The liveness, readiness, and startup probe blocks (lines 1832–1923) share an identical structure. Extracting a `NIMServiceProbeConfig` type would reduce ~90 lines of duplication.

♻️ Suggested type extraction

```diff
+type NIMServiceProbeConfig = {
+  enabled?: boolean;
+  probe?: {
+    httpGet?: {
+      path?: string;
+      port: number | string;
+      host?: string;
+      scheme?: string;
+      httpHeaders?: { name: string; value: string }[];
+    };
+    exec?: { command?: string[] };
+    tcpSocket?: { port: number | string; host?: string };
+    grpc?: { port: number; service?: string };
+    initialDelaySeconds?: number;
+    periodSeconds?: number;
+    timeoutSeconds?: number;
+    failureThreshold?: number;
+    successThreshold?: number;
+    terminationGracePeriodSeconds?: number;
+  };
+};
+
 // Then in NIMServiceKind.spec:
-  livenessProbe?: { ... };
-  readinessProbe?: { ... };
-  startupProbe?: { ... };
+  livenessProbe?: NIMServiceProbeConfig;
+  readinessProbe?: NIMServiceProbeConfig;
+  startupProbe?: NIMServiceProbeConfig;
```

frontend/src/pages/modelServing/screens/projects/ServingRuntimeDetails.tsx (1)
44-53: Consider extracting a shared utility for NIM predictor resource extraction instead of duplicating `as any` casts.

The `as any` cast on `isvc.spec.predictor` to access `containers[0].resources` appears here and in `utils.ts` (line 257). A shared utility (e.g., `getNIMPredictorResources(isvc)`) would centralize the unsafe cast and make it easier to replace once `InferenceServiceKind` types are extended for NIM operator predictor shapes.

♻️ Example utility

```typescript
// In nimOperatorUtils.ts
export const getNIMPredictorContainerResources = (
  inferenceService: InferenceServiceKind,
): ContainerResources | undefined => {
  // eslint-disable-next-line @typescript-eslint/consistent-type-assertions, @typescript-eslint/no-explicit-any
  const predictor = inferenceService.spec.predictor as any;
  return predictor?.containers?.[0]?.resources;
};
```

Then in ServingRuntimeDetails.tsx:

```diff
- if (isNIMManaged) {
-   // NIM Operator uses containers instead of model spec
-   // eslint-disable-next-line @typescript-eslint/consistent-type-assertions, @typescript-eslint/no-explicit-any
-   const predictor = isvc.spec.predictor as any;
-   resources = predictor?.containers?.[0]?.resources;
- } else {
+ if (isNIMManaged) {
+   resources = getNIMPredictorContainerResources(isvc);
+ } else {
```

frontend/src/pages/modelServing/screens/projects/utils.ts (1)
247-264: Duplicated `as any` cast on predictor for NIM container access.

Same pattern as ServingRuntimeDetails.tsx. Consolidating into a shared utility (e.g., `getNIMPredictorContainerEnvVars`) would reduce duplication and isolate the unsafe cast.

frontend/src/pages/modelServing/screens/global/DeleteInferenceServiceModal.tsx (1)
54-86: NIM Operator deletion path looks correct overall, but the `isKServeNIMEnabled` guard may be overly restrictive.

If an InferenceService has a NIMService ownerReference (line 57), NIM is definitively in use in that project. The `isKServeNIMEnabled && project` check on line 65 could skip resource cleanup for NIM Operator deployments in projects where the annotation `opendatahub.io/nim-support` is missing (e.g., stale projects, migration scenarios). This would leave orphaned PVCs/secrets behind silently, with only a skipped code path and no warning logged.

Consider whether `nimServiceOwner` being present is sufficient to proceed with cleanup, or at minimum log a warning when skipping.

frontend/src/__mocks__/mockNimService.ts (1)
72-93: When `pvcName` is `undefined`, `spec.storage.pvc` still includes `name: undefined`.

If a caller passes `pvcName: undefined` (like the test at useNIMCompatiblePVCs.spec.ts line 228), the resulting object has `storage: { pvc: { name: undefined } }` rather than omitting the `pvc` key entirely. This doesn't match a real NIMService without PVC storage. The test at line 232 works around this by manually overriding `spec.storage = {}`, but it would be cleaner if the mock conditionally omitted the `pvc` block when `pvcName` is not provided.

♻️ Suggested improvement

```diff
  storage: {
-   pvc: {
-     name: pvcName,
-     ...(pvcSubPath && { subPath: pvcSubPath }),
-   },
+   ...(pvcName != null && {
+     pvc: {
+       name: pvcName,
+       ...(pvcSubPath && { subPath: pvcSubPath }),
+     },
+   }),
  },
```

frontend/src/pages/modelServing/screens/projects/nim/nimUtils.ts (2)
333-449: Significant code duplication: `getNIMOperatorResourcesToDelete` duplicates ~80% of `getNIMResourcesToDelete`.

The PVC deletion logic (reference counting, two-tier Dashboard-managed check, BYO-PVC preservation) is copy-pasted between the two functions (compare lines 357-418 with lines 227-295). The only difference is how the PVC name is extracted (from `nimService.spec.storage?.pvc?.name` vs. `servingRuntime.spec.volumes`). The same applies to the inference count fetch and error handling at the top.

Consider extracting the shared PVC-cleanup and secret-cleanup logic into helper functions that both callers can use, passing only the extracted `pvcName`, `nimSecretName`, and `imagePullSecretName` values.

♻️ Sketch of a refactor

```diff
+// Shared helper for PVC cleanup decision
+const deleteIfDashboardManagedPVC = async (
+  pvcName: string,
+  projectName: string,
+  currentDeploymentName: string,
+): Promise<void | undefined> => {
+  const pvcUsage = await checkPVCUsage(pvcName, projectName);
+  if (pvcUsage.count <= 1) {
+    const pvc = await getPvc(projectName, pvcName);
+    const hasNewManagedLabel = pvc.metadata.labels?.['opendatahub.io/managed'] === 'true';
+    const isDashboardNamingPattern = /^nim-pvc-.+$/.test(pvcName);
+    if (hasNewManagedLabel || isDashboardNamingPattern) {
+      await deletePvc(pvcName, projectName);
+    }
+  }
+};
```

Then both `getNIMResourcesToDelete` and `getNIMOperatorResourcesToDelete` would call this helper with the respective PVC name.
170-186: Fragile error-message check for NIM Operator availability.

Line 182 uses `nimError.message.includes('not found')` to determine whether the NIM Operator is installed. K8s API error messages vary across cluster versions and configurations. Consider checking the HTTP status code (e.g., 404) if available, or treating all errors from `listNIMServices` as non-fatal for the PVC usage count.

frontend/src/pages/modelServing/screens/global/InferenceServiceTableRow.tsx (1)
60-62: The `useInferenceServiceDisplayName` hook refetch pattern is safe in this context; consider adding abort logic for robustness.

The hook does re-run its `useEffect` when the `inferenceService` reference changes (line 221 dependency), and it lacks cleanup/abort handling. However, since the parent component passes stable references from the table data array (which maintains object identity until actual K8s changes occur), unnecessary refetches are unlikely in practice.

For defensive programming, consider adding an `AbortController` to cancel in-flight requests if the component unmounts or the prop changes before the async call completes. This prevents stale-state updates if the parent ever recreates objects.

frontend/src/pages/modelServing/screens/global/nimOperatorUtils.ts (2)
173-199: `getInferenceServiceDisplayName` swallows errors silently.

The `console.warn` on line 190 is acceptable for fallback behavior, but in production this could mask real issues (e.g., permissions, network). Consider whether the caller should be notified of partial failure.
145-163: Add `containers` field to `InferenceServiceKind.spec.predictor` type to eliminate unsafe `as any` casts.

The `InferenceServiceKind` type definition in `k8sTypes.ts` (lines 522–557) does not include a `containers` field on `spec.predictor`, forcing NIM Operator-related code to cast to `as any`. This pattern appears in at least 5 files: `nimOperatorUtils.ts`, `ManageNIMServingModal.tsx`, `useServingHardwareProfileConfig.ts`, `ServingRuntimeDetails.tsx`, and `utils.ts`.

Extend the `InferenceServiceKind` type to include an optional `containers` field:

Suggested type extension

```typescript
// In InferenceServiceKind.spec.predictor
containers?: Array<{
  name?: string;
  image?: string;
  env?: Array<{ name: string; value?: string }>;
  resources?: ContainerResources;
}>;
```

This change would eliminate the need for `as any` casts across multiple files and improve type safety without breaking existing code.

frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx (1)
398-444: Polling loop can block the UI for up to 30 seconds with no cancellation.

The polling loop (30 attempts × 1s) blocks the submit promise chain. During this time the user sees only a spinner with no progress indication, and there's no way to cancel the operation (closing the modal doesn't abort the in-flight polling).

Key concerns:

- No `AbortSignal` or cleanup: if the component unmounts (user closes the modal), the polling continues and `setDisplayName`/state updates on an unmounted component may occur.
- The timeout is arbitrary: if the NIM Operator is slow or misconfigured, users get a vague error after 30s.
- No feedback to the user that polling is in progress.

Consider:

- Using a ref-based cancellation flag that's cleared in the modal's cleanup.
- Providing user feedback (e.g., "Waiting for NIM Operator to provision...").
- Alternatively, deferring token auth setup to a separate reconciliation step instead of blocking the submission.
[APPROVALNOTIFIER] This PR is NOT APPROVED
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx`:
- Around line 406-424: The polling loop that waits for getInferenceService in
the create flow (when createDataInferenceService.tokenAuth is true) blocks the
UI for up to 30s with no user feedback; update ManageNIMServingModal to surface
a polling status and make token-auth setup non-blocking: when
createDataInferenceService.tokenAuth is set, start the existing polling logic in
a background async helper (e.g., spawn a promise from a new function like
waitForInferenceServiceAndHandleResult) instead of awaiting it inline so the
modal can immediately update UI, set actionInProgress true and also set a new
statusMessage state (e.g., "Waiting for NIM Operator to provision
InferenceService...") while polling, update the UI to render that statusMessage
where the spinner is shown, and ensure errors from getInferenceService are
handled and reported to the user (or toasts) after the background task
completes.
🧹 Nitpick comments (15)
frontend/src/pages/modelServing/useInferenceServiceStatus.ts (1)
39-43: `useEffect` dependency on object reference will fire more often than intended.

`inferenceService.status?.modelStatus?.states` is an object, so React's shallow comparison treats every new `inferenceService` instance (from polling / watch) as a change, even when `activeModelState` and `targetModelState` haven't actually changed. This will call `refreshModelPodStatus()` on every upstream data refresh, not just on genuine state transitions.

Depend on the primitive state strings instead:

Proposed fix

```diff
- React.useEffect(() => {
-   refreshModelPodStatus();
- }, [inferenceService.status?.modelStatus?.states, refreshModelPodStatus]);
+ const activeModelState = inferenceService.status?.modelStatus?.states?.activeModelState;
+ const targetModelState = inferenceService.status?.modelStatus?.states?.targetModelState;
+
+ React.useEffect(() => {
+   refreshModelPodStatus();
+ }, [activeModelState, targetModelState, refreshModelPodStatus]);
```
isIncorrectlyReportedAsLoadedmemo (line 125) andisNewlyDeployedmemo (line 103). Those only cause unnecessary recomputation (no side effects), so they're less urgent, but you may want to align them for consistency.frontend/src/concepts/hardwareProfiles/useServingHardwareProfileConfig.ts (1)
14-28: Duplicated NIM resource-extraction logic; consider sharing a helper.

The `as any` cast + `containers[0].resources` access is repeated here and in `getInferenceServiceHardwareProfilePaths` (in const.ts) and `getModelNameFromNIMInferenceService` (in nimOperatorUtils.ts). A small shared helper (e.g., `getNIMPredictorContainer(inferenceService)`) that encapsulates the unsafe cast and returns the first container (or `undefined`) would reduce duplication and confine the `any` escape hatch to one place.

Not blocking, but worth tracking for a follow-up.
frontend/src/concepts/areas/const.ts (1)
182-191: NIM_SERVICES area entry looks good.

The dependency chain (`NIM_SERVICES` → `NIM_MODEL` → `K_SERVE`) correctly ensures NIM services are only available when both NIM model serving and KServe are enabled.

Nit: The comment on line 186 ("This flag is inverted") describes the standard convention for all `disable*` flags in this file, not something unique to this flag. Consider removing that line to avoid implying this flag is special.

frontend/src/pages/modelServing/screens/projects/nim/__tests__/nimUtils.spec.ts (1)
221-244: `...overrides` at the end silently replaces the merged `spec`.

The spread order means when `overrides` includes a `spec` property, the shallow merge on line 241 (`...overrides?.spec`) is discarded and the whole `spec` is replaced by `overrides.spec`. This doesn't break current tests (callers supply all required `spec` fields), but it makes the helper misleading: it looks like it merges spec fields but actually doesn't.

Consider placing `...overrides` before the `spec:` key, or restructuring:

Suggested fix

```diff
 const createMockNIMService = (overrides?: Partial<NIMServiceKind>): NIMServiceKind => ({
   apiVersion: 'apps.nvidia.com/v1alpha1',
   kind: 'NIMService',
   metadata: {
     name: 'test-nim-service',
     namespace: projectName,
+    ...overrides?.metadata,
   },
   spec: {
     image: {
       repository: 'nvcr.io/nim/meta/llama-3.1-8b-instruct',
       tag: '1.8.5',
       pullSecrets: ['ngc-secret'],
     },
     authSecret: 'nvidia-nim-secrets',
     storage: {
       pvc: {
         name: 'nim-pvc-xyz89',
       },
     },
     replicas: 1,
     ...overrides?.spec,
   },
-  ...overrides,
 });
```

frontend/src/pages/modelServing/screens/projects/ServingRuntimeDetails.tsx (1)
44-53: Unsafe `as any` cast on predictor to access NIM Operator containers.

The NIM Operator places `containers` directly on the predictor, which doesn't match `InferenceServiceKind`'s typed spec. The `as any` cast works but hides structural mismatches at compile time. Consider extracting a typed helper (similar to `getModelNameFromNIMInferenceService` in `nimOperatorUtils.ts`, which uses the same pattern) or defining a shared utility that encapsulates the NIM predictor shape, to keep these casts centralized.

frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/NIMPVCSelector.tsx (1)
164-178: Use a PatternFly `Button` with `variant="link"` instead of a styled `<button>`.

The inline-styled `<button>` elements (lines 164–178 and 195–209) manually replicate link styling. PatternFly's `<Button variant="link">` provides consistent theming, accessibility (focus ring, ARIA), and eliminates the inline `style` object.

frontend/src/k8sTypes.ts (1)
1831-1923: Extract shared probe type to reduce duplication.

`livenessProbe`, `readinessProbe`, and `startupProbe` all share an identical inner `probe` structure (~30 lines each). Consider extracting a `NIMServiceProbe` type and reusing it.

♻️ Suggested refactor

```diff
+type NIMServiceProbeSpec = {
+  httpGet?: {
+    path?: string;
+    port: number | string;
+    host?: string;
+    scheme?: string;
+    httpHeaders?: { name: string; value: string }[];
+  };
+  exec?: { command?: string[] };
+  tcpSocket?: { port: number | string; host?: string };
+  grpc?: { port: number; service?: string };
+  initialDelaySeconds?: number;
+  periodSeconds?: number;
+  timeoutSeconds?: number;
+  failureThreshold?: number;
+  successThreshold?: number;
+  terminationGracePeriodSeconds?: number;
+};
+
+type NIMServiceProbeConfig = {
+  enabled?: boolean;
+  probe?: NIMServiceProbeSpec;
+};
```

Then use `NIMServiceProbeConfig` for all three probe fields.

frontend/src/pages/modelServing/screens/global/InferenceServiceTableRow.tsx (1)
60-62: `useInferenceServiceDisplayName` triggers an async fetch per table row.

For NIM-managed deployments, each row fires an individual `getNIMService` API call to resolve the display name. In a table with many NIM-managed rows, this could result in a burst of requests and a visible flash from fallback → resolved name. Consider batching or caching NIMService lookups at the table/page level if this becomes a user-facing issue.

frontend/src/pages/modelServing/screens/projects/nim/nimUtils.ts (2)
333-449: Significant duplication withgetNIMResourcesToDelete— extract shared PVC deletion logic.
getNIMOperatorResourcesToDelete(lines 333-449) duplicates ~90% of the PVC deletion logic fromgetNIMResourcesToDelete(lines 203-323): the inference count fetch, two-tier PVC ownership check, Dashboard-managed pattern matching, and secret deletion guard. Only the source ofpvcNameand secret names differs.Consider extracting the shared logic into a helper, e.g.:
Sketch of a shared helper
```diff
+async function deleteNIMPVCIfUnused(
+  pvcName: string,
+  projectName: string,
+  currentDeploymentName: string,
+): Promise<Promise<void>[]> {
+  // shared two-tier PVC check + deletion logic
+}
+
+async function deleteNIMSecretsIfLast(
+  projectName: string,
+  nimSecretName: string | undefined,
+  imagePullSecretName: string | undefined,
+  inferenceCount: number,
+): Promise<Promise<void>[]> {
+  // shared secret deletion guard
+}
```

This would reduce maintenance burden and ensure both paths stay in sync as the logic evolves.
170-186: NIM Operator not-installed error handling looks reasonable but consider tightening the check.
The catch block on line 182 checks `nimError.message.includes('not found')` to suppress expected errors when the NIM Operator CRD isn't installed. However, a 404 for a specific NIMService resource would also contain "not found" and be silently swallowed. Since this is a list operation that would fail with a CRD-not-found error, this is likely fine in practice, but a more specific check (e.g., matching on HTTP status code or CRD-related error patterns) would be more robust.

frontend/src/pages/modelServing/screens/global/__tests__/nimOperatorUtils.spec.ts (2)
319-367: `filterNIMSystemEnvVars` tests don't cover all 7 system env vars.
The "should filter out system-managed NIM environment variables" test only includes `NIM_CACHE_PATH`, `NGC_API_KEY`, and `NIM_SERVER_PORT`, but the implementation's `NIM_SYSTEM_ENV_VARS` list also includes `OUTLINES_CACHE_DIR`, `NIM_HTTP_API_PORT`, `NIM_JSONL_LOGGING`, and `NIM_LOG_LEVEL`. If any of those entries were accidentally removed from the constant, these tests would still pass.

Consider adding them to the test input to ensure full coverage of the filter list.
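For illustration, a hypothetical extension of the test input built from the seven names listed in this comment (the commented-out expectation assumes `filterNIMSystemEnvVars` drops exactly those names, which should be verified against the implementation):

```typescript
// Hypothetical Jest-style test input; the seven names come from the review above,
// so removing any entry from NIM_SYSTEM_ENV_VARS would fail the expectation.
const systemVarNames = [
  'NIM_CACHE_PATH',
  'NGC_API_KEY',
  'NIM_SERVER_PORT',
  'OUTLINES_CACHE_DIR',
  'NIM_HTTP_API_PORT',
  'NIM_JSONL_LOGGING',
  'NIM_LOG_LEVEL',
];

const testInput = [
  ...systemVarNames.map((name) => ({ name, value: 'x' })),
  { name: 'USER_VAR', value: 'y' }, // the only entry that should survive filtering
];

// expect(filterNIMSystemEnvVars(testInput)).toEqual([{ name: 'USER_VAR', value: 'y' }]);
```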
1-368: Good coverage for pure utilities; consider adding tests for `getInferenceServiceDisplayName` and `useInferenceServiceDisplayName`.
The async function `getInferenceServiceDisplayName` and the React hook `useInferenceServiceDisplayName` are not covered. These have non-trivial logic (NIMService fetch with fallback, stale-guard in the effect) that would benefit from unit tests, especially the error/fallback path.

frontend/src/pages/modelServing/screens/global/nimOperatorUtils.ts (3)
20-38: Minor: redundant `v1alpha1` exact match.
Line 26 checks `ref.apiVersion === 'apps.nvidia.com/v1alpha1'` but line 27 already covers it with `ref.apiVersion.startsWith('apps.nvidia.com/')`. The explicit check is redundant but harmless — it may serve as documentation of the currently expected version.
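If the exact-version check is dropped, the condition collapses to a single prefix match. A sketch only — the owner-reference shape is assumed from standard Kubernetes `metadata.ownerReferences` entries, not taken from this codebase:

```typescript
// Sketch: a NIMService owner is identified by kind plus API group prefix,
// so future API versions (e.g. v1beta1) still match.
type OwnerReference = { apiVersion: string; kind: string; name?: string };

const isNIMServiceOwner = (ref: OwnerReference): boolean =>
  ref.kind === 'NIMService' && ref.apiVersion.startsWith('apps.nvidia.com/');
```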
62-70: Consider keeping `NIM_SYSTEM_ENV_VARS` as a `Set` for O(1) lookups.
Currently `NIM_SYSTEM_ENV_VARS` is an array, and `filterNIMSystemEnvVars` uses `.includes()`, which is O(n) per check. With 7 entries this is negligible, but a `Set` would be more idiomatic for membership checks and scales better if the list grows.
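A minimal sketch of the `Set`-based variant; the entry list is copied from the names mentioned in this review (the real constant lives in `nimOperatorUtils.ts`):

```typescript
type EnvVar = { name: string; value?: string };

// Set membership is O(1) per lookup; names below are the seven from the review.
const NIM_SYSTEM_ENV_VARS = new Set([
  'NIM_CACHE_PATH',
  'NGC_API_KEY',
  'NIM_SERVER_PORT',
  'OUTLINES_CACHE_DIR',
  'NIM_HTTP_API_PORT',
  'NIM_JSONL_LOGGING',
  'NIM_LOG_LEVEL',
]);

// Keep only user-defined env vars; system-managed ones are dropped.
const filterNIMSystemEnvVars = (envVars: EnvVar[]): EnvVar[] =>
  envVars.filter((envVar) => !NIM_SYSTEM_ENV_VARS.has(envVar.name));
```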
145-163: The `as any` cast is necessary but fragile — document the NIM Operator predictor schema deviation.
The NIM Operator creates InferenceServices with `spec.predictor.containers[]` instead of the standard `spec.predictor.model`. This is a schema deviation that isn't captured by `InferenceServiceKind`. A type extension or a comment referencing the NIM Operator documentation would help future maintainers understand why this cast is needed.
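One possible shape for such a type extension — a sketch only; field names beyond `containers` are assumptions, not taken from the NIM Operator CRD:

```typescript
// Hypothetical narrowing type for NIM-Operator-managed InferenceServices, which
// use spec.predictor.containers[] rather than the standard spec.predictor.model.
type NIMManagedPredictor = {
  containers?: {
    name: string;
    image?: string; // assumed field
    env?: { name: string; value?: string }[]; // assumed field
  }[];
};

// Using this alongside the existing kind could replace the blanket `as any` cast.
type NIMManagedInferenceServiceSpec = {
  predictor: NIMManagedPredictor;
};
```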
@coderabbitai resolve
✅ Actions performed
Comments resolved. Auto-approval is disabled; enable
The failing tests are not related to my changes. |
andrewballantyne left a comment
I'm partially way through this -- can you help me understand some of these things?
PR needs rebase.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
https://issues.redhat.com/browse/NVPE-314
https://issues.redhat.com/browse/NVPE-315
Description
This PR integrates NVIDIA NIM Operator support into the OpenShift AI Dashboard. When an admin enables the `nimOperatorIntegration: true` feature flag, the Dashboard creates NIMService custom resources instead of manually provisioning ServingRuntime + InferenceService pairs. The NIM Operator then reconciles the NIMService into an InferenceService automatically.

Full info is here.
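For illustration, a minimal NIMService manifest of the kind the Dashboard would create — a hedged sketch, not the exact shape this PR emits; all names and values below are placeholders, and the field layout should be checked against the NIM Operator's NIMService CRD:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: my-nim-deployment              # placeholder name
  namespace: my-data-science-project   # placeholder project namespace
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct  # placeholder NIM image
    tag: latest
  replicas: 1
```

The operator reconciles a resource like this into the ServingRuntime/InferenceService pair the Dashboard previously created by hand.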
Blockers
How Has This Been Tested?
Tested locally
Test Impact
This guide provides instructions for testing the NIM Operator integration in ODH Dashboard.
Prerequisites
OpenShift cluster with NIM Operator installed
Access to the cluster via oc CLI
Setup Steps
1. Enable NIM Services
In the dashboard configuration file `odh-dashboard-org/backend/src/utils/constants.ts`, change `nimOperatorIntegration` from `false` to `true`:

2. Start Backend Server
Open a terminal and run:
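The command block appears to have been lost in extraction. Assuming the standard odh-dashboard npm scripts (script names may differ per branch), the backend is typically started with something like:

```shell
# Assumption: standard odh-dashboard dev workflow; verify against backend/package.json.
cd backend
npm install        # first run only
npm run start:dev  # serves the dashboard API locally
```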
3. Start Frontend Development Server
Open a second terminal and run:
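As above, the original command block is missing. Assuming the standard odh-dashboard npm scripts (an assumption, not taken from this PR):

```shell
# Assumption: standard odh-dashboard dev workflow; verify against frontend/package.json.
cd frontend
npm install        # first run only
npm run start:dev  # dev server that proxies API calls to the local backend
```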
4. Verify Configuration
Once both servers are running, verify the configuration by checking:
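One way to check this — the endpoint path and port are assumptions based on the usual odh-dashboard backend layout, so verify them against your running backend:

```shell
# Assumption: the backend exposes the dashboard config at /api/config on its local port.
curl -s http://localhost:8080/api/config | grep nimOperatorIntegration
```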
Expected result: The response should include `nimOperatorIntegration: true`.

Next Steps
With the development environment running and NIM services enabled, you can now:
Navigate to the model serving section in the dashboard
Test NIM model deployment features
Verify NIM Operator integration functionality
Request review criteria:
Self checklist (all need to be checked):
If you have UI changes:
After the PR is posted & before it merges:
main

Summary by CodeRabbit
New Features
UI
Tests