
Integrate NVIDIA NIM Operator into RHOAI#6175

Open
mtalvi wants to merge 27 commits into opendatahub-io:main from mtalvi:nim-operator

Conversation


@mtalvi mtalvi commented Feb 5, 2026

https://issues.redhat.com/browse/NVPE-314
https://issues.redhat.com/browse/NVPE-315

Description

This PR integrates NVIDIA NIM Operator support into the OpenShift AI Dashboard. When an admin enables the nimOperatorIntegration: true feature flag, the Dashboard creates NIMService custom resources instead of manually provisioning ServingRuntime + InferenceService pairs. The NIM Operator then reconciles the NIMService into an InferenceService automatically.
Full info is here.
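For orientation, the NIMService payload the Dashboard creates under this flag might look roughly like the sketch below. The apiVersion (apps.nvidia.com/v1alpha1) and the spec.storage.pvc shape are taken from this PR's test mocks and review comments; the remaining fields are illustrative stand-ins, not the authoritative CRD.

```typescript
// Illustrative shape of the NIMService custom resource the Dashboard
// would create when nimOperatorIntegration is enabled. Field names come
// from this PR's mocks and review comments; not the real CRD definition.
type NIMServiceSketch = {
  apiVersion: string;
  kind: string;
  metadata: { name: string; namespace: string };
  spec: {
    storage?: { pvc?: { name: string; subPath?: string } };
    env?: { name: string; value?: string }[];
    nodeSelector?: Record<string, string>;
    tolerations?: { key?: string; operator?: string; effect?: string }[];
  };
};

const example: NIMServiceSketch = {
  apiVersion: 'apps.nvidia.com/v1alpha1', // apiVersion used by createMockNIMService
  kind: 'NIMService',
  metadata: { name: 'my-nim-model', namespace: 'my-project' },
  spec: {
    storage: { pvc: { name: 'nim-pvc-my-nim-model' } },
  },
};
```

The NIM Operator then reconciles a resource of this kind into an InferenceService, which is why the Dashboard no longer provisions the ServingRuntime + InferenceService pair itself.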

Blockers

  1. This PR depends on this PR by @TomerFi being merged first.
  2. The NVIDIA NIM Operator must be installed from OLM (using the main tag image, which must include PR #739 and PR #742).

How Has This Been Tested?

Tested locally

Test Impact

This guide provides instructions for testing the NIM Operator integration in the ODH Dashboard.

Prerequisites

OpenShift cluster with NIM Operator installed
Access to the cluster via oc CLI

Setup Steps

1. Enable NIM Services

In the dashboard configuration file odh-dashboard-org/backend/src/utils/constants.ts, change nimOperatorIntegration from false to true:

nimOperatorIntegration: true

2. Start Backend Server

Open a terminal and run:

cd odh-dashboard-org/backend
npm run build
OC_PROJECT=redhat-ods-applications DASHBOARDCONFIG=odh-dashboard-config npm run start:dev

3. Start Frontend Development Server

Open a second terminal and run:

cd odh-dashboard-org/frontend
npm run start:dev

4. Verify Configuration

Once both servers are running, verify the configuration by checking:

curl http://localhost:4010/api/config

Expected result: The response should include nimOperatorIntegration: true
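If you want to script this check, a small helper can validate the JSON body returned by the curl above. The flat top-level nimOperatorIntegration key is an assumption based on the expected result shown here; the real /api/config response may nest the flag differently.

```typescript
// Illustrative helper for step 4: given the JSON body returned by
// `curl http://localhost:4010/api/config`, confirm the flag is set.
// Assumes a flat top-level key, matching the expected result above.
function isNimOperatorEnabled(configJson: string): boolean {
  const config = JSON.parse(configJson) as Record<string, unknown>;
  return config['nimOperatorIntegration'] === true;
}
```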

Next Steps

With the development environment running and NIM services enabled, you can now:

Navigate to the model serving section in the dashboard
Test NIM model deployment features
Verify NIM Operator integration functionality

Request review criteria:

Self checklist (all need to be checked):

  • The developer has manually tested the changes and verified that the changes work
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has added tests or explained why testing cannot be added (unit or cypress tests for related changes)
  • The code follows our Best Practices (React coding standards, PatternFly usage, performance considerations)

If you have UI changes:

  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change.

After the PR is posted & before it merges:

  • The developer has tested their solution on a cluster by using the image produced by the PR to main

Summary by CodeRabbit

  • New Features

    • NVIDIA NIM Operator support: create, update, delete NIMService deployments (token auth, external routing, PVC handling) and a feature flag to enable/disable NIM Services.
  • UI

    • Dashboard displays “NVIDIA NIM” / Operator-managed labels; forms, validation, model/path inputs and deletion flows adapt for operator vs legacy deployments; hardware-profile handling updated for NIM-managed workloads.
  • Tests

    • Extensive coverage added for NIM utilities, PVC discovery, resource cleanup, and related UI rendering.


coderabbitai Bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds NVIDIA NIM Operator support across backend and frontend. Backend: new DashboardConfig flag disableNIMServices. Frontend: new NIMServiceKind types, API client (list/get/create/update/delete), model declaration, extensive mocks and tests, nimOperator utilities and hooks (nimOperatorUtils, useNIMServicesEnabled, useNIMCompatiblePVCs), UI branches and prop changes to treat NIM-managed InferenceServices differently, hardware-profile path selection for NIM-managed services, NIM-specific PVC/secret cleanup logic, and CRD/manifest updates exposing the new feature flag.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 3 passed

  • Title check — ✅ Passed: The title clearly summarizes the main change: integrating NVIDIA NIM Operator into RHOAI, which aligns with the primary objective of this large changeset.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check — ✅ Passed: The pull request description includes all required sections: issue references, detailed description of changes, testing instructions, and a mostly-complete self-checklist covering critical items.




@mtalvi mtalvi marked this pull request as ready for review February 11, 2026 11:48
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress This PR is in WIP state label Feb 11, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx (1)

474-493: ⚠️ Potential issue | 🔴 Critical

Stale state: storage override for existing PVC is never applied to the submission.

setCreateDataInferenceService('storage', ...) on line 476 enqueues a React state update, but the createDataInferenceService reference in this closure still holds the old value. When getSubmitInferenceServiceResourceFn(createDataInferenceService, ...) is called on line 484, it receives the pre-update state — the storage override for use-existing PVC mode is silently ignored.

Proposed fix: construct the data object directly instead of using the React state setter
+    let inferenceServiceData = createDataInferenceService;
     if (pvcMode === 'use-existing') {
       // For existing PVC, configure storage to use local path instead of remote URI
-      setCreateDataInferenceService('storage', {
+      inferenceServiceData = {
+        ...createDataInferenceService,
+        storage: {
-        type: InferenceServiceStorageType.EXISTING_URI,
-        path: modelPath,
-        dataConnection: '',
-        uri: modelPath,
-        awsData: EMPTY_AWS_SECRET_DATA,
-      });
+          type: InferenceServiceStorageType.EXISTING_URI,
+          path: modelPath,
+          dataConnection: '',
+          uri: modelPath,
+          awsData: EMPTY_AWS_SECRET_DATA,
+        },
+      };
     }
     const submitInferenceServiceResource = getSubmitInferenceServiceResourceFn(
-      createDataInferenceService,
+      inferenceServiceData,
       editInfo?.inferenceServiceEditInfo,
🤖 Fix all issues with AI agents
In `@frontend/src/api/k8s/nimServices.ts`:
- Around line 200-228: assembleNIMService currently clones an existing
nimService and only sets optional fields when new values are present, leaving
stale values in nimService.spec (e.g., spec.resources, spec.env,
spec.nodeSelector, spec.tolerations) when the user clears them; update
assembleNIMService to explicitly remove/clear these fields when their inputs are
absent (for example, if resources is falsy delete nimService.spec.resources, if
envVars.length === 0 delete nimService.spec.env, if !nodeSelector ||
Object.keys(nodeSelector).length === 0 delete nimService.spec.nodeSelector, and
if !tolerations || tolerations.length === 0 delete nimService.spec.tolerations)
so the edited NIMService no longer retains old configuration.
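The clearing behavior requested above can be sketched as follows; the types and function names here are minimal stand-ins for illustration, not the repo's real assembleNIMService signature.

```typescript
// Sketch of explicitly clearing optional NIMService spec fields when
// their inputs are absent, so an edit does not retain old configuration.
// The Spec type is a minimal stand-in, not the real NIMService type.
type Spec = {
  resources?: object;
  env?: { name: string; value?: string }[];
  nodeSelector?: Record<string, string>;
  tolerations?: object[];
};

function applyOptionalFields(
  spec: Spec,
  inputs: {
    resources?: object;
    envVars: { name: string; value?: string }[];
    nodeSelector?: Record<string, string>;
    tolerations?: object[];
  },
): Spec {
  const next = { ...spec };
  // Set each field when a value is provided; otherwise delete the stale one.
  if (inputs.resources) next.resources = inputs.resources;
  else delete next.resources;
  if (inputs.envVars.length > 0) next.env = inputs.envVars;
  else delete next.env;
  if (inputs.nodeSelector && Object.keys(inputs.nodeSelector).length > 0)
    next.nodeSelector = inputs.nodeSelector;
  else delete next.nodeSelector;
  if (inputs.tolerations && inputs.tolerations.length > 0)
    next.tolerations = inputs.tolerations;
  else delete next.tolerations;
  return next;
}
```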

In `@frontend/src/concepts/hardwareProfiles/const.ts`:
- Line 2: Update the import to follow project convention by removing the .ts
extension: change the import that currently reads importing
HardwareProfileFeatureVisibility and InferenceServiceKind from '#~/k8sTypes.ts'
so it imports from '#~/k8sTypes' instead; modify the import statement that
references k8sTypes accordingly so other modules remain consistent.

In `@frontend/src/pages/modelServing/screens/global/nimOperatorUtils.ts`:
- Around line 209-224: The hook useInferenceServiceDisplayName re-triggers on
unstable object references and allows stale async results to overwrite state;
change the effect to depend on stable primitive keys (e.g.,
inferenceService.metadata.name and inferenceService.metadata.namespace or
another unique id returned by getDisplayNameFromK8sResource) instead of the
entire inferenceService object, and add an abort/stale-guard inside the effect
(e.g., a cancelled flag or incrementing request token checked before calling
setDisplayName, and cleaned up in the effect's return) so that only the latest
getInferenceServiceDisplayName result updates state; keep the initial state from
getDisplayNameFromK8sResource and reference getInferenceServiceDisplayName,
useInferenceServiceDisplayName, and setDisplayName when locating the code to
modify.

In
`@frontend/src/pages/modelServing/screens/projects/nim/__tests__/nimUtils.spec.ts`:
- Around line 392-410: The test uses inconsistent apiVersion values for
NIMService mocks: update the inline anotherNIMService to match
createMockNIMService()'s apiVersion ('apps.nvidia.com/v1alpha1') so all test
fixtures are consistent; locate the anotherNIMService object in nimUtils.spec.ts
and change its apiVersion to 'apps.nvidia.com/v1alpha1' (also review other mocks
returned by listNIMServices to ensure they all use the same apiVersion).

In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx`:
- Around line 535-546: The PVC description is read using the wrong annotation
key in ManageNIMServingModal.tsx causing openshift.io/description to be lost on
update; change the extraction of the description when building updatePvcData to
read pvc.metadata.annotations?.['openshift.io/description'] (or fallback to ''
only if that specific key is missing) so updatePvc(...) receives the correct
description and updatePvc (frontend/src/api/k8s/pvcs.ts) no longer clears the
openshift.io/description annotation; ensure you update the reference in the
block where updatePvcData is constructed inside ManageNIMServingModal (around
the conditional that compares pvc.spec.resources.requests.storage to pvcSize).
- Around line 326-328: The current parsing of createDataServingRuntime.imageName
using split(':') incorrectly breaks images with registry ports; update the logic
that sets imageRepository and imageTag in ManageNIMServingModal (the variables
imageRepository and imageTag) to split on the last colon only: find the
lastIndexOf(':') on createDataServingRuntime.imageName, if no colon treat the
whole string as repository and default tag to 'latest', otherwise take substring
before the last colon as imageRepository and substring after as imageTag; ensure
empty/null checks for createDataServingRuntime.imageName are preserved.
- Around line 378-380: The code is using unnecessary dynamic imports for
getNIMService and getInferenceService; statically import them instead alongside
the already imported createNIMService and updateNIMService from the same module.
Update the import block that currently includes createNIMService and
updateNIMService to also import getNIMService and getInferenceService, then
remove the dynamic import calls and directly call getNIMService(nimServiceName,
namespace) and getInferenceService(inferenceServiceName, namespace) where they
are used (lines around the existing dynamic imports).
- Around line 507-511: The submitNIMService path is missing the InferenceService
restart that submitLegacyMode performs: after the NIMService update (the
NIMService patch/update call in submitNIMService) add the same logic used in
submitLegacyMode to detect editInfo?.inferenceServiceEditInfo and call await
patchInferenceServiceStoppedStatus(editInfo.inferenceServiceEditInfo, 'false')
so the InferenceService is restarted on NIM Operator edits; ensure the call is
awaited and any existing error handling pattern around
patchInferenceServiceStoppedStatus is followed for consistency.
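The imageName comment above asks for a last-colon split; a sketch of that logic follows. The extra guard for a registry port with no tag (a '/' after the last colon) is a defensive addition of mine, not something the review asked for, and the helper name is illustrative.

```typescript
// Split an image reference on the LAST colon so registry ports survive
// (e.g. registry.example.com:5000/nim/llm:1.2.3). Illustrative helper,
// not the PR's actual code.
function splitImageName(imageName: string): { repository: string; tag: string } {
  const idx = imageName.lastIndexOf(':');
  // No colon, or the "tag" contains a '/' (the colon was a registry port):
  // treat the whole string as the repository and default the tag.
  if (idx === -1 || imageName.indexOf('/', idx) !== -1) {
    return { repository: imageName, tag: 'latest' };
  }
  return { repository: imageName.substring(0, idx), tag: imageName.substring(idx + 1) };
}
```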

In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/NIMPVCSelector.tsx`:
- Around line 63-79: The effect that forces modelPath should only run when the
mode changes, not on every modelPath change; update the first React.useEffect to
depend only on nimServicesEnabled (remove modelPath and setModelPath from its
dependency array) and keep the conditional setModelPath(correctPath) so it runs
once on mode switch; remove the second redundant effect that auto-sets when
existingPvcName is present; if the intended behavior is to lock the input in
both modes instead, ensure the input's isDisabled prop (currently tied to
nimServicesEnabled) is updated accordingly instead of changing the effects.

In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/useNIMCompatiblePVCs.ts`:
- Around line 72-84: The function extractPVCFromNIMService contains a redundant
startsWith('nim-pvc') branch that returns the same pvcName in both branches;
simplify by removing the conditional and directly return the pvc name (or null
if missing) — update the function extractPVCFromNIMService to compute pvcName
from nimService.spec.storage?.pvc?.name, return null when falsy, otherwise
return pvcName without the startsWith check.
- Around line 106-116: The code in parseNimModelFromImage (variable parsed) uses
parsed.lastIndexOf('--') to strip a namespace prefix which can mis-handle model
names that themselves contain '--'; change the logic to use parsed.indexOf('--')
(first occurrence) to detect the namespace separator and then return
parsed.substring(firstIndex + 2) when indexOf('--') !== -1, otherwise return
parsed as-is; update the block that currently references lastIndexOf('--') to
use indexOf('--') so namespace stripping only removes the leading segment.
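The extractPVCFromNIMService simplification above amounts to a one-liner; sketched here with a minimal stand-in type rather than the repo's NIMServiceKind.

```typescript
// Simplified per the review: return the PVC name, or null when falsy,
// with no redundant startsWith('nim-pvc') branch. Minimal stand-in type.
type NIMServiceStorage = { spec: { storage?: { pvc?: { name?: string } } } };

const extractPVCFromNIMService = (nimService: NIMServiceStorage): string | null =>
  nimService.spec.storage?.pvc?.name || null;
```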
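The first-occurrence fix for the namespace separator can be sketched as below; the function name is illustrative, not the PR's parseNimModelFromImage itself.

```typescript
// Use the FIRST '--' as the namespace separator so model names that
// themselves contain '--' are preserved intact.
function stripNamespacePrefix(parsed: string): string {
  const idx = parsed.indexOf('--');
  return idx === -1 ? parsed : parsed.substring(idx + 2);
}
```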

In `@frontend/src/pages/modelServing/screens/projects/useNIMServicesEnabled.ts`:
- Around line 30-40: Delete the contradictory block of comments that claims the
logic needs inversion then immediately contradicts itself (the lines that
mention "invert the logic" and the subsequent correction about
useIsAreaAvailable), leaving only the concise explanation that
useIsAreaAvailable already yields status: true when NIM services are enabled;
keep the existing JSDoc/comments that describe disableNIMServices semantics
(references: useIsAreaAvailable, disableNIMServices, NIM Services) so the file
no longer contains confusing leftover debugging notes.

In `@frontend/src/pages/modelServing/screens/projects/utils.ts`:
- Around line 196-206: The effect calling getInferenceServiceDisplayName in
React.useEffect (dependent on existingData and isNIMManaged) can update state
after the component has re-rendered with new props; add a local "isMounted" (or
"cancelled") boolean flag inside the effect, set it true initially and flip to
false in the cleanup, then only call setExistingName(name) when the flag
indicates the effect is still current; ensure the .catch branch also checks the
flag before any state or side-effect and leave the console.error for debugging.
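The useInferenceServiceDisplayName and utils.ts comments above both ask for the same stale-result guard. A framework-free sketch of the request-token pattern follows; the names are invented here, and inside the real effects the token check would sit before setDisplayName / setExistingName, with the token bump performed in the effect's cleanup.

```typescript
// Framework-free sketch of a "latest request wins" guard: only the most
// recent in-flight call is allowed to publish its result.
function makeLatestOnly<T>(publish: (value: T) => void) {
  let token = 0;
  return async (work: () => Promise<T>): Promise<void> => {
    const myToken = ++token; // bump the token; older calls become stale
    const value = await work();
    if (myToken === token) {
      publish(value); // skipped when a newer call has superseded this one
    }
  };
}
```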
🧹 Nitpick comments (17)
frontend/src/pages/modelServing/useInferenceServiceStatus.ts (2)

46-70: Stale closure: modelPod inside the setInterval callback never reflects updates.

The modelPod captured by the interval callback is the value from when the effect last ran. The if (!modelPod) check on line 50 cannot observe the result of await refreshModelPodStatus() — it always sees the stale closure value. The polling happens to work because modelPod is in the dependency array, so when it changes React re-runs the effect, but the explicit check inside the callback is misleading and never the thing that actually stops polling.

Consider using a ref to track the latest modelPod, or restructuring so the stop condition is evaluated outside the interval:

♻️ Suggested approach using a ref
+  const modelPodRef = React.useRef(modelPod);
+  modelPodRef.current = modelPod;
+
   // Manual polling when isStopping is true
   React.useEffect(() => {
     if (isStopping) {
       const interval = setInterval(async () => {
         await refreshModelPodStatus();
-        if (!modelPod) {
+        if (!modelPodRef.current) {
           setIsStopping(false);
         }
       }, FAST_POLL_INTERVAL);

39-43: Object-reference dependency may cause unnecessary refresh calls.

inferenceService.status?.modelStatus?.states is an object. If the parent re-renders with a new inferenceService object (common with k8s watch/polling), this effect fires even when the state values haven't changed, triggering a redundant refreshModelPodStatus() call each time.

Consider depending on the specific primitive values (activeModelState, targetModelState) instead:

♻️ Suggested fix
   React.useEffect(() => {
     refreshModelPodStatus();
-  }, [inferenceService.status?.modelStatus?.states, refreshModelPodStatus]);
+  }, [
+    inferenceService.status?.modelStatus?.states?.activeModelState,
+    inferenceService.status?.modelStatus?.states?.targetModelState,
+    refreshModelPodStatus,
+  ]);
backend/src/utils/constants.ts (1)

81-81: Consider rephrasing the inline comment.

The comment "Set to false for testing NIM Operator integration" reads as developer instructions rather than documenting the flag's purpose. Other disable* flags in this block don't have such comments. Consider either removing it or rephrasing to describe what the flag controls (e.g., // NIM Operator integration - creates NIMService instead of InferenceService when false), matching the style used in the manifest.

packages/kserve/src/modelFormat.ts (1)

2-6: Cross-package import with lint suppression warrants attention.

The eslint-disable for @odh-dashboard/no-restricted-imports suggests this import violates the intended package boundary between packages/kserve and @odh-dashboard/internal. Consider whether isNIMOperatorManaged and getModelNameFromNIMInferenceService should be exposed through a shared utility package or a proper public API surface of the internal package, rather than suppressing the lint rule.

If this is intentional and temporary (given the WIP nature of this PR), a // TODO explaining the plan to resolve this would be helpful.

frontend/src/pages/projects/screens/detail/overview/serverModels/deployedModels/DeployedModelCard.tsx (1)

39-40: Consider optimizing the useInferenceServiceDisplayName hook's dependency array.

The hook uses [inferenceService] as its useEffect dependency. Since this is an object reference, if the parent component creates a new reference on each render (common when fetching K8s resources), the effect will unnecessarily re-trigger and refetch the display name. Update the dependency to use a stable identifier instead, such as inferenceService.metadata.uid combined with inferenceService.metadata.namespace, to prevent unnecessary async calls.

frontend/src/concepts/hardwareProfiles/useServingHardwareProfileConfig.ts (1)

19-28: Duplicated as any cast for NIM predictor containers — consider a shared accessor.

The same as any cast pattern to access predictor.containers[0].resources appears here, in getModelNameFromNIMInferenceService (nimOperatorUtils.ts lines 149-150), and in getInferenceServiceHardwareProfilePaths (const.ts). A single typed helper (e.g., getNIMPredictorResources(inferenceService)) in nimOperatorUtils.ts would eliminate the repeated unsafe casts and keep the NIM-specific predictor shape in one place.

frontend/src/k8sTypes.ts (1)

1631-1996: Well-documented CRD type definition; consider extracting the repeated probe shape.

The liveness, readiness, and startup probe blocks (lines 1832–1923) share an identical structure. Extracting a NIMServiceProbeConfig type would reduce ~90 lines of duplication.

♻️ Suggested type extraction
+type NIMServiceProbeConfig = {
+  enabled?: boolean;
+  probe?: {
+    httpGet?: {
+      path?: string;
+      port: number | string;
+      host?: string;
+      scheme?: string;
+      httpHeaders?: { name: string; value: string }[];
+    };
+    exec?: { command?: string[] };
+    tcpSocket?: { port: number | string; host?: string };
+    grpc?: { port: number; service?: string };
+    initialDelaySeconds?: number;
+    periodSeconds?: number;
+    timeoutSeconds?: number;
+    failureThreshold?: number;
+    successThreshold?: number;
+    terminationGracePeriodSeconds?: number;
+  };
+};
+
 // Then in NIMServiceKind.spec:
-    livenessProbe?: { ... };
-    readinessProbe?: { ... };
-    startupProbe?: { ... };
+    livenessProbe?: NIMServiceProbeConfig;
+    readinessProbe?: NIMServiceProbeConfig;
+    startupProbe?: NIMServiceProbeConfig;
frontend/src/pages/modelServing/screens/projects/ServingRuntimeDetails.tsx (1)

44-53: Consider extracting a shared utility for NIM predictor resource extraction instead of duplicating as any casts.

The as any cast on isvc.spec.predictor to access containers[0].resources appears here and in utils.ts (line 257). A shared utility (e.g., getNIMPredictorResources(isvc)) would centralize the unsafe cast and make it easier to replace once InferenceServiceKind types are extended for NIM operator predictor shapes.

♻️ Example utility
// In nimOperatorUtils.ts
export const getNIMPredictorContainerResources = (
  inferenceService: InferenceServiceKind,
): ContainerResources | undefined => {
  // eslint-disable-next-line @typescript-eslint/consistent-type-assertions, @typescript-eslint/no-explicit-any
  const predictor = inferenceService.spec.predictor as any;
  return predictor?.containers?.[0]?.resources;
};

Then in ServingRuntimeDetails.tsx:

-  if (isNIMManaged) {
-    // NIM Operator uses containers instead of model spec
-    // eslint-disable-next-line `@typescript-eslint/consistent-type-assertions`, `@typescript-eslint/no-explicit-any`
-    const predictor = isvc.spec.predictor as any;
-    resources = predictor?.containers?.[0]?.resources;
-  } else {
+  if (isNIMManaged) {
+    resources = getNIMPredictorContainerResources(isvc);
+  } else {
frontend/src/pages/modelServing/screens/projects/utils.ts (1)

247-264: Duplicated as any cast on predictor for NIM container access.

Same pattern as ServingRuntimeDetails.tsx. Consolidating into a shared utility (e.g., getNIMPredictorContainerEnvVars) would reduce duplication and isolate the unsafe cast.

frontend/src/pages/modelServing/screens/global/DeleteInferenceServiceModal.tsx (1)

54-86: NIM Operator deletion path looks correct overall, but the isKServeNIMEnabled guard may be overly restrictive.

If an InferenceService has a NIMService ownerReference (line 57), NIM is definitively in use in that project. The isKServeNIMEnabled && project check on line 65 could skip resource cleanup for NIM Operator deployments in projects where the annotation opendatahub.io/nim-support is missing (e.g., stale projects, migration scenarios). This would leave orphaned PVCs/secrets behind silently, with only a skipped code path — no warning logged.

Consider whether nimServiceOwner being present is sufficient to proceed with cleanup, or at minimum log a warning when skipping.

frontend/src/__mocks__/mockNimService.ts (1)

72-93: When pvcName is undefined, spec.storage.pvc still includes name: undefined.

If a caller passes pvcName: undefined (like the test at useNIMCompatiblePVCs.spec.ts line 228), the resulting object has storage: { pvc: { name: undefined } } rather than omitting the pvc key entirely. This doesn't match a real NIMService without PVC storage. The test at line 232 works around this by manually overriding spec.storage = {}, but it would be cleaner if the mock conditionally omitted the pvc block when pvcName is not provided.

♻️ Suggested improvement
   storage: {
-    pvc: {
-      name: pvcName,
-      ...(pvcSubPath && { subPath: pvcSubPath }),
-    },
+    ...(pvcName != null && {
+      pvc: {
+        name: pvcName,
+        ...(pvcSubPath && { subPath: pvcSubPath }),
+      },
+    }),
   },
frontend/src/pages/modelServing/screens/projects/nim/nimUtils.ts (2)

333-449: Significant code duplication: getNIMOperatorResourcesToDelete duplicates ~80% of getNIMResourcesToDelete.

The PVC deletion logic (reference counting, two-tier Dashboard-managed check, BYO-PVC preservation) is copy-pasted between the two functions (compare lines 357-418 with lines 227-295). The only difference is how the PVC name is extracted (from nimService.spec.storage?.pvc?.name vs. servingRuntime.spec.volumes). The same applies to the inference count fetch and error handling at the top.

Consider extracting the shared PVC-cleanup and secret-cleanup logic into helper functions that both callers can use, passing only the extracted pvcName, nimSecretName, and imagePullSecretName values.

♻️ Sketch of a refactor
+// Shared helper for PVC cleanup decision
+const deleteIfDashboardManagedPVC = async (
+  pvcName: string,
+  projectName: string,
+  currentDeploymentName: string,
+): Promise<void | undefined> => {
+  const pvcUsage = await checkPVCUsage(pvcName, projectName);
+  if (pvcUsage.count <= 1) {
+    const pvc = await getPvc(projectName, pvcName);
+    const hasNewManagedLabel = pvc.metadata.labels?.['opendatahub.io/managed'] === 'true';
+    const isDashboardNamingPattern = /^nim-pvc-.+$/.test(pvcName);
+    if (hasNewManagedLabel || isDashboardNamingPattern) {
+      await deletePvc(pvcName, projectName);
+    }
+  }
+};

Then both getNIMResourcesToDelete and getNIMOperatorResourcesToDelete would call this helper with the respective PVC name.


170-186: Fragile error-message check for NIM Operator availability.

Line 182 uses nimError.message.includes('not found') to determine whether the NIM Operator is installed. K8s API error messages vary across cluster versions and configurations. Consider checking the HTTP status code (e.g., 404) if available, or treating all errors from listNIMServices as non-fatal for the PVC usage count.

frontend/src/pages/modelServing/screens/global/InferenceServiceTableRow.tsx (1)

60-62: The useInferenceServiceDisplayName hook refetch pattern is safe in this context; consider adding abort logic for robustness.

The hook does re-run its useEffect when the inferenceService reference changes (line 221 dependency), and it lacks cleanup/abort handling. However, since the parent component passes stable references from the table data array (which maintains object identity until actual K8s changes occur), unnecessary refetches are unlikely in practice.

For defensive programming, consider adding an AbortController to cancel in-flight requests if the component unmounts or the prop changes before the async call completes. This prevents stale-state updates if the parent ever recreates objects.

frontend/src/pages/modelServing/screens/global/nimOperatorUtils.ts (2)

173-199: getInferenceServiceDisplayName swallows errors silently.

The console.warn on line 190 is acceptable for fallback behavior, but in production this could mask real issues (e.g., permissions, network). Consider whether the caller should be notified of partial failure.


145-163: Add containers field to InferenceServiceKind.spec.predictor type to eliminate unsafe as any casts.

The InferenceServiceKind type definition in k8sTypes.ts (lines 522–557) does not include a containers field on spec.predictor, forcing NIM Operator–related code to cast to as any. This pattern appears in at least 5 files: nimOperatorUtils.ts, ManageNIMServingModal.tsx, useServingHardwareProfileConfig.ts, ServingRuntimeDetails.tsx, and utils.ts.

Extend the InferenceServiceKind type to include an optional containers field:

Suggested type extension
// In InferenceServiceKind.spec.predictor
containers?: Array<{
  name?: string;
  image?: string;
  env?: Array<{ name: string; value?: string }>;
  resources?: ContainerResources;
}>;

This change would eliminate the need for as any casts across multiple files and improve type safety without breaking existing code.

frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx (1)

398-444: Polling loop can block the UI for up to 30 seconds with no cancellation.

The polling loop (30 attempts × 1s) blocks the submit promise chain. During this time the user sees only a spinner with no progress indication, and there's no way to cancel the operation (closing the modal doesn't abort the in-flight polling).

Key concerns:

  • No AbortSignal or cleanup — if the component unmounts (user closes modal), the polling continues and setDisplayName/state updates on an unmounted component may occur.
  • The timeout is arbitrary — if the NIM Operator is slow or misconfigured, users get a vague error after 30s.
  • No feedback to the user that polling is in progress.

Consider:

  1. Using a ref-based cancellation flag that's cleared in the modal's cleanup.
  2. Providing user feedback (e.g., "Waiting for NIM Operator to provision...").
  3. Alternatively, deferring token auth setup to a separate reconciliation step instead of blocking the submission.
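Point 1 can be sketched as a polling loop driven by a cancellation flag the modal flips on unmount; the function name, parameters, and shape are illustrative, not the PR's actual API.

```typescript
// Cancellable polling sketch: the caller supplies an isCancelled() flag
// (e.g. backed by a ref cleared in the modal's cleanup) so the loop stops
// touching state after unmount. Returns null on cancellation or timeout.
async function pollUntil<T>(
  poll: () => Promise<T | null>,
  isCancelled: () => boolean,
  attempts: number,
  delayMs: number,
): Promise<T | null> {
  for (let i = 0; i < attempts; i++) {
    if (isCancelled()) return null; // stop silently once the caller bails out
    const result = await poll();
    if (result !== null) return result;
    await new Promise((res) => setTimeout(res, delayMs));
  }
  return null; // timed out after `attempts` tries
}
```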

Comment thread frontend/src/api/k8s/nim/nimServices.ts
Comment thread frontend/src/concepts/hardwareProfiles/const.ts
Comment thread frontend/src/pages/modelServing/screens/projects/nim/__tests__/nimUtils.spec.ts Outdated
Comment thread frontend/src/pages/modelServing/screens/projects/nim/useNIMServicesEnabled.ts Outdated
Comment thread frontend/src/pages/modelServing/screens/projects/utils.ts
Comment thread packages/kserve/src/modelFormat.ts Outdated

openshift-ci Bot commented Feb 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from emilys314. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/ManageNIMServingModal.tsx`,
around lines 406-424: the polling loop that waits for getInferenceService in
the create flow (when createDataInferenceService.tokenAuth is true) blocks the
UI for up to 30s with no user feedback. Update ManageNIMServingModal to surface
a polling status and make token-auth setup non-blocking:
- When createDataInferenceService.tokenAuth is set, start the existing polling
  logic in a background async helper (e.g., spawn a promise from a new function
  like waitForInferenceServiceAndHandleResult) instead of awaiting it inline, so
  the modal can update the UI immediately.
- While polling, set actionInProgress to true along with a new statusMessage
  state (e.g., "Waiting for NIM Operator to provision InferenceService..."), and
  render that statusMessage where the spinner is shown.
- Ensure errors from getInferenceService are handled and reported to the user
  (or via toasts) after the background task completes.
🧹 Nitpick comments (15)
frontend/src/pages/modelServing/useInferenceServiceStatus.ts (1)

39-43: useEffect dependency on object reference will fire more often than intended.

inferenceService.status?.modelStatus?.states is an object, so React's shallow comparison treats every new inferenceService instance (from polling / watch) as a change — even when activeModelState and targetModelState haven't actually changed. This will call refreshModelPodStatus() on every upstream data refresh, not just on genuine state transitions.

Depend on the primitive state strings instead:

Proposed fix
-  React.useEffect(() => {
-    refreshModelPodStatus();
-  }, [inferenceService.status?.modelStatus?.states, refreshModelPodStatus]);
+  const activeModelState = inferenceService.status?.modelStatus?.states?.activeModelState;
+  const targetModelState = inferenceService.status?.modelStatus?.states?.targetModelState;
+
+  React.useEffect(() => {
+    refreshModelPodStatus();
+  }, [activeModelState, targetModelState, refreshModelPodStatus]);

The same object-reference dependency appears in the isIncorrectlyReportedAsLoaded memo (line 125) and isNewlyDeployed memo (line 103). Those only cause unnecessary recomputation (no side effects), so they're less urgent, but you may want to align them for consistency.

frontend/src/concepts/hardwareProfiles/useServingHardwareProfileConfig.ts (1)

14-28: Duplicated NIM resource-extraction logic — consider sharing a helper.

The as any cast + containers[0].resources access is repeated here and in getInferenceServiceHardwareProfilePaths (in const.ts) and getModelNameFromNIMInferenceService (in nimOperatorUtils.ts). A small shared helper (e.g., getNIMPredictorContainer(inferenceService)) that encapsulates the unsafe cast and returns the first container (or undefined) would reduce duplication and confine the any escape hatch to one place.

Not blocking, but worth tracking for a follow-up.
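A minimal sketch of such a helper, assuming the NIM Operator's predictor shape; `getNIMPredictorContainer` is a hypothetical name, and the local type aliases stand in for the project's real `InferenceServiceKind` and container types.

```typescript
// Hypothetical shared helper confining the unsafe cast to one place.
// These aliases approximate the project's types for illustration.
type V1Container = {
  name: string;
  resources?: Record<string, unknown>;
  env?: { name: string; value?: string }[];
};
type InferenceServiceKind = { spec: { predictor: Record<string, unknown> } };

export function getNIMPredictorContainer(
  inferenceService: InferenceServiceKind,
): V1Container | undefined {
  // The NIM Operator places containers directly on the predictor,
  // outside the typed KServe schema, hence the cast.
  const predictor = inferenceService.spec.predictor as {
    containers?: V1Container[];
  };
  return predictor.containers?.[0];
}
```

Call sites in `useServingHardwareProfileConfig`, `const.ts`, and `nimOperatorUtils.ts` would then read `getNIMPredictorContainer(svc)?.resources` instead of repeating the cast.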

frontend/src/concepts/areas/const.ts (1)

182-191: NIM_SERVICES area entry looks good.

The dependency chain (NIM_SERVICES → NIM_MODEL → K_SERVE) correctly ensures NIM services are only available when both NIM model serving and KServe are enabled.

Nit: The comment on line 186 ("This flag is inverted") describes the standard convention for all disable* flags in this file, not something unique to this flag. Consider removing that line to avoid implying this flag is special.

frontend/src/pages/modelServing/screens/projects/nim/__tests__/nimUtils.spec.ts (1)

221-244: ...overrides at the end silently replaces the merged spec.

The spread order means when overrides includes a spec property, the shallow merge on line 241 (...overrides?.spec) is discarded and the whole spec is replaced by overrides.spec. This doesn't break current tests (callers supply all required spec fields), but it makes the helper misleading — it looks like it merges spec fields but actually doesn't.

Consider placing ...overrides before the spec: key, or restructuring:

Suggested fix
 const createMockNIMService = (overrides?: Partial<NIMServiceKind>): NIMServiceKind => ({
   apiVersion: 'apps.nvidia.com/v1alpha1',
   kind: 'NIMService',
   metadata: {
     name: 'test-nim-service',
     namespace: projectName,
+    ...overrides?.metadata,
   },
   spec: {
     image: {
       repository: 'nvcr.io/nim/meta/llama-3.1-8b-instruct',
       tag: '1.8.5',
       pullSecrets: ['ngc-secret'],
     },
     authSecret: 'nvidia-nim-secrets',
     storage: {
       pvc: {
         name: 'nim-pvc-xyz89',
       },
     },
     replicas: 1,
     ...overrides?.spec,
   },
-  ...overrides,
 });
frontend/src/pages/modelServing/screens/projects/ServingRuntimeDetails.tsx (1)

44-53: Unsafe as any cast on predictor to access NIM Operator containers.

The NIM Operator places containers directly on the predictor, which doesn't match InferenceServiceKind's typed spec. The as any cast works but hides structural mismatches at compile time. Consider extracting a typed helper (similar to getModelNameFromNIMInferenceService in nimOperatorUtils.ts which uses the same pattern) or defining a shared utility that encapsulates the NIM predictor shape, to keep these casts centralized.

frontend/src/pages/modelServing/screens/projects/nim/NIMServiceModal/NIMPVCSelector.tsx (1)

164-178: Use a PatternFly Button with variant="link" instead of a styled <button>.

The inline-styled <button> elements (lines 164–178 and 195–209) manually replicate link styling. PatternFly's <Button variant="link"> provides consistent theming, accessibility (focus ring, ARIA), and eliminates the inline style object.

frontend/src/k8sTypes.ts (1)

1831-1923: Extract shared probe type to reduce duplication.

livenessProbe, readinessProbe, and startupProbe all share an identical inner probe structure (~30 lines each). Consider extracting a NIMServiceProbe type and reusing it.

♻️ Suggested refactor
+type NIMServiceProbeSpec = {
+  httpGet?: {
+    path?: string;
+    port: number | string;
+    host?: string;
+    scheme?: string;
+    httpHeaders?: { name: string; value: string }[];
+  };
+  exec?: { command?: string[] };
+  tcpSocket?: { port: number | string; host?: string };
+  grpc?: { port: number; service?: string };
+  initialDelaySeconds?: number;
+  periodSeconds?: number;
+  timeoutSeconds?: number;
+  failureThreshold?: number;
+  successThreshold?: number;
+  terminationGracePeriodSeconds?: number;
+};
+
+type NIMServiceProbeConfig = {
+  enabled?: boolean;
+  probe?: NIMServiceProbeSpec;
+};

Then use NIMServiceProbeConfig for all three probe fields.

frontend/src/pages/modelServing/screens/global/InferenceServiceTableRow.tsx (1)

60-62: useInferenceServiceDisplayName triggers an async fetch per table row.

For NIM-managed deployments, each row fires an individual getNIMService API call to resolve the display name. In a table with many NIM-managed rows, this could result in a burst of requests and a visible flash from fallback → resolved name. Consider batching or caching NIMService lookups at the table/page level if this becomes a user-facing issue.
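One way to de-duplicate the per-row lookups is a page-level promise cache keyed by namespace/name. The sketch below is an assumption about the approach, not the dashboard's API; the `fetcher` parameter stands in for the real `getNIMService` call.

```typescript
// Sketch of request de-duplication for per-row NIMService display-name
// lookups: concurrent rows for the same resource share one in-flight promise.
const nimDisplayNameCache = new Map<string, Promise<string>>();

export function getCachedNIMDisplayName(
  namespace: string,
  name: string,
  fetcher: (ns: string, n: string) => Promise<string>,
): Promise<string> {
  const key = `${namespace}/${name}`;
  let cached = nimDisplayNameCache.get(key);
  if (!cached) {
    cached = fetcher(namespace, name).catch((err) => {
      // Evict failed lookups so a later render can retry.
      nimDisplayNameCache.delete(key);
      throw err;
    });
    nimDisplayNameCache.set(key, cached);
  }
  return cached;
}
```

With this in place, a table of N rows backed by the same NIMService issues one request instead of N, which also removes the fallback-to-resolved name flash for all but the first row.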

frontend/src/pages/modelServing/screens/projects/nim/nimUtils.ts (2)

333-449: Significant duplication with getNIMResourcesToDelete — extract shared PVC deletion logic.

getNIMOperatorResourcesToDelete (lines 333-449) duplicates ~90% of the PVC deletion logic from getNIMResourcesToDelete (lines 203-323): the inference count fetch, two-tier PVC ownership check, Dashboard-managed pattern matching, and secret deletion guard. Only the source of pvcName and secret names differs.

Consider extracting the shared logic into a helper, e.g.:

Sketch of a shared helper
+async function deleteNIMPVCIfUnused(
+  pvcName: string,
+  projectName: string,
+  currentDeploymentName: string,
+): Promise<Promise<void>[]> {
+  // shared two-tier PVC check + deletion logic
+}
+
+async function deleteNIMSecretsIfLast(
+  projectName: string,
+  nimSecretName: string | undefined,
+  imagePullSecretName: string | undefined,
+  inferenceCount: number,
+): Promise<Promise<void>[]> {
+  // shared secret deletion guard
+}

This would reduce maintenance burden and ensure both paths stay in sync as the logic evolves.


170-186: NIM Operator not-installed error handling looks reasonable but consider tightening the check.

The catch block on line 182 checks nimError.message.includes('not found') to suppress expected errors when the NIM Operator CRD isn't installed. However, a 404 for a specific NIMService resource would also contain "not found" and be silently swallowed. Since this is a list operation that would fail with a CRD-not-found error, this is likely fine in practice, but a more specific check (e.g., matching on HTTP status code or CRD-related error patterns) would be more robust.
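A tighter check might distinguish the two cases by inspecting the error shape. The sketch below is an assumption: the exact message patterns and status fields depend on the Kubernetes client in use, so both heuristics (`no matches for kind` for a missing CRD, a named resource in a plain 404) should be verified against real errors before adoption.

```typescript
// Hypothetical predicate separating "CRD not installed" from
// "this particular resource was not found". Error shapes are assumptions.
type K8sError = {
  message: string;
  statusObject?: { code?: number; reason?: string };
};

export function isCRDNotInstalledError(e: K8sError): boolean {
  // A 404 with reason NotFound that names a specific resource is a
  // per-resource miss, not a missing CRD.
  if (e.statusObject?.code === 404 && e.statusObject.reason === 'NotFound') {
    return /the server could not find the requested resource/i.test(e.message);
  }
  // Discovery failures for an uninstalled CRD commonly report the kind
  // itself as unknown.
  return /no matches for kind/i.test(e.message);
}
```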

frontend/src/pages/modelServing/screens/global/__tests__/nimOperatorUtils.spec.ts (2)

319-367: filterNIMSystemEnvVars tests don't cover all 7 system env vars.

The "should filter out system-managed NIM environment variables" test only includes NIM_CACHE_PATH, NGC_API_KEY, and NIM_SERVER_PORT, but the implementation's NIM_SYSTEM_ENV_VARS list also includes OUTLINES_CACHE_DIR, NIM_HTTP_API_PORT, NIM_JSONL_LOGGING, and NIM_LOG_LEVEL. If any of those entries were accidentally removed from the constant, these tests would still pass.

Consider adding them to the test input to ensure full coverage of the filter list.


1-368: Good coverage for pure utilities; consider adding tests for getInferenceServiceDisplayName and useInferenceServiceDisplayName.

The async function getInferenceServiceDisplayName and the React hook useInferenceServiceDisplayName are not covered. These have non-trivial logic (NIMService fetch with fallback, stale-guard in the effect) that would benefit from unit tests, especially the error/fallback path.

frontend/src/pages/modelServing/screens/global/nimOperatorUtils.ts (3)

20-38: Minor: redundant v1alpha1 exact match.

Line 26 checks ref.apiVersion === 'apps.nvidia.com/v1alpha1' but line 27 already covers it with ref.apiVersion.startsWith('apps.nvidia.com/'). The explicit check is redundant but harmless — it may serve as documentation of the currently expected version.


62-70: Consider keeping NIM_SYSTEM_ENV_VARS as a Set for O(1) lookups.

Currently NIM_SYSTEM_ENV_VARS is an array, and filterNIMSystemEnvVars uses .includes() which is O(n) per check. With 7 entries this is negligible, but a Set would be more idiomatic for membership checks and scales better if the list grows.
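A sketch of the Set-based version, assuming the seven variable names mentioned in the test-coverage comment above; the actual constant in the source may differ.

```typescript
// Set-backed membership check for O(1) lookups. The entries mirror the
// seven names cited in the review and are an assumption about the source.
const NIM_SYSTEM_ENV_VARS = new Set([
  'NIM_CACHE_PATH',
  'NGC_API_KEY',
  'NIM_SERVER_PORT',
  'OUTLINES_CACHE_DIR',
  'NIM_HTTP_API_PORT',
  'NIM_JSONL_LOGGING',
  'NIM_LOG_LEVEL',
]);

type EnvVar = { name: string; value?: string };

export function filterNIMSystemEnvVars(envVars: EnvVar[]): EnvVar[] {
  return envVars.filter((v) => !NIM_SYSTEM_ENV_VARS.has(v.name));
}
```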


145-163: The as any cast is necessary but fragile — document the NIM Operator predictor schema deviation.

The NIM Operator creates InferenceServices with spec.predictor.containers[] instead of the standard spec.predictor.model. This is a schema deviation that isn't captured by InferenceServiceKind. A type extension or a comment referencing the NIM Operator documentation would help future maintainers understand why this cast is needed.
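One way to document the deviation is a narrowing type plus a small guard, so the cast appears once and carries its own explanation. The names below (`NIMManagedInferenceService`, `asNIMManaged`) are hypothetical, and the local aliases stand in for the project's real types.

```typescript
// Hypothetical narrowing type documenting the NIM Operator's deviation:
// spec.predictor.containers[] instead of the standard spec.predictor.model.
type V1Container = { name: string; image?: string };
type InferenceServiceKind = { spec: { predictor: Record<string, unknown> } };

type NIMManagedInferenceService = InferenceServiceKind & {
  spec: { predictor: { containers?: V1Container[] } };
};

export function asNIMManaged(
  svc: InferenceServiceKind,
): NIMManagedInferenceService | undefined {
  // The single unsafe cast lives here; callers get a typed view or undefined.
  const predictor = svc.spec.predictor as { containers?: unknown };
  return Array.isArray(predictor.containers)
    ? (svc as NIMManagedInferenceService)
    : undefined;
}
```

Callers can then write `asNIMManaged(svc)?.spec.predictor.containers?.[0]` with full type checking instead of scattering `as any`.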

@mtalvi
Contributor Author

mtalvi commented Feb 15, 2026

@coderabbitai resolve

@coderabbitai
Contributor

coderabbitai Bot commented Feb 15, 2026

✅ Actions performed

Comments resolved. Auto-approval is disabled; enable reviews.request_changes_workflow to approve automatically.

@mtalvi mtalvi requested a review from emilys314 February 15, 2026 14:17
@mtalvi
Contributor Author

mtalvi commented Feb 15, 2026

The failing tests are not related to my changes.

Member

@andrewballantyne andrewballantyne left a comment


I'm partway through this -- can you help me understand some of these things?

@openshift-merge-robot openshift-merge-robot added the needs-rebase PR needs to be rebased label Mar 11, 2026
@openshift-merge-robot openshift-merge-robot removed the needs-rebase PR needs to be rebased label Mar 12, 2026
@openshift-ci openshift-ci Bot added the needs-rebase PR needs to be rebased label Mar 22, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Mar 22, 2026

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-actions
Contributor

This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale Issue was created a long time ago and nothing has happened label Apr 13, 2026