A Kubernetes Helm chart for running the service mode of
nemo-retriever: a FastAPI document ingestion server that
streams uploads through a set of NVIDIA NIM microservices
(page-elements, table-structure, OCR, VLM embed by default) and exposes
result + status APIs over HTTP / SSE.
Use Helm (this chart and/or the additional Library charts documented in the NeMo Retriever Library) for supported NIM and service deployment.
The chart ships two deployable layers behind feature flags:
- the service: always on; one Deployment (standalone) or three
  Deployments (split topology: gateway / realtime / batch), built from
  Dockerfile --target service.
- the NIMs: optional, GPU-backed NIMCache + NIMService custom resources (apiVersion: apps.nvidia.com/v1alpha1) reconciled by the NVIDIA NIM Operator. The chart auto-wires the operator-managed Service URLs into the retriever-service config when the operator CRDs are present in the cluster.
NIM Operator prerequisite. The NIM templates are gated on the
apps.nvidia.com/v1alpha1 API group. Install the NIM Operator before running helm install: https://docs.nvidia.com/nim-operator/

Without the operator the chart still installs cleanly: every NIMCache / NIMService template short-circuits and the service falls back to external NIM URLs supplied via
serviceConfig.nimEndpoints.*.
Persistence today is SQLite on a single ReadWriteOnce PVC, which caps the service at one replica. The chart already exposes the HPA scaffolding so it's a one-line change once the planned PostgreSQL backend lands.
For behavioral consistency between local HuggingFace deployments and Helm service deployments, return_results defaults to True, as in `results = ingestor.ingest(..., return_results=True)`. Returning results carries a significant performance and system-memory cost. Unless you know you need extraction results fetched back to the client, pass `return_results=False`. If you must return results, you may need to raise the pod memory specs to absorb the extra usage.
nemo_retriever/helm/
├── Chart.yaml
├── values.yaml
├── README.md                      <-- this file
├── .helmignore
└── templates/
    ├── _helpers.tpl
    ├── NOTES.txt
    ├── configmap.yaml             # renders retriever-service.yaml
    ├── deployment.yaml            # the service Deployment(s)
    ├── service.yaml               # ClusterIP/NodePort for the service
    ├── ingress.yaml               # optional Ingress
    ├── hpa.yaml                   # optional HorizontalPodAutoscaler
    ├── servicemonitor.yaml        # optional Prometheus ServiceMonitor
    ├── serviceaccount.yaml
    ├── pvc.yaml                   # SQLite database PVC
    ├── secrets.yaml               # ngc-secret + ngc-api
    └── nims/
        ├── nemotron-page-elements-v3.yaml                  # NIMCache + NIMService
        ├── nemotron-table-structure-v1.yaml                # NIMCache + NIMService
        ├── nemotron-ocr-v1.yaml                            # NIMCache + NIMService (OCR)
        ├── llama-nemotron-embed-vl-1b-v2.yaml              # NIMCache + NIMService (VLM embed)
        ├── llama-nemotron-rerank-vl-1b-v2.yaml             # NIMCache + NIMService (optional; not auto-wired)
        ├── nemotron-parse.yaml                             # NIMCache + NIMService (optional; not auto-wired)
        ├── nemotron-3-nano-omni-30b-a3b-reasoning.yaml     # NIMCache + NIMService (optional; not auto-wired)
        └── audio.yaml                                      # NIMCache + NIMService (optional; not auto-wired)
The chart defaults to the staging image published to NGC:
nvcr.io/nvstaging/nim/nemo-retriever-service:043020205-001
Pulling from nvcr.io/nvstaging requires an NGC pull secret — either set
ngcImagePullSecret.create=true (see below) or pre-create one in the
namespace named ngc-secret.
To run a locally built image instead, build and push it from the repo root,
then override service.image.repository / service.image.tag:
# from the repo root:
docker build \
--target service \
-t <YOUR_REGISTRY>/nemo-retriever-service:<TAG> .
docker push <YOUR_REGISTRY>/nemo-retriever-service:<TAG>

Audio and video extraction require the ffmpeg and ffprobe system
binaries inside the service container. The bundled service image can install
them at container startup when you set service.installFfmpeg=true, which
sets INSTALL_FFMPEG=true for the image entrypoint:
helm upgrade --install retriever ./nemo_retriever/helm \
--set service.image.repository=<YOUR_REGISTRY>/nemo-retriever-service \
--set service.image.tag=<TAG> \
--set service.installFfmpeg=true

Do not also set INSTALL_FFMPEG in service.env; the chart fails rendering
when both are configured so the rendered Pod does not contain duplicate
environment variables.
When service.installFfmpeg=false (the default), the service still starts
normally and processes PDF, image, text and HTML uploads. Audio / video
uploads are rejected up-front with HTTP 501:
Audio and video ingestion require FFmpeg in the retriever service
container, but the following dependencies are missing: ffmpeg, ffprobe.
Re-deploy the Helm chart with `--set service.installFfmpeg=true` …
The retriever-service container also logs a WARNING at startup when
FFmpeg is missing so cluster operators can fix the deployment before
the first media upload arrives, instead of debugging a Ray worker
traceback (RuntimeError: MediaChunkActor requires media dependencies; missing: ffmpeg, ffprobe) after the fact. The same WARNING is emitted
on every pod (gateway, realtime, batch) because all roles classify
uploads — flipping service.installFfmpeg=true updates them all.
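The up-front rejection can be pictured as a dependency probe that runs before any media upload is accepted. The following is a minimal sketch in Python, not the service's actual code: the helper name `missing_media_dependencies` and its injectable `which` parameter are hypothetical, and only the binary names (ffmpeg, ffprobe) come from the text above.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Binaries named in the HTTP 501 message above.
REQUIRED_BINARIES = ("ffmpeg", "ffprobe")


def missing_media_dependencies(
    binaries: Iterable[str] = REQUIRED_BINARIES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required binaries that cannot be found on PATH.

    `which` is injectable (hypothetical design choice) so the check can
    be exercised without depending on the host's real PATH.
    """
    return [name for name in binaries if which(name) is None]
```

Calling `missing_media_dependencies()` with no arguments inside the service container returns an empty list when both binaries are installed, and the names of whichever are absent otherwise.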
Runtime installation uses passwordless sudo scoped to installing the
ffmpeg package in the service image. The pod must have network egress to the
Ubuntu package repositories, a writable root filesystem, and a security policy
that allows sudo/setuid behavior. Do not set
service.securityContext.allowPrivilegeEscalation: false or
service.securityContext.readOnlyRootFilesystem: true for this path.
For air-gapped or locked-down clusters, see Deployment options — Air-gapped and disconnected deployment. On a connected staging host you can extend the service image, for example:
FROM <YOUR_REGISTRY>/nemo-retriever-service:<BASE_TAG>
USER root
RUN apt-get update && apt-get install -y --no-install-recommends ffmpeg \
&& rm -rf /var/lib/apt/lists/*
USER nemo

If you already have NIM endpoints reachable from the cluster (e.g. another namespace, or NVIDIA Build), turn the master switch off and supply the URLs directly:
helm install retriever ./nemo_retriever/helm \
--set nims.enabled=false \
--set ngcImagePullSecret.create=true \
--set ngcImagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.create=true \
--set ngcApiSecret.password=$NGC_API_KEY \
--set serviceConfig.nimEndpoints.pageElementsInvokeUrl=http://page-elements.svc:8000/v1/infer \
--set serviceConfig.nimEndpoints.tableStructureInvokeUrl=http://table-structure.svc:8000/v1/infer \
--set serviceConfig.nimEndpoints.ocrInvokeUrl=http://ocr.svc:8000/v1/infer \
--set serviceConfig.nimEndpoints.embedInvokeUrl=http://embed.svc:8000/v1/embeddings

ngcApiSecret materialises an ngc-api Secret containing both
NGC_API_KEY and NGC_CLI_API_KEY keys; the service container reads it
via optional: true secretKeyRef, so the install still succeeds when
the secret is absent (useful for fully local NIM endpoints).
Install the NIM Operator first so
the NIMCache / NIMService CRDs (apps.nvidia.com/v1alpha1) are
registered. A plain helm install reconciles the four core NIMs
(page_elements, table_structure, ocr, vlm_embed) — every other
NIM (the VL reranker rerankqa, Nemotron Parse, Omni 30B, and the
Parakeet audio ASR NIM) is disabled by default to honor the
"optional and disabled by default" contract in
deployment-options.md;
see Recommended minimal install (26.05)
for the opt-in --set flags that turn any of them on.
helm install retriever ./nemo_retriever/helm \
--set ngcImagePullSecret.create=true \
--set ngcImagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.create=true \
--set ngcApiSecret.password=$NGC_API_KEY

Deploy only the four core NIMs that the retriever service auto-wires (page_elements, table_structure, ocr, vlm_embed):
helm install retriever ./nemo_retriever/helm \
--set ngcImagePullSecret.create=true \
--set ngcImagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.create=true \
--set ngcApiSecret.password=$NGC_API_KEY

The VL reranker (rerankqa), Nemotron Parse, the Nemotron 3 Nano Omni 30B caption NIM, and the Parakeet audio ASR NIM are all off by default in 26.05; they only reconcile when you explicitly opt in. Opt-in flags:
- VL reranker: --set nimOperator.rerankqa.enabled=true
- Nemotron Parse: --set nimOperator.nemotron_parse.enabled=true
- Omni 30B captioner: --set nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.enabled=true
- Parakeet ASR: --set nimOperator.audio.enabled=true (also set serviceConfig.nimEndpoints.audioGrpcEndpoint=audio:50051 to wire ASR into the service, plus service.installFfmpeg=true if your image does not bundle ffmpeg)

This matches the "optional and disabled by default" contract in deployment-options.md and avoids silently pulling ≈ 62 GiB of Omni weights or claiming a second dedicated GPU on a "default" install. See the model hardware requirements table for per-NIM GPU and disk costs.
The chart auto-wires the operator-managed in-cluster URLs of the four
"core" NIMs into the service's nim_endpoints block:
| key | operator-managed Service | invoke path |
|---|---|---|
| nimOperator.page_elements | nemotron-page-elements-v3 | /v1/infer |
| nimOperator.table_structure | nemotron-table-structure-v1 | /v1/infer |
| nimOperator.ocr | nemotron-ocr-v1 | /v1/infer |
| nimOperator.vlm_embed | llama-nemotron-embed-vl-1b-v2 | /v1/embeddings |

Track operator reconciliation with:
kubectl get nimcache,nimservice -n <namespace>
kubectl describe nimservice nemotron-page-elements-v3 -n <namespace>

First-time NIMCache reconciliation downloads model weights to a PVC. By
default (nimOperator.nimCache.keepOnUninstall: true) every NIMCache
carries helm.sh/resource-policy: keep so those downloads survive
helm uninstall. NIMService CRs do not use keep and are removed by
Helm on uninstall.
| What you see | Typical cause |
|---|---|
| NIMCache + PVC remain | Expected when keepOnUninstall is true (default). Helm intentionally skips deleting caches so you do not re-pull multi‑GiB weights. |
| NIMService CR remains | Not expected on a normal uninstall. Usually an orphan from a failed install/upgrade (release never recorded the resource, or the chart renamed a NIM). |
| Deployments / GPU pods still running | Often the operator workload for a kept NIMCache, or a stale NIMService that Helm did not own. Check kubectl get nimservice,nimcache -n <ns>. |
| nemotron-*-job-* pods in Error | The NIM Operator's model-download Job for a NIMCache (not the retriever service). Failed cache pulls retry and leave Error pods until the Job or NIMCache is deleted. Common after a failed helm install when the release is rolled back but keep retains the cache CR. |
| helm uninstall appears to do nothing | Release may be missing or failed (helm list -n <ns> -a). CRs created before a failed install can be left without a release to clean them up. |

Full teardown (dev cluster — deletes caches and PVCs Helm kept):
NS=retriever
REL=nemo-retriever
helm uninstall "${REL}" -n "${NS}" 2>/dev/null || true
# Orphans and kept NIMCaches (Helm keep does not block kubectl delete):
kubectl delete nimservice,nimcache -n "${NS}" --all
# Optional: drop model PVCs if you will re-pull from NGC
kubectl delete pvc -n "${NS}" -l 'app.kubernetes.io/managed-by=nvidia-nim-operator' 2>/dev/null || true

Dev installs that should not retain caches on uninstall:
helm upgrade --install "${REL}" ./nemo_retriever/helm -n "${NS}" \
--set nimOperator.nimCache.keepOnUninstall=false \
...

The full schema lives in values.yaml. Below is the
short list of knobs you'll touch first.
| Path | Default | Notes |
|---|---|---|
| service.image.repository | localhost:32000/nemo-retriever-service | Override to a published image. |
| service.image.tag | latest | |
| service.replicas | 1 | Hard cap = 1 while SQLite is the backend. |
| service.installFfmpeg | false | Install ffmpeg/ffprobe at container startup by setting INSTALL_FFMPEG=true. Requires network egress, writable root filesystem, and sudo/setuid allowed. Not for air-gapped clusters; use a custom image instead. |
| service.resources.requests | 16 / 16Gi | Tune in tandem with serviceConfig.pipeline.*Workers. |
| service.resources.limits | 96 / 96Gi | |
| service.gpu.enabled | false | The service does not need a GPU. |

For audio and video extraction, set service.installFfmpeg=true when your
cluster allows runtime package installation. For air-gapped clusters, see
Deployment options — Air-gapped and disconnected deployment.
To run self-hosted Parakeet for audio and video extraction:
- Set nimOperator.audio.enabled=true (it is off by default in 26.05; leave the other optional NIMs you do not need disabled, per Recommended minimal install (26.05)).
- Pin the ASR NIMService to a dedicated GPU with nimOperator.audio.resources, nodeSelector, or tolerations (see NIM Operator).
- Confirm the GPU SKU in Model hardware requirements (footnote ⁴ lists Blackwell limitations).
- Set service.installFfmpeg=true when the retriever service will process audio or video (see service.installFfmpeg above).
The retriever service picks up the in-cluster ASR endpoint when nimOperator.audio is enabled; see NIM Operator sub-stack.
| Path | Default | Notes |
|---|---|---|
| serviceConfig.server.port | 7670 | Container + Service port. |
| serviceConfig.pipeline.realtimeWorkers | 24 | Per-pod realtime worker count. |
| serviceConfig.pipeline.batchWorkers | 48 | Per-pod batch worker count. See Timeouts and alleviating ingest failures if embed or pool errors appear under load. |
| serviceConfig.nimEndpoints.*InvokeUrl | "" | Override the auto-resolved NIM Operator URL. Available knobs: pageElementsInvokeUrl, tableStructureInvokeUrl, ocrInvokeUrl, embedInvokeUrl, and captionInvokeUrl (see Image captioning (Omni 30B)). |
| serviceConfig.nimEndpoints.captionModelName | "" | Model id sent to the remote VLM. Auto-set to nvidia/nemotron-3-nano-omni-30b-a3b-reasoning whenever a caption URL is resolved. |
| serviceConfig.vectordb.enabled | true | Deploy the LanceDB vectordb Pod. When true the chart requires a resolvable embed endpoint (see VectorDB and the embed endpoint); helm install / helm upgrade fails fast otherwise. |
| serviceConfig.vectordb.lancedbUri | /data/vectordb | LanceDB on the vectordb Pod's PVC. |
| serviceConfig.vectordb.embedModel | nvidia/llama-nemotron-embed-vl-1b-v2 | Passed to vectordb + worker embed_model_name. |

The vectordb Pod's /v1/query handler embeds the incoming query text
before searching LanceDB. It needs a NIM embedding endpoint to do that,
and rendering the Deployment with an empty --embed-endpoint produces a
Pod that passes its /v1/health probe but answers every /v1/query
request with HTTP 501 No embedding endpoint configured. — a healthy
deployment that silently breaks retrieval.
To prevent this, the chart now refuses to render
deployment-vectordb.yaml when no embed endpoint can be resolved.
helm install / helm upgrade --install fails with a message listing
the three supported escape valves:
serviceConfig.vectordb.enabled=true but the embed endpoint could not be
resolved. Pick one of:
1. --set serviceConfig.nimEndpoints.embedInvokeUrl=http://<host>:<port>/v1/embeddings
2. --set nimOperator.vlm_embed.enabled=true # requires apps.nvidia.com/v1alpha1 CRDs
3. --set serviceConfig.vectordb.enabled=false
Resolution order matches the rest of the chart (see Mix and match NIM sources):
- Explicit serviceConfig.nimEndpoints.embedInvokeUrl always wins.
- Otherwise the operator-managed URL of nimOperator.vlm_embed.nimServiceName is used, provided nimOperator.vlm_embed.enabled=true and the apps.nvidia.com/v1alpha1 CRDs are installed in the cluster.
- Otherwise the chart fails the install.
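The resolution order above can be sketched as a small function. This is an illustration of the documented precedence in Python, not the chart's template code: the function name `resolve_embed_url` and its parameters are hypothetical, and the in-cluster URL shape (`http://<service>:8000/v1/embeddings`) is an assumption inferred from the default service name, HTTP port, and invoke path listed elsewhere in this README.

```python
def resolve_embed_url(
    explicit_url: str = "",
    operator_enabled: bool = True,
    crds_installed: bool = False,
    nim_service_name: str = "llama-nemotron-embed-vl-1b-v2",
    port: int = 8000,
) -> str:
    """Mirror the documented precedence for the embed endpoint."""
    # 1. An explicit serviceConfig.nimEndpoints.embedInvokeUrl always wins.
    if explicit_url:
        return explicit_url
    # 2. Otherwise use the operator-managed Service, but only when the NIM
    #    is enabled and the apps.nvidia.com/v1alpha1 CRDs are present.
    if operator_enabled and crds_installed:
        # Assumed URL shape; the chart computes the real value.
        return f"http://{nim_service_name}:{port}/v1/embeddings"
    # 3. Otherwise the install fails rather than rendering a broken Pod.
    raise ValueError(
        "serviceConfig.vectordb.enabled=true but the embed endpoint "
        "could not be resolved"
    )
```

Failing in the last branch, rather than returning an empty string, reflects the design choice described above: a vectordb Pod with no embed endpoint would pass its health probe yet reject every query.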
Each NIM block under nimOperator.<key> renders a NIMCache + NIMService
pair gated on three conditions ALL holding:
- The apps.nvidia.com/v1alpha1 CRDs are installed in the cluster.
- The master switch nims.enabled is true.
- The per-NIM nimOperator.<key>.enabled is true.
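To make the three-way gate concrete, the sketch below computes which NIMs would render for a given combination of switches. It is a hypothetical Python model, not chart code: `rendered_nims` and `DEFAULT_ENABLED` are illustrative names, and the per-NIM defaults are copied from the values table in this README.

```python
from typing import Dict, List, Optional

# Per-NIM `enabled` defaults as listed in this README's values table.
DEFAULT_ENABLED: Dict[str, bool] = {
    "page_elements": True,
    "table_structure": True,
    "ocr": True,
    "vlm_embed": True,
    "rerankqa": False,
    "nemotron_parse": False,
    "nemotron_3_nano_omni_30b_a3b_reasoning": False,
    "audio": False,
}


def rendered_nims(
    crds_installed: bool,
    nims_enabled: bool = True,
    overrides: Optional[Dict[str, bool]] = None,
) -> List[str]:
    """Return the nimOperator keys whose NIMCache + NIMService render."""
    # Conditions 1 and 2 are cluster-wide: if either fails, nothing renders.
    if not (crds_installed and nims_enabled):
        return []
    # Condition 3 is evaluated per NIM, after applying any --set overrides.
    enabled = {**DEFAULT_ENABLED, **(overrides or {})}
    return [key for key, on in enabled.items() if on]
```

With the operator CRDs installed and no overrides, only the four core NIMs are returned; passing `overrides={"audio": True}` adds the Parakeet ASR NIM.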
| Path | Default | Notes |
|---|---|---|
| nims.enabled | true | Master switch. Set false to render no NIM resources. |
| nimOperator.page_elements.enabled | true | Page-elements detector NIM. |
| nimOperator.table_structure.enabled | true | Table-structure detector NIM. |
| nimOperator.ocr.enabled | true | OCR NIM. |
| nimOperator.ocr.image | nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0 | Default OCR NIM image. |
| nimOperator.vlm_embed.enabled | true | Multimodal embedding NIM (also used by the vectordb Pod). |
| nimOperator.vlm_embed.nimServiceName | llama-nemotron-embed-vl-1b-v2 | NIMService / in-cluster DNS name. |
| nimOperator.vlm_embed.image | nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2:1.12.0 | Default VLM embed NIM image. |
| nimOperator.rerankqa.enabled | false | VL reranker NIM (optional; not auto-wired). Set true to opt in. Default false so 26.05 installs honor the "optional and disabled by default" contract in deployment-options.md and do not silently provision an extra ≈ 3.1 GiB GPU NIM. The image points at the VL SKU (llama-nemotron-rerank-vl-1b-v2) per prerequisites-support-matrix.md; the text-only llama-nemotron-rerank-1b-v2 silently degrades multimodal reranking and is not the documented POR. |
| nimOperator.nemotron_parse.enabled | false | Structured-parse NIM (optional). Set true when using extract_method="nemotron_parse". Default false so 26.05 installs honor the "optional and disabled by default" contract in deployment-options.md. Image tag follows the image tag conventions. |
| nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.enabled | false | Omni 30B caption NIM (optional). Set true to enable image captioning; see Image captioning (Omni 30B). Default false so 26.05 installs do not silently pull ≈ 62 GiB of BF16 weights or claim a second dedicated GPU. Image tag follows the image tag conventions. |
| nimOperator.audio.enabled | false | Parakeet ASR NIM (optional). Set true for audio/video transcription; pair with serviceConfig.nimEndpoints.audioGrpcEndpoint=audio:50051 so the retriever-service can reach it. |
| nimOperator.<key>.image.repository | nvcr.io/nim/nvidia/... | Per-NIM image. |
| nimOperator.<key>.image.pullSecrets | [ngc-secret] | Referenced by the NIMService CR. |
| nimOperator.<key>.authSecret | ngc-api | NIM auth Secret name. |
| nimOperator.<key>.storage.pvc.size | 25Gi (50Gi for vlm_embed/rerankqa, 100Gi parse, 300Gi VL) | NIMCache PVC size. |
| nimOperator.<key>.replicas | 1 | Per-NIMService replica count. |
| nimOperator.nimServiceGpuLimit | 1 | Default nvidia.com/gpu limit on every NIMService when per-NIM resources is {}. Set to null for operator-only reconciliation (not reliable on all NIM Operator versions; see GPU limits and helm upgrade). |
| nimOperator.<key>.resources | {} | Per-NIM override of the whole resources block. Empty uses nimServiceGpuLimit; non-empty replaces the chart default (may require --force-conflicts on later helm upgrade). |
| nimOperator.modelProfile | {} | Chart-wide NIMCache GPU/profile filter. Applied to every NIMCache that does not have its own override. See Filtering cached GPU profiles. |
| nimOperator.<key>.modelProfile | {} | Per-NIM NIMCache GPU/profile filter. Non-empty values REPLACE the chart-wide default (no merge). See Filtering cached GPU profiles. |
| nimOperator.<key>.expose.service.port | 8000 (9000 for audio) | HTTP port. |
| nimOperator.<key>.expose.service.grpcPort | 8001 (50051 for audio) | gRPC port. |

Only the four "core" NIMs (page_elements, table_structure, ocr, vlm_embed) are auto-wired into the retriever-service config. Optional NIMs may reconcile when nimOperator.<key>.enabled is true in values.yaml, but the retriever-service won't call them unless you wire your pipeline to use them. For 26.05, prefer the minimal install overrides.
Every NIMCache the chart renders supports the NIM Operator's
spec.source.ngc.model block, which restricts which model profiles the
cache job downloads. The chart exposes this through two values:
| Path | Scope | Behaviour |
|---|---|---|
| nimOperator.modelProfile | Chart-wide | Applied to every NIMCache that doesn't carry its own override. |
| nimOperator.<key>.modelProfile | Per-NIM | When non-empty, REPLACES the chart-wide default (no merge). |

Both default to {}. With both empty the chart emits no model:
block and the NIM Operator falls back to its "cache every profile
applicable to the detected GPUs" default — fine on a single-GPU
laptop, but on heterogeneous clusters (or any cluster with ≥ 3 NIMs)
this wastes tens of GiB of PVC storage, NGC bandwidth, and cache-job
runtime.
The mapping is rendered verbatim under spec.source.ngc.model, so the
shape lines up 1:1 with the NIMCache CRD.
Two filter dimensions are supported (use whichever fits your cluster;
gpus is the common case):
nimOperator:
  modelProfile:
    gpus:
      # NIMCache only downloads profiles compatible with at least one
      # of these GPU selectors. Each selector is {ids: [...], product: ...}.
      - ids: ["26B5"]                       # PCI device ID(s)
        product: "NVIDIA-H100-80GB-HBM3"    # NVIDIA marketing name
    # profiles:
    #   # Alternative: list of exact profile UUIDs from `ngc registry
    #   # model list-profiles <repo>/<image>:<tag>`.
    #   - "11111111-2222-3333-4444-555555555555"

Equivalent overrides via --set:
# Homogeneous H100 80 GB cluster — every NIMCache only pulls the H100 profile:
helm upgrade --install retriever ./nemo_retriever/helm \
--set 'nimOperator.modelProfile.gpus[0].ids[0]=26B5' \
--set 'nimOperator.modelProfile.gpus[0].product=NVIDIA-H100-80GB-HBM3'
# Restrict only the page_elements NIMCache to a specific profile UUID, leave the rest alone:
helm upgrade --install retriever ./nemo_retriever/helm \
--set 'nimOperator.page_elements.modelProfile.profiles[0]=11111111-2222-3333-4444-555555555555'
# Chart-wide H100 default plus a per-NIM override (the override REPLACES the global; it does NOT merge):
helm upgrade --install retriever ./nemo_retriever/helm \
--set 'nimOperator.modelProfile.gpus[0].product=NVIDIA-H100-80GB-HBM3' \
--set 'nimOperator.vlm_embed.modelProfile.profiles[0]=22222222-3333-4444-5555-666666666666'

Tips:
- Run ngc registry model list-profiles nvcr.io/nim/nvidia/<image>:<tag> to enumerate the available profiles for any chart-pinned NIM image and pick the smallest profile that matches your GPU.
- Filter mismatches surface as NIMCache events such as NoCompatibleProfile; check with kubectl describe nimcache <name>.
- The chart's defaults ({}) preserve operator behaviour, so adding modelProfile is a strict opt-in; existing releases keep working unchanged.
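The replace-not-merge rule for modelProfile is easy to misread, so here is a minimal Python sketch of the selection logic as this section describes it. The function name `effective_model_profile` is hypothetical and the code is not taken from the chart's templates.

```python
from typing import Any, Dict, Optional


def effective_model_profile(
    chart_wide: Dict[str, Any],
    per_nim: Dict[str, Any],
) -> Optional[Dict[str, Any]]:
    """Pick the mapping rendered under spec.source.ngc.model.

    A non-empty per-NIM value replaces the chart-wide one wholesale;
    keys are never merged. Returns None when both are empty, which
    corresponds to the chart emitting no `model:` block at all.
    """
    chosen = per_nim if per_nim else chart_wide
    return dict(chosen) if chosen else None
```

For example, with a chart-wide `gpus` filter and a per-NIM `profiles` list, the per-NIM NIMCache receives only `profiles`; the `gpus` selector is dropped for that NIM rather than combined with it.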
Every NIM in this chart pins an exact NGC image tag in values.yaml
— there is no :latest floating reference. Two tag families show up:
| Family | Example | Meaning |
|---|---|---|
| Plain semver | nemotron-page-elements-v3:1.8.0 | A standard NIM release, identical bytes on every pull. Used by the four core NIMs and the reranker / ASR NIMs. |
| <semver>-variant | nemotron-parse-v1.2:1.7.0-variant, nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant | The Nemotron Parse and Nemotron 3 Nano Omni 30B builds that ship per-GPU TensorRT engine variants the NIM Operator selects from at reconciliation time (see the Omni and Parse rows in the model hardware requirements table). The -variant suffix is the NGC tag that ships alongside the 26.05 chart and matches footnote ³ of the support matrix. |

For air-gapped mirror pipelines: mirror the exact tag — both the
plain semver and the -variant form — and do not substitute :latest.
Substituting :latest would pin to a moving target that may not match
the engine plans the NIM Operator profile expects for a given GPU.
If you want a different NIM build, override the tag explicitly:
helm upgrade --install retriever ./nemo_retriever/helm \
--set nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.enabled=true \
--set nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.image.tag=<your-tag>

and validate against the same release of the retriever service before production rollout.
Charts and captioning (26.05). Charts and infographics use page_elements
and ocr (no graphic_elements operator NIM in this chart). For image
captioning, set nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.enabled=true — see
Image captioning (Omni 30B) for the
chart-side wiring and
Image captioning (26.05)
for the product matrix.
The Nemotron 3 Nano Omni VLM is the canonical image-caption NIM for 26.05. When you enable it,
helm upgrade --install retriever ./nemo_retriever/helm \
--set nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.enabled=true \
...

the chart now auto-wires two fields into the rendered
retriever-service.yaml ConfigMap:
nim_endpoints:
  caption_invoke_url: "http://nemotron-3-nano-omni-30b-a3b-reasoning:8000/v1/chat/completions"
  caption_model_name: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"

The service derives caption_enabled=true from a non-null
caption_invoke_url, so the ingestion pipeline routes caption work to
the in-cluster Omni Pod with no manual ConfigMap edits.
Resolution order mirrors every other NIM endpoint (see the NIM Operator sub-stack section):
- Explicit serviceConfig.nimEndpoints.captionInvokeUrl always wins (use this to point at a hosted endpoint, e.g. https://integrate.api.nvidia.com/v1/chat/completions).
- Otherwise the operator-managed URL of nemotron-3-nano-omni-30b-a3b-reasoning is used, provided nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning.enabled=true and the apps.nvidia.com/v1alpha1 CRDs are installed.
- Otherwise caption_invoke_url stays null and the caption stage is disabled.
serviceConfig.nimEndpoints.captionModelName follows the same order —
it defaults to the canonical Omni remote model id
(nvidia/nemotron-3-nano-omni-30b-a3b-reasoning, matching
nemo_retriever.common.modality.caption.model_profiles.OMNI_REMOTE_MODEL_ID) whenever
the chart resolves any caption URL. Override only when pointing at a
different VLM SKU.
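Unlike the embed endpoint, an unresolved caption endpoint disables a stage instead of failing the install. The sketch below models that behaviour in Python under stated assumptions: `resolve_caption_endpoints` is a hypothetical name, the URL and model id are the ones quoted above, and what happens to an explicit model name when no URL resolves is a guess that the chart may handle differently.

```python
from typing import Dict, Optional

OMNI_SERVICE = "nemotron-3-nano-omni-30b-a3b-reasoning"
OMNI_MODEL_ID = "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"


def resolve_caption_endpoints(
    explicit_url: str = "",
    explicit_model: str = "",
    omni_enabled: bool = False,
    crds_installed: bool = False,
) -> Dict[str, Optional[object]]:
    """Model the caption URL / model-name resolution described above."""
    url: Optional[str]
    if explicit_url:
        url = explicit_url  # explicit override always wins
    elif omni_enabled and crds_installed:
        url = f"http://{OMNI_SERVICE}:8000/v1/chat/completions"
    else:
        url = None  # caption stage stays disabled
    # The canonical model id is filled in only when some URL resolved.
    # Passing an explicit model through without a URL is an assumption.
    model = explicit_model or (OMNI_MODEL_ID if url else None)
    return {
        "caption_invoke_url": url,
        "caption_model_name": model,
        # Derived by the service from a non-null URL, not set in values.
        "caption_enabled": url is not None,
    }
```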
The chart defaults to nimOperator.nimServiceGpuLimit: 1, which
renders spec.resources.limits.nvidia.com/gpu: 1 on every NIMService
unless a per-NIM resources map overrides it. This is required on
NIM Operator v3.1.1 (and other versions tested on A100/H100): when
the chart omits the resources block entirely, the operator often
does not populate GPU limits from the model profile, and NIM pods
start without GPU access (The NVIDIA Driver was not detected).
Trade-off: Helm and the NIM Operator may both server-side-apply
spec.resources.limits.nvidia.com/gpu. A later helm upgrade --install
can then fail with:
Error: UPGRADE FAILED: conflict occurred while applying object
<ns>/<nim> apps.nvidia.com/v1alpha1, Kind=NIMService:
Apply failed with 1 conflict:
conflict with "manager" using apps.nvidia.com/v1alpha1:
.spec.resources.limits.nvidia.com/gpu
Operator-only mode (omit GPU limits from Helm — only if your NIM Operator version reliably reconciles them):
nimOperator:
  nimServiceGpuLimit: null

If upgrades hit SSA conflicts after the operator has reconciled GPU limits, use one of:
- helm upgrade --install … --force-conflicts --server-side
- kubectl -n <ns> edit nimservice <name> to set GPU limits outside Helm
To pin a non-default GPU count chart-wide, set nimServiceGpuLimit: 2
(or set per-NIM resources.limits.nvidia.com/gpu).
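The interaction between nimServiceGpuLimit and a per-NIM resources map can be summarised in a few lines. This Python sketch follows the behaviour described in this section and in the values table; `nimservice_resources` is a hypothetical helper, not the chart's template logic.

```python
from typing import Any, Dict, Optional


def nimservice_resources(
    gpu_limit: Optional[int] = 1,
    per_nim_resources: Optional[Dict[str, Any]] = None,
) -> Optional[Dict[str, Any]]:
    """Return the resources block rendered on a NIMService, or None."""
    # A non-empty per-NIM map replaces the chart default entirely.
    if per_nim_resources:
        return per_nim_resources
    # nimServiceGpuLimit: null means operator-only mode: render no block.
    if gpu_limit is None:
        return None
    # Default path: pin the GPU limit so pods start with GPU access.
    return {"limits": {"nvidia.com/gpu": gpu_limit}}
```

The default call returns a one-GPU limit, `gpu_limit=2` pins two GPUs chart-wide, and `gpu_limit=None` leaves the field for the operator to populate.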
The core OCR NIM is configured under nimOperator.ocr (the ocr:
block). Confirm image.repository and image.tag before you upgrade.
| Path | Role |
|---|---|
| nimOperator.nimCache.keepOnUninstall | When true, NIMCache CRs survive helm uninstall (helm.sh/resource-policy: keep). NIMService CRs are always removed. Set false for dev clusters that should fully tear down on uninstall. |
| nimOperator.ocr.enabled | Reconcile the OCR NIMService |
| nimOperator.ocr.image.repository | NIM image (default nvcr.io/nim/nvidia/nemotron-ocr-v1) |
| nimOperator.ocr.image.tag | Pin the image tag for reproducible upgrades |

Override the auto-wired in-cluster URL with serviceConfig.nimEndpoints.ocrInvokeUrl
when the OCR service runs outside the operator sub-stack.
| Path | Default | Notes |
|---|---|---|
| persistence.enabled | true | |
| persistence.size | 50Gi | |
| persistence.accessModes | [ReadWriteOnce] | Required by SQLite. |
| persistence.storageClass | "" | Use cluster default unless set. Use "-" to disable a storageClassName. |
| persistence.mountPath | /var/lib/nemo-retriever | Both DB and log file are written here. |

| Path | Default | Notes |
|---|---|---|
| ngcImagePullSecret.create | false | Chart-managed dockerconfigjson Secret. |
| ngcImagePullSecret.name | ngc-secret | Name referenced by every Pod and every NIMService. |
| ngcImagePullSecret.password | "" | NGC API key. |
| ngcApiSecret.create | false | Chart-managed Opaque Secret. |
| ngcApiSecret.name | ngc-api | Name referenced by NIMCache/NIMService authSecret. |
| ngcApiSecret.password | "" | NGC API key (populates NGC_API_KEY + NGC_CLI_API_KEY). |
| imagePullSecrets | [] | Extra pre-existing pull secrets appended to every Pod. |

| Feature | Toggle | Default |
|---|---|---|
| Ingress | ingress.enabled | true |
| Autoscaling (HPA) | autoscaling.enabled | false (max=1 anyway) |
| ServiceMonitor | serviceMonitor.enabled | false (auto-enabled in split mode) |

The chart renders retriever-service.yaml from structured values so you
shouldn't normally need to ship a verbatim file. If you really want to,
mount one via service.extraVolumes + service.extraVolumeMounts at
/etc/nemo-retriever/retriever-service.yaml (which silently overrides the
chart-managed ConfigMap because subPath mounts win).
ngcImagePullSecret:
  create: false          # don't render; reference an existing Secret
  name: my-org-ngc-pull
ngcApiSecret:
  create: false
  name: my-org-ngc-api

The chart will skip Secret creation. Make sure my-org-ngc-pull exists
as kubernetes.io/dockerconfigjson and my-org-ngc-api as Opaque with
an NGC_API_KEY key, in the release namespace.
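If you pre-create the pull secret yourself, its payload must be a valid .dockerconfigjson document. The following Python sketch builds one for nvcr.io, assuming the standard NGC convention of logging in with the literal username `$oauthtoken` and the API key as the password. The function name `ngc_dockerconfigjson` is hypothetical, and whether the chart-managed Secret is rendered byte-for-byte the same way is not confirmed here.

```python
import base64
import json


def ngc_dockerconfigjson(api_key: str, registry: str = "nvcr.io") -> str:
    """Build the JSON payload for a kubernetes.io/dockerconfigjson Secret."""
    username = "$oauthtoken"  # literal username used for NGC API-key logins
    # Docker's `auth` field is base64("username:password").
    auth = base64.b64encode(f"{username}:{api_key}".encode("utf-8")).decode("ascii")
    return json.dumps(
        {
            "auths": {
                registry: {
                    "username": username,
                    "password": api_key,
                    "auth": auth,
                }
            }
        }
    )
```

The returned string is the value stored under the Secret's `.dockerconfigjson` key (base64-encoded once more if you place it under `data` rather than `stringData`).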
nimOperator:
  vlm_embed:
    enabled: false   # don't deploy the embed NIM in-cluster
serviceConfig:
  nimEndpoints:
    embedInvokeUrl: https://integrate.api.nvidia.com/v1/embeddings

The chart's resolution order is explicit URL → operator-managed URL → empty, so per-endpoint overrides Just Work.
The Deployment carries a checksum/config annotation derived from the
ConfigMap, so helm upgrade automatically rolls the pod when any
serviceConfig.* value changes.
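The rollout trigger works because any change to the rendered config changes the annotation value, which changes the Pod template. A minimal Python sketch of the idea follows; it assumes the annotation is a SHA-256 digest of the rendered ConfigMap text, which is the common Helm pattern but is not confirmed for this chart, and the config keys in the usage note are hypothetical.

```python
import hashlib


def config_checksum(rendered_config: str) -> str:
    """Return a hex SHA-256 digest of the rendered config text.

    Assumption: the chart hashes the rendered ConfigMap in a similar
    way to produce its checksum/config annotation.
    """
    return hashlib.sha256(rendered_config.encode("utf-8")).hexdigest()
```

Hashing `"pipeline:\n  batch_workers: 48\n"` and `"pipeline:\n  batch_workers: 16\n"` yields two different digests, so the Deployment's Pod template differs between the two releases and Kubernetes rolls the pods; an unchanged config yields the same digest and no rollout.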
Batch ingest fans out extract and embed work to remote NIM HTTP endpoints.
Under heavy parallelism a single slow or overloaded NIM can cause timeouts,
and a worker process crash can surface as many simultaneous failed
document callbacks even though only one root cause occurred.
| Layer | Default | Where it is set |
|---|---|---|
| Remote embed HTTP calls | 600 s (10 min) | Service image (EmbedParams.request_timeout_s); not a Helm value today. |
| Gateway → realtime/batch proxy | 300 s | Rendered gateway.timeout_s in retriever-service.yaml (split topology). |
| VLM embed model name | serviceConfig.vectordb.embedModel | Also copied into worker nim_endpoints.embed_model_name in the ConfigMap. |

Symptoms to look for in pod logs:
- Embedding error occurred: timed out or httpx.ReadTimeout on the batch pod.
- Batch process pool broken (worker crash) followed by many BrokenProcessPool failures on other in-flight documents.
- Embed NIM pod messages such as failed to allocate pinned system memory (GPU pressure from too many concurrent /v1/embeddings requests).
The gateway pod usually only logs status=failed callbacks; diagnose on
batch (and realtime for page-sized uploads), plus the embed NIM pod.
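When a failure produces hundreds of log lines, it helps to bucket them by the symptoms listed above before deciding what to tune. The Python sketch below does that with plain substring matching. The bucket labels and function names are hypothetical; only the matched strings come from this README.

```python
from collections import Counter
from typing import Iterable, Optional

# (substring quoted in this README, illustrative bucket label)
SYMPTOMS = [
    ("Embedding error occurred: timed out", "embed timeout"),
    ("httpx.ReadTimeout", "embed timeout"),
    ("Batch process pool broken", "worker pool crash"),
    ("BrokenProcessPool", "worker pool crash"),
    ("failed to allocate pinned system memory", "embed NIM GPU pressure"),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return the first matching bucket for a log line, or None."""
    for needle, label in SYMPTOMS:
        if needle in line:
            return label
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count log lines per bucket, ignoring lines that match nothing."""
    labels = (classify_log_line(line) for line in lines)
    return Counter(label for label in labels if label is not None)
```

A summary dominated by "worker pool crash" with a single "embed timeout" often points at one root cause fanning out, which is the pattern the paragraph above warns about.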
1. Lower batch worker concurrency (first step).
The default serviceConfig.pipeline.batchWorkers is 48, which can saturate
a single in-cluster VLM embed NIM. If you see embed timeouts or pool crashes,
reduce batch parallelism to 16 and redeploy:
helm upgrade retriever ./nemo_retriever/helm \
--reuse-values \
--set serviceConfig.pipeline.batchWorkers=16

You can tune further (for example 8 on small GPU nodes), but 16 is a
reasonable starting point when moving off the default. Realtime workers
(realtimeWorkers, default 24) are less likely to overload embed NIMs
because they handle smaller units of work; adjust them only if realtime
ingest shows the same timeout pattern.
2. Confirm embed wiring.
Ensure nim_endpoints.embed_model_name in the mounted config matches the
VLM embed NIM SKU (serviceConfig.vectordb.embedModel, default
nvidia/llama-nemotron-embed-vl-1b-v2). A model mismatch produces
HTTP 404 on /v1/embeddings, not a timeout, but is worth ruling out when
debugging failed ingests.
3. Retry failed documents.
Failures caused by a one-time pool restart are often transient. After lowering
batchWorkers and rolling the batch Deployment, resubmit documents that
failed with rows=0.
4. Scale or isolate the embed NIM.
If timeouts persist at batchWorkers: 16, add embed NIM replicas (when your
cluster has GPU capacity), point serviceConfig.nimEndpoints.embedInvokeUrl
at an external embed endpoint, or temporarily disable optional NIMs on
dev clusters to free GPU memory for vlm_embed.
5. Client and ingress timeouts.
Long batch jobs may exceed the gateway proxy timeout (300 s) or an Ingress
proxy-read-timeout. Increase ingress annotations if clients disconnect
while workers are still processing; see the commented example on
ingress.annotations in values.yaml.
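To see which limit a long-running client request hits first, compare the timeouts on the request path. The following is a minimal Python sketch: the 300 s gateway value comes from the table above, the 60 s Ingress value is only an assumed example (substitute what your ingress controller actually uses), and `first_limit_hit` is a hypothetical helper, not part of the chart.

```python
def first_limit_hit(timeouts_s: dict[str, float]) -> tuple[str, float]:
    """Return the (layer, seconds) pair with the smallest timeout.

    The smallest value on the client-to-worker path is the one that
    disconnects the client while workers may still be processing.
    """
    layer = min(timeouts_s, key=timeouts_s.__getitem__)
    return layer, timeouts_s[layer]


request_path = {
    "ingress proxy-read-timeout": 60.0,  # ASSUMPTION: example value, check your ingress
    "gateway proxy timeout": 300.0,      # default from the table above
}

layer, seconds = first_limit_hit(request_path)
print(f"{layer} fires first after {seconds:.0f} s")
```

If the Ingress limit is the smaller one, raising only the gateway timeout will not stop client disconnects.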
In topology.mode: split deployments the realtime and batch worker
pods scale horizontally based on queue fill ratio and
95th-percentile processing latency. Both signals come straight out
of the pods' /metrics endpoint — the publisher is always on (see
nemo_retriever_pool_queue_depth_ratio in
prometheus.py).
The only choice you have to make is how the metrics get from
Prometheus into the Kubernetes HPA.
A CPU-based HPA reacts only after a pod has already saturated its CPU. For an ingest pipeline that fans out to remote NIM endpoints, the work spends most of its time blocked on HTTP, so CPU stays low even when the queue is full. Queue depth measures the demand waiting to be served, which is what we actually want to scale on. A 95th-percentile latency signal rides alongside to catch the inverse case: a single hot pod whose queue is shallow but whose per-item processing has stalled.
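The difference between the signals is easy to see with the scaling rule documented for the Kubernetes HPA, `desiredReplicas = ceil(currentReplicas * currentValue / targetValue)`, where the largest result across metrics wins. The sketch below is a simplified model: it ignores the tolerance band, stabilization windows, and min/max clamping, and the sample numbers are illustrative, not measurements from this service.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA ratio rule, without tolerance or stabilization."""
    return math.ceil(current_replicas * current_value / target_value)


def scale_decision(current_replicas: int, signals: dict[str, tuple[float, float]]) -> int:
    """With several metrics, the HPA acts on the largest proposed replica count."""
    return max(
        desired_replicas(current_replicas, value, target)
        for value, target in signals.values()
    )


# Illustrative numbers: workers blocked on NIM HTTP calls, queue nearly full.
replicas = 2
cpu_only = {"cpu": (0.20, 0.60)}  # 20% utilization vs. 60% target
queue_aware = {
    "queueDepthRatio": (0.90, 0.50),  # 0.9 average fill vs. 0.5 target
    "latencyP95": (12.0, 30.0),       # 12 s vs. 30 s target
}

print("cpu only:   ", scale_decision(replicas, cpu_only))
print("queue aware:", scale_decision(replicas, queue_aware))
```

With these inputs the CPU signal proposes fewer pods while the queue signal proposes more, which is the failure mode the queue-depth metric is meant to avoid.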
The chart's autoscaling.queueDepth.backend controls which path is
wired up. All three options leave the metrics publisher untouched:
| backend | When to pick it | Cluster prerequisite |
|---|---|---|
| `prometheus-adapter` (default) | Production. One adapter feeds HPA + Grafana + future autoscalers. | Prometheus Operator + `prometheus-community/prometheus-adapter`. |
| `cpu` | Bootstrap / dev cluster without Prometheus. | None (built-in). |
| `keda` | Already standardised on KEDA org-wide. | KEDA operator (you install + apply your own `ScaledObject`). |
The chart-recommended path is prometheus-adapter. The reasoning is
documented in values.yaml; in short, it keeps a single Prometheus as
the source of truth, supports HPA's multi-metric arithmetic-mean
evaluation out of the box, and doesn't force the chart to bundle new
CRDs.
The chart renders a ConfigMap named
<release>-nemo-retriever-prom-adapter-rules containing PromQL rules
for the External Metrics API. You point your existing
prometheus-adapter at it:
```bash
helm upgrade prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --reuse-values \
  --set rules.existing=<release>-nemo-retriever-prom-adapter-rules
```
Then verify both metrics show up in the External Metrics API:
```bash
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/$NS/nemo_retriever_pool_queue_depth_ratio_avg?labelSelector=pool%3Drealtime" \
  | jq .
```
Once that returns a non-empty `items` array, the HPAs rendered by this
chart will start consuming them. The HPA annotation
nemo-retriever.nvidia.com/hpa-signals documents the active set per
HPA, e.g. queueRatio=true latencyP95=true cpu=false.
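The same check can be scripted. The following is a minimal sketch, assuming the response is the standard `ExternalMetricValueList` JSON served by the External Metrics API; the sample payload is made up for illustration and `metric_values` is a hypothetical helper.

```python
import json


def metric_values(raw_json: str) -> list[tuple[str, str]]:
    """Return (metricName, value) pairs from an ExternalMetricValueList."""
    payload = json.loads(raw_json)
    return [(item["metricName"], item["value"]) for item in payload.get("items", [])]


# Illustrative payload in the shape the `kubectl get --raw` call prints.
sample = """
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "nemo_retriever_pool_queue_depth_ratio_avg",
      "metricLabels": {"pool": "realtime"},
      "timestamp": "2025-01-01T00:00:00Z",
      "value": "350m"
    }
  ]
}
"""

values = metric_values(sample)
if not values:
    raise SystemExit("no items: the adapter is not serving the metric yet")
print(values)
```

In practice you would feed the function the output of the `kubectl get --raw` command instead of the inline sample.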
Set autoscaling.queueDepth.backend: cpu and enable the CPU metric
under each role:
```yaml
autoscaling:
  queueDepth:
    backend: cpu
topology:
  realtime:
    hpa:
      metrics:
        queueDepthRatio: { enabled: false }
        processingLatencyP95: { enabled: false }
        cpu: { enabled: true, targetUtilizationPercentage: 60 }
  batch:
    hpa:
      metrics:
        queueDepthRatio: { enabled: false }
        processingLatencyP95: { enabled: false }
        cpu: { enabled: true, targetUtilizationPercentage: 80 }
```
The legacy `topology.<role>.hpa.targetCPUUtilizationPercentage` field
still works and behaves as an alias for the metrics.cpu block.
Set autoscaling.queueDepth.backend: keda and disable the chart-managed
HPAs:
```yaml
autoscaling:
  queueDepth: { backend: keda }
topology:
  realtime: { hpa: { enabled: false } }
  batch: { hpa: { enabled: false } }
```
Then apply your own `ScaledObject`. Example for the realtime pool:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nemo-retriever-realtime
spec:
  scaleTargetRef:
    name: nemo-retriever-realtime
  minReplicaCount: 2
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: nemo_retriever_pool_queue_depth_ratio
        threshold: "0.5"
        query: |
          avg by (pool) (
            nemo_retriever_pool_queue_depth{pool="realtime"}
            /
            on(pool, instance) group_left()
            nemo_retriever_pool_max_queue_size{pool="realtime"}
          )
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: nemo_retriever_pool_processing_duration_p95
        threshold: "30"
        query: |
          histogram_quantile(
            0.95,
            sum by (le, pool) (
              rate(nemo_retriever_pool_processing_duration_seconds_bucket{pool="realtime"}[2m])
            )
          )
```
KEDA's biggest win is scale-from-zero, which we don't use today:
both minReplicas defaults are ≥ 1 because the realtime pod is on the
hot path for SSE consumers. If you do want scale-from-zero (e.g. a
nightly batch-only job tenant), KEDA is the right tool and this is the
escape hatch.
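The latency trigger relies on PromQL's `histogram_quantile`, which estimates a quantile by linear interpolation inside the histogram bucket that contains the target rank. The sketch below reimplements that estimate for classic cumulative `le` buckets so the `"30"` threshold is easier to reason about. The bucket boundaries and counts are invented for illustration; they are not the buckets the service exports.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be the +Inf bucket, and at
    least one finite bucket is required.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2:
        raise ValueError("need at least one finite bucket plus +Inf")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Invented sample: 100 observations, 90 of them at or under 10 s.
sample = [(1.0, 10), (5.0, 50), (10.0, 90), (30.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated, its resolution is limited by the bucket layout: a threshold that falls inside one wide bucket reacts less sharply than one that sits near a bucket boundary.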
Per-role tuning lives under topology.<role>.hpa.metrics:
```yaml
topology:
  realtime:
    hpa:
      metrics:
        queueDepthRatio: { enabled: true, target: "500m" }   # 0.5
        processingLatencyP95: { enabled: true, targetSeconds: "30" }
  batch:
    hpa:
      metrics:
        queueDepthRatio: { enabled: true, target: "700m" }   # 0.7; batch can run hot
        processingLatencyP95: { enabled: true, targetSeconds: "120" }
```
Quantity-string conventions are k8s standard: `500m` == 0.5, `2`, `2k`,
etc. The target is per-replica because the HPA template uses
type: AverageValue for both External metrics — that's what makes
"scale up when average queue fill across pods exceeds 0.5" work
without baking the pod count into the publisher.
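The following small sketch shows how the quantity strings above map to numbers. It handles only plain values, the `m` (milli) suffix, and the decimal `k`/`M`/`G` suffixes, which is a subset of the Kubernetes quantity grammar. `parse_quantity` is a hypothetical helper written for this example, not a Kubernetes client function.

```python
from decimal import Decimal

# Subset of Kubernetes decimal suffixes; binary suffixes (Ki, Mi, ...) are omitted.
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(1000),
    "M": Decimal(10**6),
    "G": Decimal(10**9),
}


def parse_quantity(text: str) -> Decimal:
    """Parse a subset of Kubernetes quantity strings into a Decimal."""
    text = text.strip()
    if text and text[-1] in _SUFFIXES:
        return Decimal(text[:-1]) * _SUFFIXES[text[-1]]
    return Decimal(text)


for raw in ("500m", "700m", "2", "2k"):
    print(f"{raw!r} -> {parse_quantity(raw)}")
```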
```bash
# Cause realtime pressure (anything that submits to /v1/ingest/job/.../page).
# Then watch the HPA decide:
kubectl get hpa -w
# And watch the active signals on each HPA:
kubectl get hpa <release>-realtime -o jsonpath='{.metadata.annotations.nemo-retriever\.nvidia\.com/hpa-signals}'
```
The dashboard's Worker Pool Capacity card on the Overview page mirrors the same signal Prometheus is seeing, so it's a quick eyeball sanity check before opening Grafana.
The chart defaults target generic Kubernetes clusters that allow fixed numeric
UIDs (runAsUser / runAsGroup / fsGroup 1000). OpenShift 4.x
namespaces under the default restricted-v2 Security Context Constraint (SCC)
and Pod Security Admission (PSA) restricted profile assign a per-namespace
UID/GID range instead. A stock helm install without overrides therefore fails
SCC validation, emits PSA warnings, or crashes on log paths the random UID cannot
write.
We do not change chart defaults for OpenShift-only behavior (that would affect
other platforms). Use the overrides below on OpenShift, or save the YAML block
into a local values file and pass -f <file>.
| Control | Typical default on a new OpenShift project |
|---|---|
| SCC | restricted-v2 (first match in priority order) |
| PSA | pod-security.kubernetes.io/warn=restricted (and often audit=restricted; enforce may be unset on dev clusters) |
| UID assignment | SCC injects runAsUser / fsGroup from the namespace range (for example 1000750000–1000759999) |
On clusters with PSA enforce=restricted, missing container securityContext
fields become hard rejections, not warnings.
| Symptom on stock install | Cause | Helm override |
|---|---|---|
| `FailedCreate`: UID/GID 1000 not in namespace range | Hardcoded `service.podSecurityContext` UID/GID/fsGroup | Omit `runAsUser`, `runAsGroup`, and `fsGroup`; keep only `runAsNonRoot: true` |
| PSA warning: `allowPrivilegeEscalation`, `capabilities`, `seccompProfile` | Empty `service.securityContext` | Set restricted baseline on `service.securityContext` (see sample below) |
| `PermissionError` on `/var/lib/nemo-retriever/retriever-service.log` when `persistence.enabled=false` | Default log path is image-owned; random UID cannot write without a PVC | Point `serviceConfig.logging.file` at `/tmp/...` (chart mounts `emptyDir` at `/tmp`) |
| `CreateContainerConfigError`: non-numeric image `USER nemo` on vectordb | Vectordb container has no `securityContext` block for SCC to annotate | Disable vectordb for smoke tests, or patch the vectordb Deployment after install (below) |
| PSA warnings on otel-collector | Otel Deployment has no `securityContext` in the chart | `topology.otel.enabled=false` unless you patch that Deployment |
```yaml
# OpenShift overrides for nemo-retriever Helm chart (restricted-v2 / PSA restricted).
# Save locally, then: helm install retriever ./nemo_retriever/helm -f <your-file>.yaml ...
service:
  podSecurityContext:
    runAsNonRoot: true
    # Do NOT set runAsUser, runAsGroup, or fsGroup; OpenShift SCC assigns them.
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: RuntimeDefault
serviceConfig:
  logging:
    # Writable without persistence PVC (chart always mounts emptyDir at /tmp).
    file: /tmp/retriever-service.log
  vectordb:
    # Set false for minimal service-only validation; see vectordb patch below if enabled.
    enabled: false
topology:
  otel:
    enabled: false
```
When `persistence.enabled=true`, you can keep the default log path under
persistence.mountPath (/var/lib/nemo-retriever) because the PVC is mounted and
SCC-assigned fsGroup applies. When persistence is off, always relocate logs to
/tmp (or another path backed by service.extraVolumes).
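As a pre-flight check, the restricted-profile fields discussed above can be validated before installing. The sketch below covers only the four fields used in the sample values; it is not a full Pod Security Standards validator and it does not check the SCC UID range. `restricted_gaps` is a hypothetical helper that takes the pod-level and container-level security contexts as plain dictionaries, and the "stock" input is an illustrative guess at the defaults described above.

```python
def restricted_gaps(pod_sc: dict, container_sc: dict) -> list[str]:
    """List restricted-profile fields that are missing or too permissive."""
    gaps = []
    # runAsNonRoot may be set at pod level or overridden at container level.
    if container_sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
        gaps.append("runAsNonRoot must be true")
    if container_sc.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation must be false")
    if "ALL" not in container_sc.get("capabilities", {}).get("drop", []):
        gaps.append("capabilities.drop must include ALL")
    seccomp = container_sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
    if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
        gaps.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return gaps


# ASSUMPTION: illustrative stand-in for the stock values (fixed UID, empty container context).
stock = restricted_gaps(
    {"runAsNonRoot": True, "runAsUser": 1000, "runAsGroup": 1000, "fsGroup": 1000},
    {},
)
# Values from the OpenShift override sample.
overridden = restricted_gaps(
    {"runAsNonRoot": True},
    {
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
)
print("stock gaps:", stock)
print("override gaps:", overridden)
```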
Matches QA validation with external NIMs disabled, no persistence, and no results PVC:
```bash
oc new-project nemo-retriever
oc create secret docker-registry ngc-secret -n nemo-retriever \
  --docker-server=nvcr.io --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
oc create secret generic ngc-api -n nemo-retriever \
  --from-literal=NGC_API_KEY="$NGC_API_KEY" \
  --from-literal=NGC_CLI_API_KEY="$NGC_API_KEY"
helm install retriever ./nemo_retriever/helm -n nemo-retriever \
  -f <your-openshift-overrides>.yaml \
  --set ngcImagePullSecret.create=false \
  --set ngcApiSecret.create=false \
  --set nims.enabled=false \
  --set persistence.enabled=false \
  --set retrieverResults.enabled=false
```
Verify pods:
```bash
oc get pods -n nemo-retriever
oc describe pod -l app.kubernetes.io/name=nemo-retriever -n nemo-retriever
```
You should see SCC-assigned numeric `runAsUser` on containers that declare a
securityContext block, and no PSA warnings once overrides are applied.
serviceConfig.vectordb.enabled=true renders a vectordb container from the
same image (USER nemo, non-numeric). The chart does not yet expose a
securityContext value for that container. After helm install, patch the
Deployment so OpenShift can inject a numeric UID into the container spec:
```bash
RELEASE=retriever
NS=nemo-retriever
VDB_DEPLOY="${RELEASE}-nemo-retriever-vectordb"
oc patch deployment "$VDB_DEPLOY" -n "$NS" --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/securityContext", "value": {
    "allowPrivilegeEscalation": false,
    "capabilities": {"drop": ["ALL"]},
    "runAsNonRoot": true,
    "seccompProfile": {"type": "RuntimeDefault"}
  }}
]'
```
Re-apply the patch after `helm upgrade` if the Deployment is recreated. A future
chart release may add first-class topology.vectordb.securityContext values.
The chart’s otel-collector Deployment likewise lacks securityContext fields.
Prefer topology.otel.enabled=false (as in the sample values) unless you operate
your own collector or patch *-otel the same way as vectordb.
Do not bind the namespace to anyuid SCC or set PSA enforce=privileged
unless your security team explicitly approves it. The overrides above are intended
to keep restricted-v2 / PSA restricted posture.
See Deployment options — Air-gapped and disconnected deployment for overview and workflow. Chart-specific reference for mirroring:
Verify tags on the Git branch or tag you ship (for example 26.05 or
26.5.0). Defaults below match
values.yaml on the current chart.
| Role | `nimOperator` key | Default image (repository:tag) |
|---|---|---|
| Retriever service | n/a | `service.image.repository`:`service.image.tag` (override for production) |
| Page elements | `page_elements` | `nvcr.io/nim/nvidia/nemotron-page-elements-v3:1.8.0` |
| Table structure | `table_structure` | `nvcr.io/nim/nvidia/nemotron-table-structure-v1:1.8.0` |
| OCR | `ocr` | `nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0` |
| VL embed | `vlm_embed` | `nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2:1.12.0` |
| VL reranker (optional) | `rerankqa` | `nvcr.io/nim/nvidia/llama-nemotron-rerank-vl-1b-v2:1.10.0` |
| Nemotron Parse (optional) | `nemotron_parse` | `nvcr.io/nim/nvidia/nemotron-parse-v1.2:1.7.0-variant` |
| Omni caption (optional) | `nemotron_3_nano_omni_30b_a3b_reasoning` | `nvcr.io/nim/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant` |
| Parakeet ASR (optional) | `audio` | `nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us:1.5.0` |
GPU SKU support for audio is in Model hardware requirements.
Also mirror images for the vectordb sidecar, Redis, or other subcharts if your values enable them.
Example overrides (replace placeholders):
```bash
helm upgrade --install retriever ./nemo_retriever/helm \
  -f my-airgap-values.yaml
```
`my-airgap-values.yaml` should include at least:
service:
image:
repository: <PRIVATE_REGISTRY>/nemo-retriever-service
tag: <PINNED_TAG>
pullPolicy: IfNotPresent
imagePullSecrets:
- name: my-private-registry
ngcImagePullSecret:
create: false # use secrets that authenticate to YOUR mirror
nimOperator:
page_elements:
image:
repository: <PRIVATE_REGISTRY>/nemotron-page-elements-v3
tag: "1.8.0"
pullPolicy: IfNotPresent
# Repeat for table_structure, ocr, vlm_embed, and any optional keys you enable.

- Set `nimOperator.<key>.image.pullSecrets` to the Secret name your `NIMService` resources should use (defaults to `ngc-secret`).
- Leave `serviceConfig.nimEndpoints.*` empty when operator-managed NIMs are in-cluster; set explicit URLs only for external or mirrored services outside the chart.
- For offline captioning, enable `nimOperator.nemotron_3_nano_omni_30b_a3b_reasoning` and point the pipeline caption endpoint at the in-cluster NIM URL (see Image captioning (26.05)).
```bash
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"
docker pull nvcr.io/nim/nvidia/nemotron-page-elements-v3:1.8.0
docker tag nvcr.io/nim/nvidia/nemotron-page-elements-v3:1.8.0 \
  <PRIVATE_REGISTRY>/nemotron-page-elements-v3:1.8.0
docker push <PRIVATE_REGISTRY>/nemotron-page-elements-v3:1.8.0
```
For bulk sync, prefer skopeo or crane.
Record repository@sha256:... digests for regulated environments.
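For more than a handful of images, the pull/tag/push sequence above can be generated. The following is a minimal sketch, assuming the mirror keeps only the final path component of each image (as in the `nemotron-page-elements-v3` example) and using `registry.example.com` as a placeholder registry. `mirror_ref` and `mirror_commands` are hypothetical helpers, not part of the chart.

```python
def mirror_ref(source: str, private_registry: str) -> str:
    """Map nvcr.io/nim/nvidia/<name>:<tag> to <private_registry>/<name>:<tag>."""
    name_and_tag = source.rsplit("/", 1)[-1]
    return f"{private_registry.rstrip('/')}/{name_and_tag}"


def mirror_commands(sources: list[str], private_registry: str) -> list[str]:
    """Emit docker pull/tag/push commands for each source image."""
    commands = []
    for source in sources:
        target = mirror_ref(source, private_registry)
        commands += [
            f"docker pull {source}",
            f"docker tag {source} {target}",
            f"docker push {target}",
        ]
    return commands


# Default images from the table above (required NIMs only).
images = [
    "nvcr.io/nim/nvidia/nemotron-page-elements-v3:1.8.0",
    "nvcr.io/nim/nvidia/nemotron-table-structure-v1:1.8.0",
    "nvcr.io/nim/nvidia/nemotron-ocr-v1:1.3.0",
    "nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2:1.12.0",
]
print("\n".join(mirror_commands(images, "registry.example.com")))
```

Review the generated commands before running them, and extend `images` with any optional NIMs your values enable.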
- PostgreSQL backend: replace `service.db.engine.DatabaseEngine` with a SQLAlchemy/asyncpg-based engine, then bump the chart to deploy a PostgreSQL StatefulSet (or take a sub-chart dependency on Bitnami's chart) and lift `service.replicas` to N.
- NetworkPolicies restricting the service Pod to the NIM Pods + DB only.
- Gateway autoscaling on inflight-uploads (currently fixed `topology.gateway.replicas`); the sticky-routing story for SSE subscribers needs to land first.
The chart is exercised in CI with helm lint and helm template. Run
locally:
```bash
helm lint nemo_retriever/helm
# Operator CRDs present: vectordb resolves vlm_embed via the operator URL.
helm template r nemo_retriever/helm \
  --api-versions apps.nvidia.com/v1alpha1 > /tmp/r-op.yaml
# Operator CRDs absent: vectordb has no operator URL to fall back to, so
# either disable vectordb or supply an explicit embed endpoint.
helm template r nemo_retriever/helm \
  --set serviceConfig.vectordb.enabled=false > /tmp/r.yaml
# or:
# helm template r nemo_retriever/helm \
#   --set serviceConfig.nimEndpoints.embedInvokeUrl=http://embed.svc:8000/v1/embeddings \
#   > /tmp/r.yaml
```
Both renders should succeed cleanly and parse as valid Kubernetes manifests
(kubectl apply --dry-run=client -f /tmp/r.yaml). See VectorDB and the
embed endpoint for why
helm template r nemo_retriever/helm without flags is rejected as a
misconfiguration.