docs/user/component-catalog.md
Lines changed: 1 addition & 4 deletions
@@ -15,17 +15,14 @@ The source of truth is [`recipes/registry.yaml`](../../recipes/registry.yaml). E
 |**aws-efa**| Device plugin for AWS Elastic Fabric Adapter. Enables low-latency networking on EKS clusters with EFA-capable instances. EKS-specific. |[AWS EFA K8s Device Plugin](https://github.com/aws/eks-charts)|
 |**cert-manager**| Automates TLS certificate management. Required by several operators for webhook and API server certificates. |[cert-manager](https://github.com/cert-manager/cert-manager)|
 |**skyhook-operator**| OS-level node tuning and configuration management. Applies kernel parameters, sysctl settings, and system-level optimizations to nodes. |[Skyhook](https://github.com/nvidia/skyhook)|
-|**skyhook-customizations**| Custom tuning profiles applied via Skyhook. Extends the operator with environment-specific node configurations (kernel params, hugepages, etc.). | — |
 |**nvsentinel**| GPU health monitoring and automated remediation. Detects GPU errors and can cordon or drain affected nodes. |[NVSentinel](https://github.com/NVIDIA/nvsentinel)|
 |**nvidia-dra-driver-gpu**| Dynamic Resource Allocation driver for GPUs. Enables structured GPU device advertisement and claim-based allocation in Kubernetes 1.33+. |[NVIDIA DRA Driver](https://github.com/NVIDIA/k8s-dra-driver-gpu)|
 |**kube-prometheus-stack**| Cluster monitoring: Prometheus, Grafana, Alertmanager, and node exporters. Provides GPU and cluster metrics collection and dashboards. |[kube-prometheus-stack](https://github.com/prometheus-community/helm-charts)|
 |**prometheus-adapter**| Exposes custom metrics from Prometheus to the Kubernetes metrics API. Enables HPA scaling based on GPU utilization and other custom metrics. |[prometheus-adapter](https://github.com/kubernetes-sigs/prometheus-adapter)|
 |**aws-ebs-csi-driver**| CSI driver for Amazon EBS volumes. Provides persistent storage for workloads on EKS. EKS-specific. |[AWS EBS CSI Driver](https://github.com/kubernetes-sigs/aws-ebs-csi-driver)|
 |**k8s-ephemeral-storage-metrics**| Exports ephemeral storage usage metrics per pod. Useful for monitoring scratch space consumption on GPU nodes. |[k8s-ephemeral-storage-metrics](https://github.com/jmcgrath207/k8s-ephemeral-storage-metrics)|
 |**kai-scheduler**| DRA-aware gang scheduler with hierarchical queues and topology-aware placement. Ensures distributed training jobs land on nodes with optimal interconnect topology. |[KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler)|
-|**dynamo-crds**| Custom Resource Definitions for NVIDIA Dynamo inference serving. Installed separately from the platform to support CRD lifecycle management. |[Dynamo](https://github.com/ai-dynamo/dynamo)|
-|**dynamo-platform**| NVIDIA Dynamo inference serving platform. Distributed inference with prefix-cache-aware routing and disaggregated prefill/decode. |[Dynamo](https://github.com/ai-dynamo/dynamo)|
-|**kgateway-crds**| Custom Resource Definitions for kgateway (Kubernetes Gateway API implementation). |[kgateway](https://github.com/kgateway-dev/kgateway)|
+|**dynamo**| NVIDIA Dynamo inference serving platform. Distributed inference with prefix-cache-aware routing and disaggregated prefill/decode. |[Dynamo](https://github.com/ai-dynamo/dynamo)|
 |**kgateway**| Kubernetes Gateway API implementation. Provides model-aware ingress routing for inference workloads. |[kgateway](https://github.com/kgateway-dev/kgateway)|
 |**kubeflow-trainer**| Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. |[Kubeflow Trainer](https://github.com/kubeflow/trainer)|