
Commit fc7eb73

JPedro2 and vishwanaths authored
adding kai-scheduler to the pack central community (#164)
Co-authored-by: Vishwanath S <[email protected]>
1 parent 6b5b03a commit fc7eb73


48 files changed (+12614, -0 lines)
Lines changed: 63 additions & 0 deletions
# KAI Scheduler

KAI Scheduler is a robust, efficient, and scalable [Kubernetes scheduler](https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/) that optimizes GPU resource allocation for AI and machine learning workloads.

Designed to manage large-scale GPU clusters with thousands of nodes and a high throughput of workloads, KAI Scheduler is ideal for extensive and demanding environments. It allows Kubernetes cluster administrators to dynamically allocate GPU resources to workloads.

KAI Scheduler supports the entire AI lifecycle, from small, interactive jobs that require minimal resources to large training and inference workloads, all within the same cluster. It ensures optimal resource allocation while maintaining fairness between the different consumers, and it can run alongside other schedulers installed on the cluster.
## Key Features
* **Batch Scheduling**: Ensure all pods in a group are scheduled simultaneously or not at all.
* **Bin Packing & Spread Scheduling**: Optimize node usage either by minimizing fragmentation (bin packing) or by increasing resiliency and load balancing (spread scheduling).
* **Workload Priority**: Prioritize workloads effectively within queues.
* **Hierarchical Queues**: Manage workloads with two-level queue hierarchies for flexible organizational control.
* **Resource Distribution**: Customize quotas, over-quota weights, limits, and priorities per queue.
* **Fairness Policies**: Ensure equitable resource distribution using Dominant Resource Fairness (DRF) and resource reclamation across queues.
* **Workload Consolidation**: Reallocate running workloads intelligently to reduce fragmentation and increase cluster utilization.
* **Elastic Workloads**: Dynamically scale workloads within defined minimum and maximum pod counts.
* **Dynamic Resource Allocation (DRA)**: Support vendor-specific hardware resources through Kubernetes ResourceClaims (e.g., GPUs from NVIDIA or AMD).
* **GPU Sharing**: Allow multiple workloads to efficiently share single or multiple GPUs, maximizing resource utilization.
* **Cloud & On-premise Support**: Fully compatible with dynamic cloud infrastructures (including auto-scalers like Karpenter) as well as static on-premise deployments.
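
The hierarchical-queue and resource-distribution features above are configured through Queue objects. The following is an illustrative sketch only: the `scheduling.run.ai/v2` API group and the exact `spec` field names are assumptions based on upstream KAI Scheduler documentation, and should be verified against the CRDs installed by your KAI Scheduler version:

```yaml
# Hypothetical two-level queue hierarchy: one team queue under a parent queue.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: default-parent-queue  # attach under the default parent queue
  resources:
    gpu:
      quota: 8            # deserved GPUs for this queue
      overQuotaWeight: 1  # share of idle capacity beyond the quota
      limit: 16           # hard cap on GPUs this queue may consume
```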
## Prerequisites
Before installing KAI Scheduler, ensure you have:

- The [NVIDIA GPU Operator](https://docs.spectrocloud.com/integrations/packs/?pack=nvidia-gpu-operator-ai) pack installed, in order to schedule workloads that request GPU resources.
## Installation
KAI Scheduler will be installed in the `kai-scheduler` namespace.

> ⚠️ When submitting workloads, make sure to use a dedicated namespace. Do not use the `kai-scheduler` namespace for workload submission.
### Parameters
| Key | Description | Default |
|-----|-------------|---------|
| `global.registry` | OCI registry hosting KAI images | `ghcr.io/nvidia/kai-scheduler` |
| `global.tag` | Global image tag override | `""` |
| `global.imagePullPolicy` | Pull policy for all images | `IfNotPresent` |
| `global.securityContext` | Pod/container securityContext defaults | `{}` |
| `global.imagePullSecrets` | Image pull secrets for private registries | `[]` |
| `global.leaderElection` | Enable leader election for components | `false` |
| `global.gpuSharing` | Enable GPU sharing | `false` |
| `global.clusterAutoscaling` | Enable autoscaling coordination support | `false` |
| `global.resourceReservation.namespace` | Namespace for reservation pods | `kai-resource-reservation` |
| `global.resourceReservation.appLabel` | App label for resource reservation | `kai-resource-reservation` |
| `operator.qps` | Kubernetes client QPS | `50` |
| `operator.burst` | Kubernetes client burst | `300` |
| `podgrouper.queueLabelKey` | Pod label key for queue assignment | `kai.scheduler/queue` |
| `scheduler.placementStrategy` | Scheduling strategy (`binpack` or `spread`) | `binpack` |
| `admission.cdi` | Use the Container Device Interface (CDI) | `false` |
| `admission.gpuPodRuntimeClassName` | Runtime class for GPU pods | `nvidia` |
| `defaultQueue.createDefaultQueue` | Whether to create the default queue on install | `true` |
| `defaultQueue.parentName` | Parent queue name | `default-parent-queue` |
| `defaultQueue.childName` | Child queue name | `default-queue` |
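For example, to enable GPU sharing and switch the placement strategy to spread, the defaults could be overridden in a values file. This is a sketch using only keys from the parameters table; apply it through your usual pack or Helm values mechanism:

```yaml
# values override: enable GPU sharing and spread placement
global:
  gpuSharing: true          # allow multiple workloads to share GPUs
scheduler:
  placementStrategy: spread # favor resiliency/load balancing over bin packing
```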
## Support & Breaking Changes

Refer to the [Breaking Changes](https://github.com/NVIDIA/KAI-Scheduler/blob/main/docs/migrationguides/README.md) documentation for more information.
## Quick Start
To start scheduling workloads with KAI Scheduler, continue to the [Quick Start example](docs/quickstart/README.md).
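
As a minimal sketch of workload submission: the queue label key matches the `podgrouper.queueLabelKey` default and the child queue name matches `defaultQueue.childName` from the parameters above, while the scheduler name, namespace, and image are assumptions to verify against the Quick Start guide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: team-a-workloads           # a dedicated namespace, never kai-scheduler
  labels:
    kai.scheduler/queue: default-queue  # assign the pod to the default child queue
spec:
  schedulerName: kai-scheduler          # route scheduling to KAI Scheduler
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: "1"           # request one full GPU
```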
Lines changed: 6 additions & 0 deletions
```yaml
apiVersion: v2
appVersion: v0.10.0
description: KAI Scheduler by NVIDIA
name: kai-scheduler
type: application
version: v0.10.0
```
