feat: add pre-deployment checks #3573
Open
biswapanda wants to merge 10 commits into main from bis/dep-461-check-default-storage-class-before-deployment
+1,162
−38
- 97a6c5f feat: add pre-deployment check for storageclass (biswapanda)
- f2a4924 fix license (biswapanda)
- e70032b fix (biswapanda)
- 9f5edfc gpu resources (biswapanda)
- e8afa74 feat: dynamo pre deployment checks (#3574) (biswapanda)
- d2b3941 feat: guides for nixl benchmarking (#3584) (biswapanda)
- 969fec8 add link to pre-deployment checks (biswapanda)
- 0360a21 Merge branch 'main' into bis/dep-461-check-default-storage-class-befo… (biswapanda)
- 95fa843 add link for nccl/nixl tests (biswapanda)
- b27259b Merge branch 'main' into bis/dep-461-check-default-storage-class-befo… (biswapanda)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,172 @@ | ||
<!-- | ||
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
SPDX-License-Identifier: Apache-2.0 | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
--> | ||
|
||
# Pre-Deployment Check Script | ||
|
||
This directory contains a pre-deployment check script that verifies your Kubernetes cluster meets the requirements for deploying Dynamo. | ||
|
||
- For NCCL tests, see the [NCCL tests guide](https://docs.nebius.com/kubernetes/gpu/nccl-test#run-tests). | ||
|
||
- For NIXL benchmarks, see the [NIXL benchmark pre-deployment checks](/deploy/cloud/pre-deployment/nixl/README.md). | ||
|
||
## Usage | ||
|
||
Run the pre-deployment check before deploying Dynamo: | ||
|
||
```bash | ||
./pre-deployment-check.sh | ||
``` | ||
|
||
## What it checks | ||
|
||
The script performs the following checks and prints a detailed summary: | ||
|
||
### 1. kubectl Connectivity | ||
- Verifies that `kubectl` is installed and can connect to your Kubernetes cluster | ||
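
A minimal sketch of how such a connectivity check could be written, not the script's actual code. The function name `check_kubectl` and both failure messages are hypothetical; the success message is copied from the sample output in this README.

```bash
# Hypothetical helper: returns 0 when kubectl is installed and the cluster responds.
check_kubectl() {
  # `command -v` succeeds only if the shell can resolve a command named kubectl.
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "❌ kubectl is not installed or not in PATH"   # illustrative message
    return 1
  fi
  # `kubectl cluster-info` contacts the API server, so it fails when the
  # cluster is unreachable or the kubeconfig is invalid.
  if ! kubectl cluster-info >/dev/null 2>&1; then
    echo "❌ kubectl cannot reach the cluster"          # illustrative message
    return 1
  fi
  echo "✅ kubectl is available and cluster is accessible"
}
```

Calling `check_kubectl` first lets the later checks assume a reachable cluster.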
|
||
### 2. Default StorageClass | ||
- Verifies that a default StorageClass is configured in your cluster | ||
- If no default StorageClass is found: | ||
  - Lists all available StorageClasses in the cluster with full details | ||
  - Provides a sample command to set a StorageClass as default | ||
  - References the official Kubernetes documentation for detailed guidance | ||
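
The default-StorageClass check could look roughly like the sketch below. It assumes detection is done by parsing the `(default)` marker that `kubectl get storageclass` prints after the name, as seen in the sample output. The real script may inspect the annotation directly instead. The function names are hypothetical, and treating multiple defaults as a pass with a warning is also an assumption.

```bash
# Hypothetical helper: prints the names of StorageClasses that kubectl marks "(default)".
list_default_storageclasses() {
  # --no-headers drops the column header; a default class is listed as "<name> (default) ...".
  kubectl get storageclass --no-headers 2>/dev/null |
    awk '$2 == "(default)" { print $1 }'
}

# Hypothetical helper: returns 1 when no default exists, 0 otherwise.
check_default_storageclass() {
  defaults=$(list_default_storageclasses)
  # Count non-empty lines; "n + 0" makes awk print 0 when there are none.
  count=$(printf '%s\n' "$defaults" | awk 'NF { n++ } END { print n + 0 }')
  if [ "$count" -eq 0 ]; then
    echo "❌ No default StorageClass found"
    return 1
  fi
  if [ "$count" -gt 1 ]; then
    echo "⚠️ Warning: Multiple default StorageClasses detected"
  fi
  echo "✅ Default StorageClass found"
}
```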
|
||
### 3. Cluster GPU Resources | ||
- Checks for GPU-enabled nodes in the cluster using the label `nvidia.com/gpu.present=true` | ||
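
As an illustration, the GPU node check can be approximated with a label selector. The function names `count_gpu_nodes` and `check_gpu_nodes` and the failure message are hypothetical; the label and the success message come from this README.

```bash
# Hypothetical helper: prints how many nodes carry the GPU label.
count_gpu_nodes() {
  # -l filters nodes by label; --no-headers leaves one line per matching node.
  kubectl get nodes -l nvidia.com/gpu.present=true --no-headers 2>/dev/null |
    awk 'NF { n++ } END { print n + 0 }'
}

# Hypothetical helper: fails when the cluster has no labelled GPU nodes.
check_gpu_nodes() {
  gpu_nodes=$(count_gpu_nodes)
  if [ "$gpu_nodes" -eq 0 ]; then
    echo "❌ No gpu nodes found in the cluster"   # illustrative message
    return 1
  fi
  echo "✅ Found $gpu_nodes gpu node(s) in the cluster"
}
```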
|
||
## Sample Output | ||
|
||
### Complete Script Output Example: | ||
``` | ||
======================================== | ||
Dynamo Pre-Deployment Check Script | ||
======================================== | ||
--- Checking kubectl connectivity --- | ||
✅ kubectl is available and cluster is accessible | ||
--- Checking for default StorageClass --- | ||
❌ No default StorageClass found | ||
Dynamo requires a default StorageClass for persistent volume provisioning. | ||
Please configure a default StorageClass before proceeding with deployment. | ||
Available StorageClasses in your cluster: | ||
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE | ||
my-default-storage-class compute.csi.mock Delete WaitForFirstConsumer true 65d | ||
fast-ssd-storage kubernetes.io/gce-pd Delete Immediate true 30d | ||
To set a StorageClass as default, use the following command: | ||
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' | ||
Example with your first available StorageClass: | ||
kubectl patch storageclass my-default-storage-class -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' | ||
For more information on managing default StorageClasses, visit: | ||
https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/ | ||
--- Checking cluster gpu resources --- | ||
✅ Found 17 gpu node(s) in the cluster | ||
Node information: | ||
--- Pre-Deployment Check Summary --- | ||
✅ kubectl Connectivity: PASSED | ||
❌ Default StorageClass: FAILED | ||
✅ Cluster Resources: PASSED | ||
Summary: 2 passed, 1 failed | ||
❌ 1 pre-deployment check(s) failed. | ||
Please address the issues above before proceeding with deployment. | ||
``` | ||
|
||
### When all checks pass: | ||
``` | ||
======================================== | ||
Dynamo Pre-Deployment Check Script | ||
======================================== | ||
--- Checking kubectl connectivity --- | ||
✅ kubectl is available and cluster is accessible | ||
--- Checking for default StorageClass --- | ||
✅ Default StorageClass found | ||
- NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE | ||
my-default-storage-class (default) compute.csi.mock Delete WaitForFirstConsumer true 65d | ||
--- Checking cluster gpu resources --- | ||
✅ Found 17 gpu node(s) in the cluster | ||
Node information: | ||
--- Pre-Deployment Check Summary --- | ||
✅ kubectl Connectivity: PASSED | ||
✅ Default StorageClass: PASSED | ||
✅ Cluster Resources: PASSED | ||
Summary: 3 passed, 0 failed | ||
🎉 All pre-deployment checks passed! | ||
Your cluster is ready for Dynamo deployment. | ||
``` | ||
|
||
## Check Status Summary | ||
|
||
The script provides a comprehensive summary showing the status of each check: | ||
|
||
| Check Name | Description | Pass/Fail Status | | ||
|------------|-------------|------------------| | ||
| **kubectl Connectivity** | Verifies kubectl installation and cluster access | ✅ PASSED / ❌ FAILED | | ||
| **Default StorageClass** | Checks for default StorageClass annotation | ✅ PASSED / ❌ FAILED | | ||
| **Cluster Resources** | Validates GPU nodes availability | ✅ PASSED / ❌ FAILED | | ||
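
One way to produce a summary like this is to record each check's exit status and print the totals at the end. The sketch below is an assumption about the approach, not the script's implementation. `run_check` and `print_summary` are hypothetical names, and returning a non-zero status when any check fails is also assumed.

```bash
passed=0
failed=0
summary=""

# Hypothetical helper: run_check "<name>" <command...> records PASSED or FAILED
# according to the command's exit status.
run_check() {
  check_name=$1
  shift
  if "$@"; then
    passed=$((passed + 1))
    summary="${summary}✅ ${check_name}: PASSED
"
  else
    failed=$((failed + 1))
    summary="${summary}❌ ${check_name}: FAILED
"
  fi
}

# Hypothetical helper: prints the recorded results; returns non-zero if any check failed.
print_summary() {
  echo "--- Pre-Deployment Check Summary ---"
  printf '%s' "$summary"
  echo "Summary: $passed passed, $failed failed"
  [ "$failed" -eq 0 ]
}
```

Each check would be registered with a call such as `run_check "kubectl Connectivity" some_check_function`, followed by a single `print_summary` whose exit status can gate a deployment pipeline.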
|
||
## Setting a Default StorageClass | ||
|
||
If you need to set a default StorageClass, use the following command: | ||
|
||
```bash | ||
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}' | ||
``` | ||
|
||
Replace `<storage-class-name>` with the name of your desired StorageClass. | ||
|
||
## Troubleshooting | ||
|
||
### Multiple Default StorageClasses | ||
If you have multiple StorageClasses marked as default, the script will warn you: | ||
``` | ||
⚠️ Warning: Multiple default StorageClasses detected | ||
This may cause unpredictable behavior. Consider having only one default StorageClass. | ||
``` | ||
|
||
To remove the default annotation from a StorageClass: | ||
```bash | ||
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' | ||
``` | ||
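
When several StorageClasses are marked as default, it can help to generate the patch commands for all but the one you intend to keep. The helper below is a hypothetical convenience, not part of the script. It only prints the commands so they can be reviewed before being run.

```bash
# Hypothetical helper: print_unset_default_commands <keep> <name>...
# Prints (does not run) the patch command for every listed StorageClass except <keep>.
print_unset_default_commands() {
  keep=$1
  shift
  for name in "$@"; do
    if [ "$name" != "$keep" ]; then
      printf "kubectl patch storageclass %s -p '%s'\n" "$name" \
        '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
    fi
  done
}
```

For example, `print_unset_default_commands standard standard fast-ssd` prints only the command that removes the default marking from `fast-ssd`; both class names here are placeholders.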
|
||
### No GPU Nodes Found | ||
If no GPU nodes are found, ensure your cluster has nodes with the `nvidia.com/gpu.present=true` label. | ||
|
||
### No StorageClasses Available | ||
If no StorageClasses are available in your cluster, you'll need to: | ||
1. Install a storage provisioner (e.g., a cloud provider's CSI driver or a local-storage provisioner) | ||
2. Create appropriate StorageClass resources | ||
3. Mark one as default | ||
|
||
## Reference | ||
|
||
For more information on managing default StorageClasses, visit: | ||
[Kubernetes Documentation - Change the default StorageClass](https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/) |