@biswapanda biswapanda commented Oct 13, 2025

Overview:

This directory contains a pre-deployment check script that verifies whether a Kubernetes cluster meets the requirements for deploying Dynamo.

Includes #3584 and #3574.

Sample Successful check

========================================
  Dynamo Pre-Deployment Check Script  
========================================


--- Checking kubectl connectivity ---
✅ kubectl is available and cluster is accessible

--- Checking for default StorageClass ---
✅ Default StorageClass found
  - NAME                               PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
compute-csi-default-sc (default)   compute.csi.nebius.com   Delete          WaitForFirstConsumer   true                   66d

--- Checking cluster GPU resources ---
✅ Found 16 GPU node(s) in the cluster

--- Checking GPU operator ---
✅ GPU operator is running (1/1 pods)


--- Pre-Deployment Check Summary ---
✅ kubectl Connectivity: PASSED
✅ Default StorageClass: PASSED
✅ Cluster GPU Resources: PASSED
✅ GPU Operator: PASSED

Summary: 4 passed, 0 failed
🎉 All pre-deployment checks passed!
Your cluster is ready for Dynamo deployment.

Sample Failed check

========================================
  Dynamo Pre-Deployment Check Script  
========================================


--- Checking kubectl connectivity ---
✅ kubectl is available and cluster is accessible

--- Checking for default StorageClass ---
❌ No default StorageClass found

Dynamo requires a default StorageClass for persistent volume provisioning.
Please follow the instructions below to configure a default StorageClass before proceeding with deployment.

Available StorageClasses in your cluster:
NAME                               PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
compute-csi-default-sc (default)   compute.csi.nebius.com          Delete          WaitForFirstConsumer   true                   65d
csi-mounted-fs-path-sc             mounted-fs-path.csi.nebius.ai   Delete          WaitForFirstConsumer   false                  59d

To set a StorageClass as default, use the following command:
kubectl patch storageclass <storage-class-name> -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Example with your first available StorageClass:
kubectl patch storageclass compute-csi-default-sc -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

For more information on managing default StorageClasses, visit:
https://kubernetes.io/docs/tasks/administer-cluster/change-default-storage-class/

--- Checking cluster gpu resources ---
✅ Found 17 gpu node(s) in the cluster


--- Pre-Deployment Check Summary ---
✅ kubectl Connectivity: PASSED
❌ Default StorageClass: FAILED
✅ Cluster Resources: PASSED

Summary: 2 passed, 1 failed
❌ 1 pre-deployment check(s) failed.
Please address the issues above before proceeding with deployment.

Where should the reviewer start?

  • deploy/cloud/pre-deployment/pre-deployment-check.sh
  • deploy/cloud/pre-deployment/README.md

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

@biswapanda biswapanda requested a review from a team as a code owner October 13, 2025 02:46
@biswapanda biswapanda self-assigned this Oct 13, 2025
@github-actions github-actions bot added the feat label Oct 13, 2025
@biswapanda biswapanda enabled auto-merge (squash) October 13, 2025 02:46

coderabbitai bot commented Oct 13, 2025

Walkthrough

Introduces a Bash pre-deployment check script for Kubernetes-based Dynamo deployments and a README documenting its usage. The script validates kubectl connectivity, default StorageClass presence, and GPU-enabled nodes, aggregates results, prints a summary, and sets exit status based on failures.
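
As an illustration of the first check in that list, here is a minimal sketch of a kubectl connectivity test. The function name `check_kubectl_connectivity` and the failure messages are hypothetical and not taken from the PR; only the success message mirrors the sample output. It assumes that `kubectl cluster-info` returning successfully is a good enough signal that the cluster is reachable.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not the PR's actual implementation.
# Returns 0 when kubectl is installed and the cluster answers, 1 otherwise.
check_kubectl_connectivity() {
    # Is a kubectl command resolvable at all?
    if ! command -v kubectl >/dev/null 2>&1; then
        echo "kubectl not found in PATH"
        return 1
    fi
    # Can kubectl reach the API server with the current context?
    if ! kubectl cluster-info >/dev/null 2>&1; then
        echo "kubectl cannot reach the cluster"
        return 1
    fi
    echo "kubectl is available and cluster is accessible"
    return 0
}
```

Returning a status instead of exiting lets a caller keep running the remaining checks and decide the final exit code once, at the end.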

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Documentation: deploy/cloud/pre-deployment/README.md | Added README explaining the pre-deployment check script: usage, checks (kubectl, StorageClass, GPU nodes), sample outputs, status tables, troubleshooting, and reference link. |
| Pre-deployment Checker Script: deploy/cloud/pre-deployment/pre-deployment-check.sh | New Bash script implementing pre-deployment checks with colored output, ordered execution, result aggregation, guidance for StorageClass defaults, GPU node detection, and non-zero exit on failure. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant U as User
    participant S as pre-deployment-check.sh
    participant K as kubectl
    participant C as Kubernetes Cluster

    U->>S: Run script
    activate S
    Note over S: Initialize CHECK_ORDER and CHECK_RESULTS

    rect rgba(200,235,255,0.3)
    S->>K: kubectl version / cluster-info
    K->>C: Connect
    C-->>K: Response / Error
    K-->>S: Connectivity result
    Note over S: Record PASS/FAIL (kubectl connectivity)
    end

    rect rgba(220,255,220,0.3)
    S->>K: kubectl get storageclass -o json
    K->>C: Query StorageClasses
    C-->>K: SC list
    K-->>S: SC data
    Note over S: Check default SC presence / multiple defaults<br/>Provide commands if missing/multiple
    end

    rect rgba(255,235,200,0.3)
    S->>K: kubectl get nodes -L nvidia.com/gpu.present
    K->>C: Query nodes/labels
    C-->>K: Node list
    K-->>S: Labeled GPU node count
    Note over S: Record GPU availability
    end

    S-->>U: Print per-check results and summary
    S-->>U: Exit 0 if all PASS, else non-zero
    deactivate S
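
The StorageClass and GPU steps in the diagram can be sketched as two small counting helpers plus the kubectl queries that feed them. This is only a sketch under stated assumptions: all four function names are hypothetical, the PR's script may parse the output differently, and the GPU query assumes the `nvidia.com/gpu.present` label carries the value `true` on GPU nodes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not taken from the PR.

# Reads "name<TAB>is-default-class" lines on stdin and prints how many
# StorageClasses are annotated as default.
count_default_storageclasses() {
    awk -F '\t' '$2 == "true" { n++ } END { print n + 0 }'
}

# Reads one node name per line on stdin and prints the number of
# non-empty lines.
count_nodes() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Queries for a live cluster (defined here, not invoked).
list_storageclasses() {
    kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'
}

list_gpu_nodes() {
    # Assumption: GPU nodes carry the label nvidia.com/gpu.present=true.
    kubectl get nodes -l nvidia.com/gpu.present=true -o name
}
```

Against a live cluster, `list_storageclasses | count_default_storageclasses` would print the number of defaults, and `list_gpu_nodes | count_nodes` the number of GPU nodes. A result of 0 defaults means none is configured, and a result above 1 corresponds to the "multiple defaults" case noted in the diagram.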

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump my paws, the checks begin—
Kubes replies with cheeky grin.
StorageClass crowned? Or none to see?
GPUs blink: “Present? It’s me!”
With carrots cached and logs in tow,
I ship the bits—ears up—let’s go! 🥕🐇

Pre-merge checks

❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Description Check | ⚠️ Warning | The pull request description includes the Overview, reviewer entry points, and Related Issues sections but omits the required “#### Details” section from the repository’s template that should concisely describe the actual changes introduced by the PR. Without this section, readers lack a clear, dedicated summary of what was modified, added, or removed. Adding the Details section will ensure the description fully matches the template structure and clearly communicates the scope of the new pre-deployment check script and README updates. | Please add a “#### Details” section that outlines the specific modifications made in this PR, including the new Bash pre-deployment check script, README documentation updates, and any usage or troubleshooting guidance. Ensuring this section is present will align the description with the repository template and provide reviewers with the necessary context for the changes. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title “feat: add pre-deployment checks” is directly aligned with the primary change of introducing a pre-deployment checking script and accompanying README, clearly summarizing the new feature without unnecessary detail or ambiguity. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
deploy/cloud/pre-deployment/README.md (1)

33-103: Add language identifiers to fenced blocks.

markdownlint is flagging the output/tips code fences because they lack language specifiers (MD040). Please tag them with something like text or console so the doc passes lint.

Also applies to: 129-137

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90dc758 and 97a6c5f.

📒 Files selected for processing (2)
  • deploy/cloud/pre-deployment/README.md (1 hunks)
  • deploy/cloud/pre-deployment/pre-deployment-check.sh (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Copyright Checks
deploy/cloud/pre-deployment/pre-deployment-check.sh

[error] 1-1: Copyright check failed: Invalid/Missing Header detected in deploy/cloud/pre-deployment/pre-deployment-check.sh.

🪛 markdownlint-cli2 (0.18.1)
deploy/cloud/pre-deployment/README.md

33-33: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


76-76: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


129-129: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
deploy/cloud/pre-deployment/pre-deployment-check.sh

[warning] 97-97: provisioner appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo

@biswapanda biswapanda changed the title feat: add pre-deployment check for storageclass feat: add pre-deployment checks Oct 13, 2025
@biswapanda

/ok to test 9f5edfc

@biswapanda

/ok to test 969fec8

@mohammedabdulwahhab mohammedabdulwahhab left a comment

LGTM. It would be interesting to run this as a pre-install hook for dynamo cloud, so that customers have it in the critical path.

@biswapanda

/ok to test 95fa843

@biswapanda

/ok to test 95fa843

}

# Global variables to track check results
declare -A CHECK_RESULTS

I tried running on my local mac against the Nebius cluster. I immediately hit:

./deploy/cloud/pre-deployment/pre-deployment-check.sh: line 180: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]

Cursor tells me:

Ah! The issue is that macOS ships with bash 3.2 by default, which doesn't support associative arrays (declare -A). Associative arrays were introduced in bash 4.0+.

Things work if I just use parallel indexed arrays (declare -a) and refactor every place where CHECK_RESULTS is used.
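
A minimal sketch of that workaround, assuming the script only needs to record a status per check and print them in order. The names `CHECK_NAMES`, `CHECK_STATUSES`, `record_check`, and `print_summary` are hypothetical; the PR's script uses `CHECK_ORDER` and `CHECK_RESULTS`, and its refactor may look different. Two indexed arrays kept in lockstep replace the associative array, so nothing here needs bash 4.

```bash
#!/usr/bin/env bash
# Hypothetical bash 3.2-compatible result tracking (no `declare -A`).
# Entry i of CHECK_NAMES pairs with entry i of CHECK_STATUSES.
CHECK_NAMES=()
CHECK_STATUSES=()

# record_check <name> <PASS|FAIL>
record_check() {
    CHECK_NAMES+=("$1")
    CHECK_STATUSES+=("$2")
}

# Prints one line per check in insertion order, then a summary line.
# Returns 0 only if no check failed.
print_summary() {
    local passed=0 failed=0 i
    for ((i = 0; i < ${#CHECK_NAMES[@]}; i++)); do
        if [ "${CHECK_STATUSES[$i]}" = "PASS" ]; then
            echo "${CHECK_NAMES[$i]}: PASSED"
            passed=$((passed + 1))
        else
            echo "${CHECK_NAMES[$i]}: FAILED"
            failed=$((failed + 1))
        fi
    done
    echo "Summary: ${passed} passed, ${failed} failed"
    [ "$failed" -eq 0 ]
}
```

For example, calling `record_check "Default StorageClass" FAIL` and then `print_summary || exit 1` at the end of the script would preserve the non-zero exit on failure. Insertion order also comes for free with indexed arrays, so a separate ordering list is not strictly needed in this sketch.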
