This repository automates the deployment of a private Azure Kubernetes Service (AKS) cluster with NVIDIA H100 GPUs, using Bicep and Helm. It also installs the NVIDIA GPU Operator for GPU workload support.
- Deploys a private AKS cluster using Bicep (`main.bicep`)
- Sets up an Azure Container Registry (ACR)
- Installs the NVIDIA GPU Operator using Helm
- Verifies GPU access with a test pod running `nvidia-smi`
Make sure the following tools are installed: the Azure CLI (`az`), `kubectl`, Helm, and the GitHub CLI (`gh`). Then make the deployment script executable and run it:

```bash
chmod +x run.sh
./run.sh
```
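The script assumes an authenticated Azure CLI session; a minimal sketch, with the subscription ID left for you to fill in:

```bash
# Authenticate and select the target subscription before running run.sh.
az login
az account set --subscription "<your-subscription-id>"
```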
Update the variables inside `run.sh` to match your environment:
```bash
RESOURCE_GROUP="rg-pvt-aks-h100"
LOCATION="eastus2"
PARAM_REGISTRY_NAME="gbbpvt"
PARAM_CLUSTER_NAME="pvt-aks-h100"
```
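For reference, here is a minimal sketch of the manual steps the script wraps — deploying `main.bicep` and installing the GPU Operator via Helm. The Bicep parameter names `registryName` and `clusterName` are assumptions; check `main.bicep` for the actual ones:

```bash
# Deploy the infrastructure defined in main.bicep into the resource group.
# Parameter names below are assumptions; match them to main.bicep.
az deployment group create \
  --resource-group "$RESOURCE_GROUP" \
  --template-file main.bicep \
  --parameters registryName="$PARAM_REGISTRY_NAME" clusterName="$PARAM_CLUSTER_NAME"

# Install the NVIDIA GPU Operator from NVIDIA's Helm repository.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```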
- `main.bicep` – Main Bicep template for deploying infrastructure
- `run.sh` – Script to deploy AKS and install the GPU Operator
- `pod-check-nvidia-smi.yml` – Pod spec to verify GPU access
| Step | File | Description |
|------|------|-------------|
| 1 | `.github/workflows/deploy.yml` | Deploy infrastructure |
| 2 | `.github/workflows/attach-aks-to-acr.yml` | Attach AKS to ACR using managed identity |
| 3 | `.github/workflows/nvidia-gpu-operator.yml` | Install NVIDIA GPU Operator |
| 4 | `.github/workflows/test.yml` | Run GPU test using `nvidia-smi` |
| 9999 | `.github/workflows/delete-resources.yml` | Delete deployed resources |
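The workflows can be triggered from the GitHub UI, or from the GitHub CLI in the order above — assuming each workflow defines a `workflow_dispatch` trigger, which `gh workflow run` requires:

```bash
# Run the workflows in order; wait for each to finish before the next.
gh workflow run deploy.yml
gh workflow run attach-aks-to-acr.yml
gh workflow run nvidia-gpu-operator.yml
gh workflow run test.yml
```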
Fork this repo to your GitHub account.
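If you prefer the CLI, the fork and clone can be done in one step:

```bash
# Fork the upstream repository into your account and clone the fork.
gh repo fork appdevgbb/pattern-private-aks-gpu --clone
```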
```bash
MI_NAME="github-actions-identity"
RESOURCE_GROUP_MI="rg-github-actions-identity"
LOCATION="eastus2"
REGISTRY_NAME="gbbpvt"

az group create --name "$RESOURCE_GROUP_MI" --location "$LOCATION"

az identity create \
  --name "$MI_NAME" \
  --resource-group "$RESOURCE_GROUP_MI" \
  --location "$LOCATION"
```
```bash
RESOURCE_GROUP="rg-pvt-aks-h100"
az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
```
```bash
CLIENT_ID=$(az identity show -g "$RESOURCE_GROUP_MI" -n "$MI_NAME" --query clientId -o tsv)
SUBSCRIPTION_ID=$(az account show --query id -o tsv)
TENANT_ID=$(az account show --query tenantId -o tsv)
MI_PRINCIPAL_ID=$(az identity show -g "$RESOURCE_GROUP_MI" -n "$MI_NAME" --query principalId -o tsv)
```
```bash
az role assignment create \
  --assignee-object-id "$MI_PRINCIPAL_ID" \
  --role Contributor \
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP"
```
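Role assignments can take a minute or two to propagate; a quick way to confirm the Contributor grant landed:

```bash
# List role assignments for the identity at the resource group scope.
az role assignment list \
  --assignee "$MI_PRINCIPAL_ID" \
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP" \
  -o table
```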
```bash
GITHUB_ORG="appdevgbb"
REPO="pattern-private-aks-gpu"

az identity federated-credential create \
  --name github-actions \
  --identity-name "$MI_NAME" \
  --resource-group "$RESOURCE_GROUP_MI" \
  --issuer "https://token.actions.githubusercontent.com" \
  --subject "repo:$GITHUB_ORG/$REPO:ref:refs/heads/main" \
  --audiences "api://AzureADTokenExchange"
```
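The OIDC token exchange fails if the subject doesn't match your repo and branch exactly, so it's worth confirming the credential was recorded as intended:

```bash
# Verify the subject reads repo:<org>/<repo>:ref:refs/heads/main.
az identity federated-credential list \
  --identity-name "$MI_NAME" \
  --resource-group "$RESOURCE_GROUP_MI" \
  -o table
```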
Go to your GitHub repo: **Settings → Secrets and variables → Actions → New repository secret**

Create these secrets:

- `AZURE_CLIENT_ID`
- `AZURE_TENANT_ID`
- `AZURE_SUBSCRIPTION_ID`
Use the values from Step 3.
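If you still have the shell variables from Step 3, the same secrets can be set with the GitHub CLI instead of the web UI:

```bash
# Push the identity's details into the fork's Actions secrets.
gh secret set AZURE_CLIENT_ID --body "$CLIENT_ID"
gh secret set AZURE_TENANT_ID --body "$TENANT_ID"
gh secret set AZURE_SUBSCRIPTION_ID --body "$SUBSCRIPTION_ID"
```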
After deployment, grant the GitHub Actions identity permission to assign roles on the container registry, so the attach workflow can give the AKS kubelet pull access:
```bash
ACR_ID=$(az acr show -n "$REGISTRY_NAME" -g "$RESOURCE_GROUP" --query id -o tsv)

az role assignment create \
  --assignee-object-id "$MI_PRINCIPAL_ID" \
  --role "User Access Administrator" \
  --scope "$ACR_ID"
```
Then run the `.github/workflows/attach-aks-to-acr.yml` workflow:
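For example, via the GitHub CLI; the commented-out line shows what is presumably the manual equivalent, since `az aks update --attach-acr` grants the kubelet the same `AcrPull` role:

```bash
# Trigger the attach workflow on the default branch.
gh workflow run attach-aks-to-acr.yml

# Presumed manual equivalent, bypassing the workflow:
# az aks update -g "$RESOURCE_GROUP" -n "$PARAM_CLUSTER_NAME" --attach-acr "$REGISTRY_NAME"
```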
Once everything is deployed:

- Check GPU readiness by applying `pod-check-nvidia-smi.yml`
- Use `kubectl logs` to verify the output of `nvidia-smi`
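A sketch of both steps, assuming the pod in the manifest is named `check-nvidia-smi` — check `pod-check-nvidia-smi.yml` for the actual name:

```bash
# Schedule the test pod on a GPU node and wait for it to finish.
kubectl apply -f pod-check-nvidia-smi.yml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
  pod/check-nvidia-smi --timeout=300s

# The log should show the H100 in the nvidia-smi device table.
kubectl logs check-nvidia-smi
```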
To clean up resources:

```bash
gh workflow run delete-resources.yml
```

Or manually delete them using the Azure CLI or portal.
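The manual route, using the resource group names from the steps above:

```bash
# Delete the workload and identity resource groups and everything in them.
az group delete --name "$RESOURCE_GROUP" --yes --no-wait
az group delete --name "$RESOURCE_GROUP_MI" --yes --no-wait
```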
MIT License. See LICENSE for details.