This repo provides Terraform configuration to bring up an EKS Kubernetes cluster with the GPU Operator and GPU nodes from scratch.
This module was created and tested on Linux and macOS.
- VPC Network for EKS Cluster
- Subnets in VPC for EKS Cluster
- EKS Cluster
- 1x CPU nodepool
- 1x GPU nodepool
- Installs latest version of GPU Operator via Helm
- 1x KMS Key to encrypt cluster secrets
For more details on resources created and their default values, please see the Terraform Module Inputs section.
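For example, a minimal `terraform.tfvars` overriding a few of those defaults might look like the following sketch. All variable names come from the inputs table below; the values shown are placeholders.

```hcl
# terraform.tfvars -- example overrides (placeholder values)
cluster_name      = "my-gpu-cluster"   # required; no default
region            = "us-east-1"        # defaults to "us-west-2"
gpu_instance_type = "g5.2xlarge"       # defaults to "g6e.12xlarge"
min_gpu_nodes     = "1"
max_gpu_nodes     = "3"
```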
- Kubectl
- AWS CLI
  - You must run `aws configure` once on your machine to populate the default region present in `~/.aws/config`
  - The provisioning will fail without this step, as it is used to set up your Kubernetes configuration file after the cluster is provisioned
- AWS Account where you have permissions to create a cluster, IAM roles and networking
- [Terraform (CLI)](https://developer.hashicorp.com/terraform/downloads)
- JQ
- None. If you encounter any, please file a GitHub issue
This module assumes that you have a working Terraform binary and active AWS credentials (either admin access or finely scoped permissions covering basic EC2, EKS, VPC and IAM resource creation).
No Terraform backend is configured for remote state management, but one can be added. We strongly encourage you to configure remote state before running this module in production.
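As a sketch, remote state could be configured with a standard Terraform S3 backend similar to the following. The bucket, key, and table names are placeholders for resources you would create yourself.

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"   # placeholder: pre-existing S3 bucket
    key            = "nvidia-eks/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"        # placeholder: optional DynamoDB table for state locking
    encrypt        = true
  }
}
```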
- Clone the repo:

  ```sh
  git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
  cd nvidia-terraform-modules/eks
  ```

- Ensure you have active credentials set with the AWS CLI:

  ```sh
  aws configure
  ```

- Update `terraform.tfvars` to customize a parameter from its default value: uncomment the relevant line and change its content.

- Run the below command to initialize the Terraform configuration:

  ```sh
  terraform init
  ```

- Run the below command to see what will be applied:

  ```sh
  terraform plan -out tfplan
  ```

- Run the below command to apply the code against your AWS environment:

  ```sh
  terraform apply tfplan
  ```

- Connect to the cluster with `kubectl` by running the below command (with your cluster name and region) after the cluster is created:

  ```sh
  aws eks update-kubeconfig --name <eks-cluster-name> --region <eks-region>
  ```
- Run the below command to delete all remaining AWS resources created by this module. You should see a `Destroy complete!` message after a few minutes.

  ```sh
  terraform destroy --auto-approve
  ```
Call the EKS module by adding this to an existing Terraform file:
module "nvidia-eks" {
source = "git::github.com/nvidia/nvidia-terraform-modules/eks"
cluster_name = "nvidia-eks"
}In a production environment, we suggest pinning to a known tag of this Terraform module All configurable options for this module are listed below. If you need additional values added, please open a pull request.
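For example, pinning the module source to a tag uses Terraform's `?ref=` git source syntax; the tag shown here is a placeholder for a real release tag from the repository:

```hcl
module "nvidia-eks" {
  # "?ref=" pins the module to a specific git tag; "v1.0" is a placeholder tag
  source       = "git::https://github.com/NVIDIA/nvidia-terraform-modules.git//eks?ref=v1.0"
  cluster_name = "nvidia-eks"
}
```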
| Name | Version |
|---|---|
| terraform | >= 1.3.4 |
| aws | ~>5.93.0 |
| kubernetes | ~>2.19.0 |
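If you call this module from your own configuration, a `required_providers` block matching the versions above might look like the following sketch:

```hcl
terraform {
  required_version = ">= 1.3.4"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.93.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.19.0"
    }
  }
}
```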
| Name | Version |
|---|---|
| aws | ~>5.93.0 |
| helm | n/a |
| Name | Source | Version |
|---|---|---|
| ebs_csi_irsa_role | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks | n/a |
| eks | terraform-aws-modules/eks/aws | 18.29.0 |
| vpc | terraform-aws-modules/vpc/aws | 4.0.2 |
| Name | Type |
|---|---|
| helm_release.gpu_operator | resource |
| helm_release.nim_operator | resource |
| aws_ami.lookup | data source |
| aws_availability_zones.available | data source |
| aws_eks_cluster.eks | data source |
| aws_instances.nodes | data source |
| aws_region.current | data source |
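As a rough illustration of how the GPU Operator inputs feed the `helm_release.gpu_operator` resource, a simplified sketch (not the module's actual code) could look like this, using the NVIDIA NGC Helm repository and the variables from the inputs table:

```hcl
resource "helm_release" "gpu_operator" {
  # Hypothetical simplification: only deploy when install_gpu_operator is "true"
  count            = var.install_gpu_operator == "true" ? 1 : 0
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia"
  chart            = "gpu-operator"
  version          = var.gpu_operator_version
  namespace        = var.gpu_operator_namespace
  create_namespace = true

  # Pin the NVIDIA driver version deployed by the operator
  set {
    name  = "driver.version"
    value = var.gpu_operator_driver_version
  }
}
```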
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| additional_node_security_groups_rules | List of additional security group rules to add to the node security group created | any | {} | no |
| additional_security_group_ids | List of additional security groups to add to nodes | list(any) | [] | no |
| additional_user_data | User data that is appended to the user data script after the EKS bootstrap script. | string | "" | no |
| aws_profile | n/a | string | "development" | no |
| cidr_block | CIDR for the VPC | string | "10.0.0.0/16" | no |
| cluster_name | n/a | string | n/a | yes |
| cluster_version | Version of EKS to install on the control plane (major and minor version only, do not include the patch) | string | "1.30" | no |
| cpu_instance_type | CPU EC2 worker node instance type | string | "t2.xlarge" | no |
| cpu_node_pool_additional_user_data | User data that is appended to the user data script after the EKS bootstrap script on the EKS-managed CPU node pool. | string | "" | no |
| cpu_node_pool_delete_on_termination | Delete the root filesystem of each node on instance termination. True by default; change this when using the 'local-storage provisioner' and keeping important application data on the nodes | bool | true | no |
| cpu_node_pool_root_disk_size_gb | The size of the root disk on all nodes in the EKS-managed CPU-only node pool. This is primarily for container image storage on the node | number | 512 | no |
| cpu_node_pool_root_volume_type | The type of disk to use for the CPU node pool root disk (e.g. gp2, gp3). Note, this is different from the type of disk used by applications via EKS storage classes/PVs & PVCs | string | "gp2" | no |
| desired_count_cpu_nodes | Desired number of CPU nodes in the Autoscaling Group | string | "1" | no |
| desired_count_gpu_nodes | Desired number of GPU nodes in the Autoscaling Group | string | "2" | no |
| enable_dns_hostnames | Whether or not the default VPC has DNS hostname support | bool | true | no |
| enable_dns_support | Whether or not the default VPC has DNS support | bool | true | no |
| enable_nat_gateway | Should be true if you want to provision NAT Gateways for each of your private networks | bool | true | no |
| existing_vpc_details | Variables used for re-using an existing VPC (vpc_id & subnet IDs) | object({...}) | null | no |
| gpu_ami_id | AMI ID of the EKS Ubuntu image corresponding to the region and version of the cluster. Not required, as we do a lookup for this image | string | "" | no |
| gpu_instance_type | GPU EC2 worker node instance type | string | "g6e.12xlarge" | no |
| gpu_node_pool_additional_user_data | User data that is appended to the user data script after the EKS bootstrap script on the EKS-managed GPU node pool. | string | "" | no |
| gpu_node_pool_delete_on_termination | Delete the root filesystem of each node on instance termination. True by default; change this when using the 'local-storage provisioner' and keeping important application data on the nodes | bool | true | no |
| gpu_node_pool_root_disk_size_gb | The size of the root disk on all GPU nodes in the EKS-managed GPU-only node pool. This is primarily for container image storage on the node | number | 512 | no |
| gpu_node_pool_root_volume_type | The type of disk to use for the GPU node pool root disk (e.g. gp2, gp3). Note, this is different from the type of disk used by applications via EKS storage classes/PVs & PVCs | string | "gp2" | no |
| gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator | string | "570.124.06" | no |
| gpu_operator_namespace | The namespace for the GPU Operator deployment | string | "gpu-operator" | no |
| gpu_operator_version | Version of the GPU Operator to deploy | string | "v25.3.0" | no |
| install_gpu_operator | Whether to install the GPU Operator | string | "true" | no |
| install_nim_operator | Whether to install the NIM Operator | string | "false" | no |
| max_cpu_nodes | Maximum number of CPU nodes in the Autoscaling Group | string | "2" | no |
| max_gpu_nodes | Maximum number of GPU nodes in the Autoscaling Group | string | "5" | no |
| min_cpu_nodes | Minimum number of CPU nodes in the Autoscaling Group | string | "0" | no |
| min_gpu_nodes | Minimum number of GPU nodes in the Autoscaling Group | string | "2" | no |
| nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
| nim_operator_version | Version of the NIM Operator to deploy | string | "v1.0.1" | no |
| private_subnets | List of private subnet ranges for the VPC | list(any) | [...] | no |
| public_subnets | List of public subnet ranges for the VPC | list(any) | [...] | no |
| region | AWS region to provision the Kubernetes cluster | string | "us-west-2" | no |
| single_nat_gateway | Should be true if you want to provision a single shared NAT Gateway across all of your private networks | bool | false | no |
| ssh_key | n/a | string | "" | no |
| Name | Description |
|---|---|
| cluster_ca_certificate | n/a |
| cluster_endpoint | n/a |
| cpu_node_role_name | IAM Node Role Name for CPU node pools |
| gpu_node_role_name | IAM Node Role Name for GPU node pools |
| kube_exec_api_version | n/a |
| kube_exec_args | n/a |
| kube_exec_command | n/a |
| nodes | n/a |
| oidc_endpoint | n/a |
| private_subnet_ids | n/a |
| public_subnet_ids | n/a |
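For example, these outputs could be used to configure a `kubernetes` provider in the calling configuration. The sketch below assumes the module is instantiated as `module "nvidia-eks"` and that `cluster_ca_certificate` is base64-encoded as returned by EKS:

```hcl
provider "kubernetes" {
  host                   = module.nvidia-eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.nvidia-eks.cluster_ca_certificate)  # drop base64decode() if the module already decodes it

  # Exec-based auth details exposed by the module (typically an AWS CLI token command)
  exec {
    api_version = module.nvidia-eks.kube_exec_api_version
    command     = module.nvidia-eks.kube_exec_command
    args        = module.nvidia-eks.kube_exec_args
  }
}
```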