This repo provides Terraform configuration to bring up an EKS Kubernetes cluster with the GPU Operator and GPU nodes from scratch.
This module was created and tested on Linux and macOS.
- VPC Network for EKS Cluster
- Subnets in VPC for EKS Cluster
- EKS Cluster
- 1x CPU nodepool
- 1x GPU nodepool
- Installs latest version of GPU Operator via Helm
- 1x KMS Key to encrypt cluster secrets
For more details on resources created and their default values, please see the Terraform Module Inputs section.
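For example, a minimal `terraform.tfvars` overriding a few of those defaults might look like the following sketch. All variable names come from the inputs table below; the values shown are placeholders.

```hcl
# terraform.tfvars -- example overrides (placeholder values)
cluster_name      = "my-gpu-cluster"   # required; no default
region            = "us-east-1"        # defaults to "us-west-2"
gpu_instance_type = "g5.2xlarge"       # defaults to "g6e.12xlarge"
min_gpu_nodes     = "1"
max_gpu_nodes     = "3"
```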
- Kubectl
- AWS CLI
  - You must run `aws configure` once on your machine to populate the default region present in `~/.aws/config`
  - The provisioning will fail without this step, as it is used to set up your Kubernetes configuration file after the cluster is provisioned
- AWS Account where you have permissions to create a cluster, IAM roles and networking
- [Terraform (CLI)](https://developer.hashicorp.com/terraform/downloads)
- JQ
- None. If you encounter any, please file a GitHub issue
This module assumes that you have a working Terraform binary and active AWS credentials (either admin access or finely scoped permissions covering basic EC2, EKS, VPC and IAM resource creation).
No Terraform backend is configured for remote state management, but one can be added. We strongly encourage you to configure remote state before running this module in production.
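As a sketch, remote state could be configured with a standard Terraform S3 backend similar to the following. The bucket, key, and table names are placeholders for resources you would create yourself.

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"   # placeholder: pre-existing S3 bucket
    key            = "nvidia-eks/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"        # placeholder: optional DynamoDB table for state locking
    encrypt        = true
  }
}
```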
- Clone the repo:

  ```sh
  git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
  cd nvidia-terraform-modules/eks
  ```

- Ensure you have active credentials set with the AWS CLI:

  ```sh
  aws configure
  ```

- Update `terraform.tfvars` to customize a parameter from its default value: uncomment the relevant line and change its content.

- Run the below command to initialize the Terraform configuration:

  ```sh
  terraform init
  ```

- Run the below command to see what will be applied:

  ```sh
  terraform plan -out tfplan
  ```

- Run the below command to apply the code against your AWS environment:

  ```sh
  terraform apply tfplan
  ```

- Connect to the cluster with `kubectl` by running the below command (with your cluster name and region) after the cluster is created:

  ```sh
  aws eks update-kubeconfig --name <eks-cluster-name> --region <eks-region>
  ```
- Run the below command to delete all remaining AWS resources created by this module. You should see a `Destroy complete!` message after a few minutes.

  ```sh
  terraform destroy --auto-approve
  ```
Call the EKS module by adding this to an existing Terraform file:
module "nvidia-eks" {
source = "git::github.com/nvidia/nvidia-terraform-modules/eks"
cluster_name = "nvidia-eks"
}In a production environment, we suggest pinning to a known tag of this Terraform module All configurable options for this module are listed below. If you need additional values added, please open a pull request.
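For example, pinning the module source to a tag uses Terraform's `?ref=` git source syntax; the tag shown here is a placeholder for a real release tag from the repository:

```hcl
module "nvidia-eks" {
  # "?ref=" pins the module to a specific git tag; "v1.0" is a placeholder tag
  source       = "git::https://github.com/NVIDIA/nvidia-terraform-modules.git//eks?ref=v1.0"
  cluster_name = "nvidia-eks"
}
```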
| Name | Version |
|---|---|
| terraform | >= 1.3.4 |
| aws | ~>5.93.0 |
| kubernetes | ~>2.19.0 |
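If you call this module from your own configuration, a `required_providers` block matching the versions above might look like the following sketch:

```hcl
terraform {
  required_version = ">= 1.3.4"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.93.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.19.0"
    }
  }
}
```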
| Name | Version |
|---|---|
| aws | ~>5.93.0 |
| helm | n/a |
| Name | Source | Version |
|---|---|---|
| ebs_csi_irsa_role | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks | n/a |
| eks | terraform-aws-modules/eks/aws | 18.29.0 |
| vpc | terraform-aws-modules/vpc/aws | 4.0.2 |
| Name | Type |
|---|---|
| helm_release.gpu_operator | resource |
| helm_release.nim_operator | resource |
| aws_ami.lookup | data source |
| aws_availability_zones.available | data source |
| aws_eks_cluster.eks | data source |
| aws_instances.nodes | data source |
| aws_region.current | data source |
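As a rough illustration of how the GPU Operator inputs feed the `helm_release.gpu_operator` resource, a simplified sketch (not the module's actual code) could look like this, using the NVIDIA NGC Helm repository and the variables from the inputs table:

```hcl
resource "helm_release" "gpu_operator" {
  # Hypothetical simplification: only deploy when install_gpu_operator is "true"
  count            = var.install_gpu_operator == "true" ? 1 : 0
  name             = "gpu-operator"
  repository       = "https://helm.ngc.nvidia.com/nvidia"
  chart            = "gpu-operator"
  version          = var.gpu_operator_version
  namespace        = var.gpu_operator_namespace
  create_namespace = true

  # Pin the NVIDIA driver version deployed by the operator
  set {
    name  = "driver.version"
    value = var.gpu_operator_driver_version
  }
}
```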
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| additional_node_security_groups_rules | List of additional security group rules to add to the node security group created | any | {} | no |
| additional_security_group_ids | List of additional security groups to add to nodes | list(any) | [] | no |
| additional_user_data | User data that is appended to the user data script after the EKS bootstrap script. | string | "" | no |
| aws_profile | n/a | string | "development" | no |
| cidr_block | CIDR for the VPC | string | "10.0.0.0/16" | no |
| cluster_name | n/a | string | n/a | yes |
| cluster_version | Version of EKS to install on the control plane (major and minor version only, do not include the patch) | string | "1.30" | no |
| cpu_instance_type | CPU EC2 worker node instance type | string | "t2.xlarge" | no |
| cpu_node_pool_additional_user_data | User data that is appended to the user data script after the EKS bootstrap script on the EKS-managed CPU node pool. | string | "" | no |
| cpu_node_pool_delete_on_termination | Delete the root filesystem of each node on instance termination. True by default; change this when using the 'local-storage provisioner' and keeping important application data on the nodes | bool | true | no |
| cpu_node_pool_root_disk_size_gb | The size of the root disk on all nodes in the EKS-managed CPU-only node pool. This is primarily for container image storage on the node | number | 512 | no |
| cpu_node_pool_root_volume_type | The type of disk to use for the CPU node pool root disk (e.g. gp2, gp3). Note, this is different from the type of disk used by applications via EKS storage classes/PVs & PVCs | string | "gp2" | no |
| desired_count_cpu_nodes | Desired number of CPU nodes in the Autoscaling Group | string | "1" | no |
| desired_count_gpu_nodes | Desired number of GPU nodes in the Autoscaling Group | string | "2" | no |
| enable_dns_hostnames | Whether or not the default VPC has DNS hostname support | bool | true | no |
| enable_dns_support | Whether or not the default VPC has DNS support | bool | true | no |
| enable_nat_gateway | Should be true if you want to provision NAT Gateways for each of your private networks | bool | true | no |
| existing_vpc_details | Variables used for re-using an existing VPC (vpc_id & subnet IDs) | object({...}) | null | no |
| gpu_ami_id | AMI ID of the EKS Ubuntu image corresponding to the region and version of the cluster. Not required, as we do a lookup for this image | string | "" | no |
| gpu_instance_type | GPU EC2 worker node instance type | string | "g6e.12xlarge" | no |
| gpu_node_pool_additional_user_data | User data that is appended to the user data script after the EKS bootstrap script on the EKS-managed GPU node pool. | string | "" | no |
| gpu_node_pool_delete_on_termination | Delete the root filesystem of each node on instance termination. True by default; change this when using the 'local-storage provisioner' and keeping important application data on the nodes | bool | true | no |
| gpu_node_pool_root_disk_size_gb | The size of the root disk on all GPU nodes in the EKS-managed GPU-only node pool. This is primarily for container image storage on the node | number | 512 | no |
| gpu_node_pool_root_volume_type | The type of disk to use for the GPU node pool root disk (e.g. gp2, gp3). Note, this is different from the type of disk used by applications via EKS storage classes/PVs & PVCs | string | "gp2" | no |
| gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator | string | "570.124.06" | no |
| gpu_operator_namespace | The namespace for the GPU Operator deployment | string | "gpu-operator" | no |
| gpu_operator_version | Version of the GPU Operator to deploy | string | "v25.3.0" | no |
| install_gpu_operator | Whether to install the GPU Operator | string | "true" | no |
| install_nim_operator | Whether to install the NIM Operator | string | "false" | no |
| max_cpu_nodes | Maximum number of CPU nodes in the Autoscaling Group | string | "2" | no |
| max_gpu_nodes | Maximum number of GPU nodes in the Autoscaling Group | string | "5" | no |
| min_cpu_nodes | Minimum number of CPU nodes in the Autoscaling Group | string | "0" | no |
| min_gpu_nodes | Minimum number of GPU nodes in the Autoscaling Group | string | "2" | no |
| nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
| nim_operator_version | Version of the NIM Operator to deploy | string | "v1.0.1" | no |
| private_subnets | List of private subnet ranges for the VPC | list(any) | [...] | no |
| public_subnets | List of public subnet ranges for the VPC | list(any) | [...] | no |
| region | AWS region to provision the Kubernetes cluster | string | "us-west-2" | no |
| single_nat_gateway | Should be true if you want to provision a single shared NAT Gateway across all of your private networks | bool | false | no |
| ssh_key | n/a | string | "" | no |
| Name | Description |
|---|---|
| cluster_ca_certificate | n/a |
| cluster_endpoint | n/a |
| cpu_node_role_name | IAM Node Role Name for CPU node pools |
| gpu_node_role_name | IAM Node Role Name for GPU node pools |
| kube_exec_api_version | n/a |
| kube_exec_args | n/a |
| kube_exec_command | n/a |
| nodes | n/a |
| oidc_endpoint | n/a |
| private_subnet_ids | n/a |
| public_subnet_ids | n/a |
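For example, these outputs could be used to configure a `kubernetes` provider in the calling configuration. The sketch below assumes the module is instantiated as `module "nvidia-eks"` and that `cluster_ca_certificate` is base64-encoded as returned by EKS:

```hcl
provider "kubernetes" {
  host                   = module.nvidia-eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.nvidia-eks.cluster_ca_certificate)  # drop base64decode() if the module already decodes it

  # Exec-based auth details exposed by the module (typically an AWS CLI token command)
  exec {
    api_version = module.nvidia-eks.kube_exec_api_version
    command     = module.nvidia-eks.kube_exec_command
    args        = module.nvidia-eks.kube_exec_args
  }
}
```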