
GPU-Ready Kubernetes on Amazon EKS

This repo provides Terraform configuration to bring up an Amazon EKS Kubernetes cluster from scratch, with GPU nodes and the NVIDIA GPU Operator installed.

Tested on

This module was created and tested on Linux and macOS.

Resources Created

  • VPC network for the EKS cluster
  • Subnets in the VPC for the EKS cluster
  • EKS cluster
  • 1x CPU node pool
  • 1x GPU node pool
  • Latest version of the NVIDIA GPU Operator, installed via Helm
  • 1x KMS key to encrypt cluster secrets

For more details on the resources created and their default values, please see the Inputs section below.

Prerequisites

  1. Kubectl
  2. AWS CLI
    • You must run aws configure once on your machine to populate the default region in ~/.aws/config
  3. An AWS account where you have permissions to create a cluster, IAM roles, and networking
  4. [Terraform CLI](https://developer.hashicorp.com/terraform/downloads)
  5. jq
    • Provisioning will fail without jq, as it is used to set up your Kubernetes configuration file after the cluster is provisioned

Issues

  • None. If you encounter any, please file a GitHub issue

Usage

This module assumes that you have a working terraform binary and active AWS credentials (either admin access or a finely scoped role with EC2, EKS, VPC, and IAM creation permissions).

No Terraform backend is configured for remote state management, but one can be added. We strongly encourage you to configure remote state before running in production.
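For example, a minimal S3 backend block could be added to your configuration; the bucket, key, and DynamoDB table names below are placeholders and must already exist (or be created separately):

    terraform {
      backend "s3" {
        bucket         = "my-terraform-state-bucket"    # placeholder: an existing S3 bucket you control
        key            = "nvidia-eks/terraform.tfstate" # placeholder path for the state file
        region         = "us-west-2"
        encrypt        = true
        dynamodb_table = "terraform-state-lock"         # placeholder: optional table for state locking
      }
    }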

  1. Clone the repo

    git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
    
    cd nvidia-terraform-modules/eks
    
  2. Ensure you have active credentials set with the AWS CLI.

    aws configure
    
  3. Update terraform.tfvars to customize parameters: to change a parameter from its default value, uncomment the corresponding line and edit its content, for example as shown below
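    A terraform.tfvars overriding a few defaults might look like this (values shown are illustrative, not recommendations):

    cluster_name      = "nvidia-eks"
    region            = "us-west-2"
    gpu_instance_type = "g6e.12xlarge"
    min_gpu_nodes     = "2"
    max_gpu_nodes     = "5"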

  4. Run the below command to initialize the Terraform configuration

    terraform init
    
  5. Run the below command to see what will be applied

    terraform plan -out tfplan
    
  6. Run the below command to apply the code against your AWS environment

    terraform apply tfplan
    
  7. After the cluster is created, connect to it with kubectl by running the below command with your cluster name and region

    aws eks update-kubeconfig --name <eks-cluster-name> --region <eks-region>
    

Cleaning up / Deleting resources

  1. Run the below command to delete all remaining AWS resources created by this module. You should see a Destroy complete! message after a few minutes.

    terraform destroy --auto-approve
    

Running as a module

Call the EKS module by adding this to an existing Terraform file:

module "nvidia-eks" {
  source       = "git::github.com/nvidia/nvidia-terraform-modules/eks" 
  cluster_name = "nvidia-eks"
}

In a production environment, we suggest pinning the module source to a known tag of this Terraform module; an example is shown below. All configurable options for this module are listed in the Inputs section. If you need additional values added, please open a pull request.
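For example, pinning the source to a release tag (the tag below is a placeholder; substitute a real tag from this repository):

module "nvidia-eks" {
  # "v1.0.0" is a placeholder; check the repository for published tags
  source       = "git::https://github.com/NVIDIA/nvidia-terraform-modules.git//eks?ref=v1.0.0"
  cluster_name = "nvidia-eks"
}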

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.4 |
| aws | ~>5.93.0 |
| kubernetes | ~>2.19.0 |

Providers

| Name | Version |
|------|---------|
| aws | ~>5.93.0 |
| helm | n/a |

Modules

| Name | Source | Version |
|------|--------|---------|
| ebs_csi_irsa_role | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks | n/a |
| eks | terraform-aws-modules/eks/aws | 18.29.0 |
| vpc | terraform-aws-modules/vpc/aws | 4.0.2 |

Resources

| Name | Type |
|------|------|
| helm_release.gpu_operator | resource |
| helm_release.nim_operator | resource |
| aws_ami.lookup | data source |
| aws_availability_zones.available | data source |
| aws_eks_cluster.eks | data source |
| aws_instances.nodes | data source |
| aws_region.current | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| additional_node_security_groups_rules | List of additional security group rules to add to the node security group created | any | {} | no |
| additional_security_group_ids | List of additional security groups to add to nodes | list(any) | [] | no |
| additional_user_data | User data that is appended to the user data script after the EKS bootstrap script. | string | "" | no |
| aws_profile | n/a | string | "development" | no |
| cidr_block | CIDR for VPC | string | "10.0.0.0/16" | no |
| cluster_name | n/a | string | n/a | yes |
| cluster_version | Version of EKS to install on the control plane (major and minor version only, do not include the patch) | string | "1.30" | no |
| cpu_instance_type | CPU EC2 worker node instance type | string | "t2.xlarge" | no |
| cpu_node_pool_additional_user_data | User data that is appended to the user data script after the EKS bootstrap script on the EKS-managed CPU node pool. | string | "" | no |
| cpu_node_pool_delete_on_termination | Delete each node's root filesystem on termination. This is set to true by default, but can be changed when using the 'local-storage provisioner' and keeping important application data on the nodes | bool | true | no |
| cpu_node_pool_root_disk_size_gb | The size of the root disk on all nodes in the EKS-managed CPU-only node pool. This is primarily for container image storage on the node | number | 512 | no |
| cpu_node_pool_root_volume_type | The type of disk to use for the CPU node pool root disk (e.g. gp2, gp3). Note, this is different from the type of disk used by applications via EKS storage classes/PVs & PVCs | string | "gp2" | no |
| desired_count_cpu_nodes | Desired number of CPU nodes in the Autoscaling Group | string | "1" | no |
| desired_count_gpu_nodes | Desired number of GPU nodes in the Autoscaling Group | string | "2" | no |
| enable_dns_hostnames | Whether or not the default VPC has DNS hostname support | bool | true | no |
| enable_dns_support | Whether or not the default VPC has DNS support | bool | true | no |
| enable_nat_gateway | Should be true if you want to provision NAT Gateways for each of your private networks | bool | true | no |
| existing_vpc_details | Variables used for reusing an existing VPC, supplying vpc_id & subnet_ids | object({ vpc_id = string, subnet_ids = list(string) }) | null | no |
| gpu_ami_id | AMI ID of the EKS Ubuntu image corresponding to the region and version of the cluster. Not required, as we do a lookup for this image | string | "" | no |
| gpu_instance_type | GPU EC2 worker node instance type | string | "g6e.12xlarge" | no |
| gpu_node_pool_additional_user_data | User data that is appended to the user data script after the EKS bootstrap script on the EKS-managed GPU node pool. | string | "" | no |
| gpu_node_pool_delete_on_termination | Delete each node's root filesystem on termination. This is set to true by default, but can be changed when using the 'local-storage provisioner' and keeping important application data on the nodes | bool | true | no |
| gpu_node_pool_root_disk_size_gb | The size of the root disk on all GPU nodes in the EKS-managed GPU-only node pool. This is primarily for container image storage on the node | number | 512 | no |
| gpu_node_pool_root_volume_type | The type of disk to use for the GPU node pool root disk (e.g. gp2, gp3). Note, this is different from the type of disk used by applications via EKS storage classes/PVs & PVCs | string | "gp2" | no |
| gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator. Defaults to latest available. | string | "570.124.06" | no |
| gpu_operator_namespace | The namespace for the GPU Operator deployment | string | "gpu-operator" | no |
| gpu_operator_version | Version of the GPU Operator to deploy. Defaults to latest available. | string | "v25.3.0" | no |
| install_gpu_operator | Whether to install the GPU Operator | string | "true" | no |
| install_nim_operator | Whether to install the NIM Operator | string | "false" | no |
| max_cpu_nodes | Maximum number of CPU nodes in the Autoscaling Group | string | "2" | no |
| max_gpu_nodes | Maximum number of GPU nodes in the Autoscaling Group | string | "5" | no |
| min_cpu_nodes | Minimum number of CPU nodes in the Autoscaling Group | string | "0" | no |
| min_gpu_nodes | Minimum number of GPU nodes in the Autoscaling Group | string | "2" | no |
| nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
| nim_operator_version | Version of the NIM Operator to deploy. Defaults to latest available. | string | "v1.0.1" | no |
| private_subnets | List of subnet ranges for the private subnets in the VPC | list(any) | ["10.0.0.0/19", "10.0.32.0/19", "10.0.64.0/19"] | no |
| public_subnets | List of subnet ranges for the public subnets in the VPC | list(any) | ["10.0.96.0/19", "10.0.128.0/19", "10.0.160.0/19"] | no |
| region | AWS region to provision the Kubernetes Cluster | string | "us-west-2" | no |
| single_nat_gateway | Should be true if you want to provision a single shared NAT Gateway across all of your private networks | bool | false | no |
| ssh_key | n/a | string | "" | no |
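As an illustration of the existing_vpc_details object above, a terraform.tfvars entry for reusing an existing VPC might look like this (IDs are placeholders):

existing_vpc_details = {
  vpc_id     = "vpc-0123456789abcdef0"                                    # placeholder VPC ID
  subnet_ids = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]   # placeholder subnet IDs
}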

Outputs

| Name | Description |
|------|-------------|
| cluster_ca_certificate | n/a |
| cluster_endpoint | n/a |
| cpu_node_role_name | IAM Node Role Name for CPU node pools |
| gpu_node_role_name | IAM Node Role Name for GPU node pools |
| kube_exec_api_version | n/a |
| kube_exec_args | n/a |
| kube_exec_command | n/a |
| nodes | n/a |
| oidc_endpoint | n/a |
| private_subnet_ids | n/a |
| public_subnet_ids | n/a |
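As an illustrative sketch (not part of the module itself), the cluster outputs can be used to configure a Kubernetes provider in the calling configuration; whether cluster_ca_certificate needs base64-decoding depends on how the module exposes that output:

# Sketch only: assumes the module call shown in the "Running as a module" section.
provider "kubernetes" {
  host                   = module.nvidia-eks.cluster_endpoint
  # EKS returns the CA certificate base64-encoded; decode it if the output is passed through as-is.
  cluster_ca_certificate = base64decode(module.nvidia-eks.cluster_ca_certificate)

  exec {
    api_version = module.nvidia-eks.kube_exec_api_version
    command     = module.nvidia-eks.kube_exec_command
    args        = module.nvidia-eks.kube_exec_args
  }
}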