Commit f59d968

Multi-Node EKS Support mainline PR (#111)

* Multi-Node EKS Support

1 parent c9c851d commit f59d968

37 files changed, +3329 −1 lines changed
.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

```
@@ -65,7 +65,7 @@ repos:
       - id: check-json
       - id: check-toml
       - id: check-yaml
-        exclude: ^Deployment/Kubernetes/[^/]+/chart/templates/.+$
+        exclude: ^Deployment/Kubernetes/.+$
       - id: check-shebang-scripts-are-executable
       - id: end-of-file-fixer
         types_or: [c, c++, cuda, proto, textproto, java, python]
```
New file — Lines changed: 197 additions & 0 deletions
# Steps to create EKS cluster with EFS

## 1. Install CLIs

### a. Install AWS CLI (steps [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))

```
sudo apt install unzip
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```

### b. Install Kubernetes CLI (steps [here](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html))

```
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.30.0/2024-05-12/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
```

### c. Install EKS CLI (steps [here](https://eksctl.io/installation/))

```
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
```

### d. Install Helm CLI (steps [here](https://docs.aws.amazon.com/eks/latest/userguide/helm.html))

```
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
```
## 2. Create an EKS cluster

In this example we create an EKS cluster consisting of two `g5.12xlarge` compute nodes, each with four NVIDIA A10G GPUs, plus a `c5.2xlarge` CPU node that acts as the system (control) node. We also set up EFA (Elastic Fabric Adapter) between the compute nodes.
### a. Configure AWS CLI

```
aws configure
```
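`aws configure` prompts for your credentials interactively. If you are scripting the setup, the same values can be supplied non-interactively; the values below are placeholders, not real credentials:

```
# Placeholders only -- substitute your own credentials and region
aws configure set aws_access_key_id "YOUR_ACCESS_KEY_ID"
aws configure set aws_secret_access_key "YOUR_SECRET_ACCESS_KEY"
aws configure set region us-east-1
```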
### b. Create a config file for EKS cluster creation

We have provided an example file here: [eks_cluster_config.yaml](./eks_cluster_config.yaml)

```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: wenhant-eks-cluster
  version: "1.30"
  region: us-east-1

availabilityZones:
  - us-east-1a
  - us-east-1b
  - us-east-1c
  - us-east-1d
  - us-east-1e
  - us-east-1f

iam:
  withOIDC: true

managedNodeGroups:
  - name: sys-nodes-2
    instanceType: c5.2xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["us-east-1a"]
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true

  - name: efa-compute-ng-2
    instanceType: g5.12xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    volumeSize: 300
    efaEnabled: true
    privateNetworking: true
    availabilityZones: ["us-east-1a"]
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
```
> [!NOTE]
> We set `minSize` and `desiredCapacity` to 1 because AWS will not create your cluster successfully if the requested nodes are unavailable. For example, if you set `desiredCapacity` to 2 but two instances are not available, cluster creation fails with a timeout even though no error is reported. The easiest way to avoid this is to create the cluster with 1 node per group and increase the node count later in the EKS console. After you increase the number of nodes in your node groups, make sure the GPU nodes are in the same subnet; this is required for EFA to work.
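Scaling can also be done from the command line instead of the console; a sketch using `eksctl scale nodegroup`, with the cluster and node-group names taken from the example config above:

```
eksctl scale nodegroup \
  --cluster wenhant-eks-cluster \
  --name efa-compute-ng-2 \
  --nodes 2 \
  --nodes-max 2
```

`--nodes-max` is raised alongside `--nodes` because the example config caps `maxSize` at 1.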
### c. Create the EKS cluster

```
eksctl create cluster -f eks_cluster_config.yaml
```
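`eksctl` normally writes a kubeconfig entry for the new cluster automatically. If `kubectl` is not yet pointed at it, the context can be set with the AWS CLI (cluster name and region from the example config above) and then verified:

```
aws eks update-kubeconfig --region us-east-1 --name wenhant-eks-cluster
kubectl get nodes -o wide
```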
## 3. Create an EFS file system

To enable multiple pods deployed to multiple nodes to load shards of the same model, so that they can be used in coordination to serve an inference request too large to be loaded by a single GPU, we need a common, shared storage location. In Kubernetes, such shared storage locations are referred to as persistent volumes. Persistent volumes can be volume-mapped into any number of pods, and processes inside those pods can access them as if they were part of the pod's file system. We will use EFS as the persistent volume.

Additionally, we will need to create a persistent volume claim, which can be used to assign the persistent volume to a pod.
### a. Create an IAM role

Follow these steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI driver.

### b. Install EFS CSI driver

Install the EFS CSI driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once it is done, check the Add-ons section in the EKS console; the driver should show `Active` under Status.
### c. Create EFS file system

Follow these steps to create an EFS file system: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md. Make sure you mount the subnets correctly in the last step; this determines whether your nodes are able to access the created EFS file system.
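The guide linked above walks through the console; the same can be sketched with the AWS CLI. All IDs below are placeholders — use the subnets and security group of your node group:

```
# Create the file system (placeholder region)
aws efs create-file-system --encrypted --performance-mode generalPurpose --region us-east-1

# Create one mount target per subnet that hosts your nodes (placeholder IDs)
aws efs create-mount-target \
  --file-system-id fs-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0
```

The security group attached to the mount targets must allow inbound NFS (TCP 2049) from the node subnets, or the nodes will not be able to mount the file system.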
## 4. Test

Follow these steps to check whether your EFS file system works with your nodes: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods. This test mounts your EFS file system on all of your available nodes and writes a text file to the file system.
## 5. Create a PVC for the created EFS file system

We have provided an example here: [pvc](./pvc/). This folder contains three files: `pv.yaml`, `claim.yaml`, and `storageclass.yaml`. Make sure you modify the `pv.yaml` file and change the `volumeHandle` value to your own EFS file system ID.

pv.yaml

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0cf1f987d6f5af59c # Change to your own ID
```

claim.yaml

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 200Gi
```

storageclass.yaml

```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
```

Run the following command to deploy:

```
kubectl apply -f pvc/
```
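Once the claim is bound, any pod can mount it. A minimal sketch — the pod and volume names are hypothetical; only `efs-claim` comes from the files above:

```
apiVersion: v1
kind: Pod
metadata:
  name: efs-smoke-test        # hypothetical name
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo hello > /models/hello.txt && sleep 3600"]
      volumeMounts:
        - name: model-store   # hypothetical volume name
          mountPath: /models
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: efs-claim  # from claim.yaml above
```

Because the claim uses `ReadWriteMany`, the same PVC can be mounted by pods on every node, which is what lets multiple workers read shards of one model.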
