@@ -10,15 +10,16 @@ jobs on AWS.
 
 ## Requirements
 
-1. `pip install boto3`
-2. `git clone https://github.com/pytorch/elastic.git`
+1. `git clone https://github.com/pytorch/elastic.git`
+2. `cd elastic/aws && pip install -r requirements.txt`
 3. The following AWS resources:
     1. EC2 [instance profile](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html)
     2. [Subnet(s)](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html#create-default-subnet)
     3. [Security group](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html#DefaultSecurityGroup)
     4. EFS volume
     5. S3 Bucket
-
+4. [Install](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)
+   the AWS Session Manager plugin
 
 ## Quickstart
 
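Before starting, the local prerequisites from the Requirements list can be sanity-checked with a short script. This is only a sketch and not part of the repository: the helper name `check_local_requirements` is hypothetical, and it assumes that `boto3` is among the packages installed from `aws/requirements.txt` and that the Session Manager plugin puts a `session-manager-plugin` executable on your `PATH`.

```python
import importlib.util
import shutil


def check_local_requirements():
    """Report which local prerequisites appear to be installed.

    Hypothetical helper: it only checks the local machine and does not
    verify any of the AWS resources (instance profile, subnets, etc.).
    """
    return {
        # Assumption: boto3 is listed in aws/requirements.txt.
        "boto3": importlib.util.find_spec("boto3") is not None,
        # Assumption: the plugin installs an executable with this name.
        "session-manager-plugin": shutil.which("session-manager-plugin") is not None,
        # git is needed for the clone step.
        "git": shutil.which("git") is not None,
    }


for name, present in check_local_requirements().items():
    print(f"{name}: {'found' if present else 'MISSING'}")
```

Anything reported as `MISSING` would need to be installed before running `petctl`.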
@@ -69,7 +70,7 @@ you have downloaded the imagenet dataset to `/mnt/efs/fs1/data/imagenet/train`.
 To run the script we'll use `petctl`,
 
 ```bash
-python3 petctl.py run_job --size 2 --min_size 1 --max_size 3 --name ${USER}-job examples/imagenet/main.py -- --input_path /mnt/efs/fs1/data/imagenet/train
+python3 aws/petctl.py run_job --size 2 --min_size 1 --max_size 3 --name ${USER}-job examples/imagenet/main.py -- --input_path /mnt/efs/fs1/data/imagenet/train
 ```
 
 In the example above, the named arguments, such as, `--size`, `--min_size`, and
@@ -158,20 +159,20 @@ You can take a look at their console outputs by running
 
 ```bash
 # see the status of the worker
-systemctl status torchelastic_worker
+sudo systemctl status torchelastic_worker
 # get the container id
-docker ps
+sudo docker ps
 # tail the container logs
-docker logs -f <container id>
+sudo docker logs -f <container id>
 ```
 
 > Note since we have configured the log driver to be `awslogs` tailing
 the docker logs will not work. For more information see: https://docs.docker.com/config/containers/logging/awslogs/
 
 You can also manually stop and start the workers by running
 ```bash
-systemctl stop torchelastic_worker
-systemctl start torchelastic_worker
+sudo systemctl stop torchelastic_worker
+sudo systemctl start torchelastic_worker
 ```
 
 > **EXERCISE:** Try stopping or adding worker(s) to see elasticity in action!
@@ -188,7 +189,7 @@ that is monitoring the job!). In practice consider using EKS, Batch, or SageMake
 To stop the job and tear down the resources, use the `kill_job` command:
 
 ```bash
-python3 petctl.py kill_job --name ${USER}-job
+python3 petctl.py kill_job ${USER}-job
 ```
 
 You'll notice that the two ASGs created with the `run_job` command are deleted.
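One way to confirm the teardown is to list the Auto Scaling groups that still match the job name. The sketch below is not part of `petctl`: both function names are hypothetical, and it assumes that the ASG names contain the job name, which the README does not state. It uses the boto3 `describe_auto_scaling_groups` paginator and requires AWS credentials to be configured.

```python
def asg_names_for_job(pages, job_name):
    """Collect Auto Scaling group names that contain job_name.

    `pages` is an iterable of DescribeAutoScalingGroups response pages.
    Assumption: the ASGs created for a job include the job name in
    their own names.
    """
    names = []
    for page in pages:
        for group in page.get("AutoScalingGroups", []):
            name = group["AutoScalingGroupName"]
            if job_name in name:
                names.append(name)
    return names


def remaining_asgs(job_name):
    """Query AWS for ASGs that still match the job (needs credentials)."""
    import boto3  # imported lazily so asg_names_for_job works without boto3

    client = boto3.client("autoscaling")
    paginator = client.get_paginator("describe_auto_scaling_groups")
    return asg_names_for_job(paginator.paginate(), job_name)


# Example (makes real AWS API calls, so it is left commented out):
# print(remaining_asgs("alice-job") or "no matching ASGs found")
```

An empty result after `kill_job` would suggest that both groups are gone. Deletion can take a short while to complete, so a group may still be listed immediately after the command returns.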