This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Commit 6d6e44a

Kiuk Chung authored and facebook-github-bot committed
Make setup.yml cfn template support multiple stacks in a single account, fix bug in petctl setup where None was being passed to cfn param, pump docker logs to CloudWatch
Summary:
1. Uses the docker `awslogs` log driver to send docker output to CloudWatch (see screenshots below).
2. Item 1 creates a log group called `torchelastic/$USER` in CloudWatch and one log stream per worker called `$job_name/$instance_id`.
3. Fixes a bug in `petctl setup` where, if no EFS and S3 bucket are specified, `None` is passed to the cfn template parameter, which throws a validation error because the template expects a string.
4. Fixes an issue with the cfn template where the CloudWatch IAM managed policy was created with a fixed name, which prevented multiple stacks from being created in the same account.

Thanks to Vinicius Reis for testing `petctl` and reporting bugs 3 and 4.

{F223965947} {F223965943}

Reviewed By: vreis
Differential Revision: D18826855
fbshipit-source-id: 2d75f607734135ab6d5301fc636501a38cfee9d9
1 parent 0fd1b10 commit 6d6e44a

5 files changed

Lines changed: 35 additions & 15 deletions


aws/README.md

Lines changed: 13 additions & 8 deletions
@@ -133,8 +133,13 @@ Auto Scaling Groups
 1. etcd server
 2. workers
 
-#### SSH onto worker nodes
-To SSH onto the worker nodes to inspect the worker process we use AWS
+#### Inspect the logs
+Log into the AWS CloudWatch Logs console. You should see a log group called
+`torchelastic/$USER`. Under it there will be a log stream per instance with the
+name `$job_name/$instance_id` (e.g. `my_job/i0b938EXAMPLE`).
+
+#### Troubleshooting
+To SSH onto the worker nodes to debug/inspect the worker process use AWS
 Session Manager instead of the ec2 key pair. [Install](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)
 the Session Manager plugin and run
 
@@ -160,18 +165,18 @@ docker ps
 docker logs -f <container id>
 ```
 
+> Note since we have configured the log driver to be `awslogs` tailing
+the docker logs will not work. For more information see: https://docs.docker.com/config/containers/logging/awslogs/
+
 You can also manually stop and start the workers by running
 ``` bash
 systemctl stop torchelastic_worker
 systemctl start torchelastic_worker
 ```
 
-> **EXCERCISE:** Open up two terminals and SSH onto each worker. Tail the docker logs
-on each worker. Now stop worker 1 and observe the worker 2 re-rendezvous and
-since `--min_size=1` it continues training by itself. Now restart worker 1 and
-observe that worker 2 notices that worker 1 is waiting to join and re-rendezvous,
-the `state` object in worker 2 is `sync()`'ed to worker 1 and both resume training
-without loss of progress.
+> **EXCERCISE:** Try stopping or adding worker(s) to see elasticity in action!
+To add workers, simply increase the `desired` size of the worker autoscaling group.
+
 
 > **Note**: by design, `petctl` tries to use the least number of AWS services. This
 was done intentionally to allow non-AWS users to easily transfer the functionality

aws/cfn/setup.yml

Lines changed: 0 additions & 1 deletion
@@ -252,7 +252,6 @@ Resources:
     Type: AWS::IAM::ManagedPolicy
     Properties:
       Description: "Allows container instances to use CloudWatch APIs"
-      ManagedPolicyName: "ContainerCloudWatchLogsPolicy"
       Path: "/"
       PolicyDocument:
         Version: "2012-10-17"

aws/cloudformation.py

Lines changed: 12 additions & 5 deletions
@@ -10,6 +10,8 @@
 import getpass
 import logging
 import os
+import random
+import string
 
 from jinja2 import Template
 from util import wait_for
@@ -25,18 +27,23 @@ def __init__(self, session):
 
     def create_specs_file(self, specs_file, s3_bucket_name, efs_id):
         username = getpass.getuser()
-        stack_name = f"torchelastic-{username}"
+        rand = "".join(random.choices(string.ascii_uppercase + string.digits, k=5))
+        hash = f"{username}-{rand}"
+        stack_name = f"torchelastic-{hash}"
         this_dir = os.path.dirname(__file__)
         cfn_template = os.path.join(this_dir, "cfn/setup.yml")
         sample_specs = os.path.join(this_dir, "config/sample_specs.json")
 
         params = {
-            "S3BucketName": s3_bucket_name,
-            "EFSFileSystemId": efs_id,
-            "WorkerRoleName": f"torchelastic_worker_role-{username}",
-            "RendezvousRoleName": f"torchelastic_rendezvous_role-{username}",
+            "WorkerRoleName": f"torchelastic_worker_role-{hash}",
+            "RendezvousRoleName": f"torchelastic_rendezvous_role-{hash}",
         }
 
+        if s3_bucket_name:
+            params["S3BucketName"] = s3_bucket_name
+        if efs_id:
+            params["EFSFileSystemId"] = efs_id
+
         self.create_stack(stack_name, cfn_template, **params)
 
         for _ in wait_for(
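
The parameter handling in this hunk can be sketched on its own, outside the class. The following is a minimal illustration, not the repository's code: the helper names `unique_suffix` and `build_stack_params` are hypothetical, and it assumes the cfn template declares defaults for `S3BucketName` and `EFSFileSystemId` so that leaving them out is valid (the diff does not show that part of the template).

```python
import random
import string


def unique_suffix(username, k=5):
    """Return '<username>-<k random chars>', mirroring the `hash` value in the diff."""
    alphabet = string.ascii_uppercase + string.digits
    rand = "".join(random.choices(alphabet, k=k))
    return f"{username}-{rand}"


def build_stack_params(suffix, s3_bucket_name=None, efs_id=None):
    """Build template parameters, leaving out optional ones that are unset.

    Passing None (or "") through to the template is what triggered the
    validation error described in the commit summary, so unset values are
    dropped instead of forwarded.
    """
    params = {
        "WorkerRoleName": f"torchelastic_worker_role-{suffix}",
        "RendezvousRoleName": f"torchelastic_rendezvous_role-{suffix}",
    }
    if s3_bucket_name:
        params["S3BucketName"] = s3_bucket_name
    if efs_id:
        params["EFSFileSystemId"] = efs_id
    return params


if __name__ == "__main__":
    suffix = unique_suffix("alice")  # "alice" is a placeholder username
    print(build_stack_params(suffix))
    print(build_stack_params(suffix, efs_id="fs-EXAMPLE"))
```

Because the suffix is random, two runs by the same user produce different stack and role names, which appears to be what lets several stacks coexist in one account.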

aws/config/user_data_worker

Lines changed: 8 additions & 0 deletions
@@ -47,6 +47,9 @@ cat > /var/torchelastic/run_worker <<\EOL
 container_name=$1
 shift
 
+region=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq .region -r)
+instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
+
 docker run \
 --init \
 --net=host \
@@ -55,6 +58,11 @@ docker run \
 --env-file /var/torchelastic/worker.env \
 -v /mnt/efs/fs1:/mnt/efs/fs1 \
 --name ${container_name} \
+--log-driver=awslogs \
+--log-opt awslogs-region=${region} \
+--log-opt awslogs-group=torchelastic/{{ user }} \
+--log-opt awslogs-create-group=true \
+--log-opt awslogs-stream=${container_name}/${instance_id} \
 {{ docker_image }} $*
 EOL
 
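
To make the resulting log layout easier to see, here is a small Python sketch that assembles the same `docker run` logging options from plain values. The function name `awslogs_options` and the sample values are hypothetical. The README describes the stream as `$job_name/$instance_id`, which suggests the container name passed to the script is the job name, but that is an inference from the diff rather than something it states.

```python
def awslogs_options(region, user, container_name, instance_id):
    """Return the docker CLI arguments that route container output to CloudWatch.

    Mirrors the options added to run_worker: one log group per user and
    one log stream per (container, instance) pair.
    """
    return [
        "--log-driver=awslogs",
        "--log-opt", f"awslogs-region={region}",
        "--log-opt", f"awslogs-group=torchelastic/{user}",
        "--log-opt", "awslogs-create-group=true",
        "--log-opt", f"awslogs-stream={container_name}/{instance_id}",
    ]


if __name__ == "__main__":
    # Placeholder values; the instance id is the example from the README.
    opts = awslogs_options("us-west-2", "alice", "my_job", "i0b938EXAMPLE")
    print(" ".join(opts))
```

With these placeholder values the worker's output would land in the log group `torchelastic/alice` under the stream `my_job/i0b938EXAMPLE`.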

aws/petctl.py

Lines changed: 2 additions & 1 deletion
@@ -139,7 +139,7 @@ def parse_arguments(args, **default_args):
         help="s3 bucket to use for running petctl (if empty, one is created)",
     )
     parser_setup.add_argument(
-        "--efs_id", default="", help="efs id to use, if empty, one is created"
+        "--efs_id", help="efs id to use, if empty, one is created"
     )
 
     petctl_args, script_args = split_args(args[1:])
@@ -190,6 +190,7 @@ def run_job(session, specs_json, args):
     worker_specs["job_name"] = job_name
     worker_specs["script"] = script
     worker_specs["args"] = " ".join(script_args)
+    worker_specs["user"] = getpass.getuser()
 
     instance_type = worker_specs["instance_type"]
     script_args_str = worker_specs["args"]
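
Dropping `default=""` changes what an omitted `--efs_id` looks like to the rest of the code: `argparse` then stores `None` instead of an empty string. The sketch below shows that behavior with a stand-alone parser (the `build_parser` helper and the `petctl-sketch` program name are hypothetical, not the real `petctl` subcommand setup) and why a truthiness check treats both cases as "not provided".

```python
import argparse


def build_parser():
    """A stand-alone parser with only the --efs_id option from the diff."""
    parser = argparse.ArgumentParser(prog="petctl-sketch")
    parser.add_argument("--efs_id", help="efs id to use, if empty, one is created")
    return parser


def is_provided(value):
    """True only for a non-empty value; None and "" both count as unset."""
    return bool(value)


if __name__ == "__main__":
    omitted = build_parser().parse_args([])
    given = build_parser().parse_args(["--efs_id", "fs-EXAMPLE"])
    print(omitted.efs_id, is_provided(omitted.efs_id))
    print(given.efs_id, is_provided(given.efs_id))
```

A guard such as `if efs_id:` therefore skips the parameter whether the caller passed nothing or an empty string.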
