This repository was archived by the owner on Jan 6, 2023. It is now read-only.
Make setup.yml cfn template support multiple stacks in a single account, fix bug in petctl setup where None was being passed to cfn param, pump docker logs to CloudWatch
Summary:
1. Uses the docker `awslogs` log driver to send docker output to CloudWatch (see screenshots below)
2. #1 creates a log group called `torchelastic/$USER` in CloudWatch and creates log streams (one per worker) called `$job_name/$instance_id`
3. Fixes a bug in `petctl setup` where, if no EFS and S3 buckets are specified, `None` is passed to the cfn template param, which throws a validation error because the template expects a string
4. Fixes an issue with the cfn template where the CloudWatch IAM managed policy was being created with a specific name, preventing multiple stacks from being created in the same account.
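The failure mode in item 3 can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `to_cfn_parameters` and the parameter names are hypothetical, and replacing `None` with an empty string is an assumption about how the fix might look. Only the `ParameterKey`/`ParameterValue` shape follows the CloudFormation API.

```python
from typing import Dict, List, Optional


def to_cfn_parameters(values: Dict[str, Optional[str]]) -> List[Dict[str, str]]:
    """Convert a name -> value mapping into CloudFormation-style parameters.

    A String parameter fails validation when it receives a non-string value,
    so an unset (None) option is sent as an empty string instead.
    Assumption: the template treats "" as "not provided".
    """
    return [
        {"ParameterKey": key, "ParameterValue": "" if value is None else str(value)}
        for key, value in values.items()
    ]


# Hypothetical parameter names, for illustration only.
params = to_cfn_parameters({"S3BucketName": None, "EFSFileSystemId": "fs-0123EXAMPLE"})
```

Every value in `params` is now a `str`, so a list like this could be passed as the `Parameters` argument when creating the stack.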
#thanks Vinicius Reis for testing `petctl` and reporting bugs #3 and #4.
{F223965947}
{F223965943}
Reviewed By: vreis
Differential Revision: D18826855
fbshipit-source-id: 2d75f607734135ab6d5301fc636501a38cfee9d9
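As a rough sketch of the naming scheme in items 1 and 2 of the summary, the docker logging flags could be assembled as below. The function name `awslogs_docker_args`, the region argument, and the use of `awslogs-create-group` are assumptions for illustration; the commit may configure the driver differently. The `--log-driver` and `--log-opt awslogs-*` options themselves are standard docker options.

```python
from typing import List


def awslogs_docker_args(user: str, job_name: str, instance_id: str, region: str) -> List[str]:
    """Build `docker run` flags that send container output to CloudWatch Logs.

    Log group:  torchelastic/$USER
    Log stream: $job_name/$instance_id (one stream per worker instance)
    """
    return [
        "--log-driver=awslogs",
        "--log-opt", f"awslogs-region={region}",
        "--log-opt", f"awslogs-group=torchelastic/{user}",
        "--log-opt", f"awslogs-stream={job_name}/{instance_id}",
        # Assumption: let docker create the log group if it does not exist yet.
        "--log-opt", "awslogs-create-group=true",
    ]


# Hypothetical values, for illustration only.
args = awslogs_docker_args("alice", "my_job", "i0b938EXAMPLE", "us-west-2")
```

These flags would be placed between `docker run` and the image name. With this driver configured, the container output appears in the named log stream rather than through `docker logs`.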
aws/README.md (13 additions & 8 deletions)
--- a/aws/README.md
+++ b/aws/README.md
@@ -133,8 +133,13 @@ Auto Scaling Groups
 1. etcd server
 2. workers
 
-#### SSH onto worker nodes
-To SSH onto the worker nodes to inspect the worker process we use AWS
+#### Inspect the logs
+Log into the AWS CloudWatch Logs console. You should see a log group called
+`torchelastic/$USER`. Under it there will be a log stream per instance with the
+name `$job_name/$instance_id` (e.g. `my_job/i0b938EXAMPLE`).
+
+#### Troubleshooting
+To SSH onto the worker nodes to debug/inspect the worker process use AWS
 Session Manager instead of the ec2 key pair. [Install](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)
 the Session Manager plugin and run
 
@@ -160,18 +165,18 @@ docker ps
 docker logs -f <container id>
 ```
 
+> Note since we have configured the log driver to be `awslogs` tailing
+the docker logs will not work. For more information see: https://docs.docker.com/config/containers/logging/awslogs/
+
 You can also manually stop and start the workers by running
 ```bash
 systemctl stop torchelastic_worker
 systemctl start torchelastic_worker
 ```
 
-> **EXCERCISE:** Open up two terminals and SSH onto each worker. Tail the docker logs
-on each worker. Now stop worker 1 and observe the worker 2 re-rendezvous and
-since `--min_size=1` it continues training by itself. Now restart worker 1 and
-observe that worker 2 notices that worker 1 is waiting to join and re-rendezvous,
-the `state` object in worker 2 is `sync()`'ed to worker 1 and both resume training
-without loss of progress.
+> **EXCERCISE:** Try stopping or adding worker(s) to see elasticity in action!
+To add workers, simply increase the `desired` size of the worker autoscaling group.
+
 
 > **Note**: by design, `petctl` tries to use the least number of AWS services. This
 was done intentionally to allow non-AWS users to easily transfer the functionality