This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Commit 79daae3

Kiuk Chung authored and facebook-github-bot committed
update docs
Summary:
1. mention that the entrypoint to the torchelastic/examples docker image is torchelastic.distributed.launch
2. added a section on "next steps" which directs users to try building their own worker image and mentions that you should consider using S3 or Blob Storage for checkpoints (cc chauhang)
3. added instructions on tailing worker logs
4. shortened example outputs for brevity
5. documented "rdzv-id" caveats

Reviewed By: tierex

Differential Revision: D20927596

fbshipit-source-id: d987e8c3e57e41ab5aad38d41794faef94e088c3
1 parent 3751dcb commit 79daae3

1 file changed

Lines changed: 43 additions & 59 deletions

kubernetes/README.md

@@ -130,37 +130,38 @@ kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job
 etcd-service ClusterIP 10.100.104.168 <none> 2379/TCP 5m5s
 ```

-1. Update `config/samples/<imagenet.yaml|classy_vision.yaml>`:
+1. Update `config/samples/imagenet.yaml`:
 1. set `rdzvEndpoint` (e.g. `10.100.104.168:2379`) to the etcd server you just provisioned.
 1. set `minReplicas` and `maxReplicas` to the desired min and max num nodes
 (max should not exceed your cluster capacity)
 1. set `Worker.replicas` to the number of nodes to start with (you may
 modify this later to scale the job in/out)
 1. set the correct `--nproc_per_node` in `container.args` based on the
 instance you are running on.
-
-> **IMPORTANT** A `Worker` in the context of kubernetes refers to `Node` in
+
+> **NOTE** the `ENTRYPOINT` to `torchelastic/examples` is
+`python -m torchelastic.distributed.launch <args...>`. Notice that you
+do not have to specify certain `launch` options such as `--rdzv_endpoint`,
+and `--rdzv_id`. These are set automatically by the controller.
+
+> **IMPORTANT** a `Worker` in the context of kubernetes refers to `Node` in
 `torchelastic.distributed.launch`. Each kubernetes `Worker` can run multiple
 trainers processes (a.k.a `worker` in `torchelastic.distributed.launch`).
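
To make the `Worker` versus trainer distinction concrete, the sketch below multiplies the number of kubernetes `Worker` replicas by `--nproc_per_node` to get the total number of trainer processes. It only does that arithmetic; the function name `total_trainers` is made up for illustration and is not part of torchelastic or the controller.

```shell
# Hypothetical helper (not part of torchelastic): number of trainer processes
# that result from a given Worker.replicas and --nproc_per_node.
total_trainers() {
    replicas="$1"        # Worker.replicas in the job spec
    nproc_per_node="$2"  # --nproc_per_node in container.args
    echo $((replicas * nproc_per_node))
}

total_trainers 2 1   # two Workers, one trainer each
total_trainers 2 4   # two Workers, four trainers each
```

With two `Worker` replicas this prints `2` for `--nproc_per_node=1` and `8` for `--nproc_per_node=4`.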

+
 1. Submit the training job.

 ```
 kubectl apply -f config/samples/imagenet.yaml
 ```

 As you can see, training pod and headless services have been created.
-```yaml
+```
 $ kubectl get pods -n elastic-job
 NAME READY STATUS RESTARTS AGE
 elastic-job-k8s-controller-6d4884c75b-z22cm 1/1 Running 0 11m
 imagenet-worker-0 1/1 Running 0 5s
 imagenet-worker-1 1/1 Running 0 5s
-
-$ kubectl get svc -n elastic-job
-NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
-imagenet-worker-0 ClusterIP None <none> 10291/TCP 34s
-imagenet-worker-1 ClusterIP None <none> 10291/TCP 34s
 ```

 1. You can scale the number of nodes by adjusting
@@ -173,57 +174,25 @@ kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job
 increments of `nproc_per_node` trainers. In our case ``--nproc_per_node=1``
 For better performance consider using an instance with multiple
 GPUs and setting `--nproc_per_node=$NUM_CUDA_DEVICES`.
-
+
+> **WARNING** the name of the job is used as `rdzv_id`, which is used
+to uniquely identify a job run instance. Hence to run multiple parallel
+jobs with the same spec you need to change `.spec.metadata.name` to
+give it a unique run id (e.g. `imagenet_run_0`). Otherwise the new nodes
+will attempt to join the membership of a different run.
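
One way to follow the warning above is to derive a fresh job name for every run. The sketch below is an illustration only: `unique_job_name` is a hypothetical helper, not part of torchelastic or `kubectl`. It joins with hyphens rather than underscores on the assumption that Kubernetes may reject object names containing characters outside lowercase alphanumerics and `-`.

```shell
# Hypothetical helper (not part of torchelastic or kubectl): build a job name
# that is unique per run from a base name and a run suffix.
unique_job_name() {
    base="$1"    # e.g. the sample job name, imagenet
    suffix="$2"  # e.g. a run counter or a timestamp
    # Lowercase the result and replace characters outside [a-z0-9-] with '-'.
    printf '%s-run-%s\n' "$base" "$suffix" \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/[^a-z0-9-]/-/g'
}

unique_job_name imagenet 0              # prints imagenet-run-0
unique_job_name imagenet "$(date +%s)"  # suffix taken from the current Unix time
```

The printed value would then be written into the job's name field in a copy of the spec before running `kubectl apply -f` on it.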
+
+
 ### Monitoring jobs

 You can describe the job to check job status and job related events.
 In following example, `imagenet` job is created in `elastic-job` namespace, change to use your job name and namespace in your command.

 ```
-kubectl describe elasticjob imagenet -n elastic-job
+$ kubectl describe elasticjob imagenet -n elastic-job

 Name: imagenet
 Namespace: elastic-job
-Labels: <none>
-Annotations: kubectl.kubernetes.io/last-applied-configuration:
-{"apiVersion":"elastic.pytorch.org/v1alpha1","kind":"ElasticJob","metadata":{"annotations":{},"name":"imagenet","namespace":"elastic-job"}...
-API Version: elastic.pytorch.org/v1alpha1
-Kind: ElasticJob
-Metadata:
-Creation Timestamp: 2020-03-19T10:30:55Z
-Generation: 5
-Resource Version: 2110451
-Self Link: /apis/elastic.pytorch.org/v1alpha1/namespaces/elastic-job/elasticjobs/imagenet
-UID: b6f6b7ae-69cc-11ea-b995-0653198c16be
-Spec:
-Run Policy:
-Max Replicas: 5
-Min Replicas: 1
-Rdzv Endpoint: etcd-service:2379
-Replica Specs:
-Worker:
-Replicas: 2
-Restart Policy: ExitCode
-Template:
-Metadata:
-Creation Timestamp: <nil>
-Spec:
-Containers:
-Args:
-/workspace/examples/imagenet/main.py
---input_path
-/data/tiny-imagenet-200/train
---epochs
-10
-Image: seedjeffwan/examples:0.1.0rc1
-Image Pull Policy: Always
-Name: elasticjob-worker
-Ports:
-Container Port: 10291
-Name: elasticjob-port
-Resources:
-Limits:
-nvidia.com/gpu: 1
+<... OMITTED ...>
 Status:
 Conditions:
 Last Transition Time: 2020-03-19T10:30:55Z
@@ -232,21 +201,36 @@ Status:
 Reason: ElasticJobRunning
 Status: True
 Type: Running
-Replica Statuses:
-Worker:
-Active: 3
+<... OMITTED ...>
 Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Normal SuccessfulCreatePod 13s elastic-job-controller Created pod: imagenet-worker-0
-Normal SuccessfulCreatePod 13s elastic-job-controller Created pod: imagenet-worker-1
-Normal SuccessfulCreatePod 13s elastic-job-controller Created pod: imagenet-worker-2
-Normal SuccessfulCreateService 13s elastic-job-controller Created service: imagenet-worker-0
-Normal SuccessfulCreateService 13s elastic-job-controller Created service: imagenet-worker-1
-Normal SuccessfulCreateService 13s elastic-job-controller Created service: imagenet-worker-2
+```

+Tail the logs of a worker:
+
+```
+$ kubectl logs -f -n elastic-job imagenet-worker-0
 ```

+### Next Steps
+
+We have included other sample job specs in the `config/samples` directory
+(e.g. `config/samples/classy_vision.yaml`), try them out by
+replacing `imagenet.yaml` with the appropriate spec filename in the
+instructions above.
+
+To use your own script, build a docker image containing your script.
+You can use `torchelastic/examples` as your base image. Then point your
+job specs to use your container by editing `Worker.template.spec.containers.image`.
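
As a rough sketch of the image-building step described above, the snippet below writes a minimal Dockerfile that layers a training script on top of `torchelastic/examples`. The script name `train.py`, the destination `/workspace/train.py`, and the tag `my-registry/my-trainer:latest` are placeholders chosen for illustration; none of them come from the repository. The `docker` commands are left commented out so the snippet runs without Docker installed.

```shell
# Sketch only: train.py, /workspace/train.py and my-registry/my-trainer:latest
# are placeholders, not names taken from the torchelastic repository.
workdir="$(mktemp -d)"

cat > "$workdir/Dockerfile" <<'EOF'
# Base image named in the README; pin a tag that exists in your registry.
FROM torchelastic/examples
# Copy your own training script into the image (placeholder path).
COPY train.py /workspace/train.py
EOF

# Show the generated Dockerfile.
cat "$workdir/Dockerfile"

# Build and push after placing train.py next to the Dockerfile:
# docker build -t my-registry/my-trainer:latest "$workdir"
# docker push my-registry/my-trainer:latest
```

The pushed tag is what `Worker.template.spec.containers.image` would point to. Assuming the base image's `ENTRYPOINT` is inherited unchanged, the first entry in `container.args` would be the script path inside the image.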
+
+Our examples save checkpoints and models in the container hence the trained
+model and checkpoints are not accessible after the job is complete. In your
+scripts use a persistent store like AWS S3 or Azure Blob Storage.
+
+
+
 ### Trouble Shooting

 Please check [TROUBLESHOOTING.md](./TROUBLESHOOTING.md)
