@@ -130,37 +130,38 @@ kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job
130130 etcd-service ClusterIP 10.100.104.168 <none> 2379/TCP 5m5s
131131 ```
132132
133- 1. Update `config/samples/<imagenet.yaml|classy_vision.yaml>`:
133+ 1. Update `config/samples/imagenet.yaml`:
134134 1. set `rdzvEndpoint` (e.g. `10.100.104.168:2379`) to the etcd server you just provisioned.
135135 1. set `minReplicas` and `maxReplicas` to the desired min and max num nodes
136136 (max should not exceed your cluster capacity)
137137 1. set `Worker.replicas` to the number of nodes to start with (you may
138138 modify this later to scale the job in/out)
139139 1. set the correct `--nproc_per_node` in `container.args` based on the
140140 instance you are running on.
141-
142- > **IMPORTANT** A `Worker` in the context of kubernetes refers to `Node` in
141+
142+ > **NOTE** the `ENTRYPOINT` of the `torchelastic/examples` image is
143+ `python -m torchelastic.distributed.launch <args...>`. Notice that you
144+ do not have to specify certain `launch` options such as `--rdzv_endpoint`
145+ and `--rdzv_id`. These are set automatically by the controller.
146+
147+ > **IMPORTANT** a `Worker` in the context of kubernetes refers to `Node` in
143148 `torchelastic.distributed.launch`. Each kubernetes `Worker` can run multiple
144149 trainer processes (a.k.a. `worker` in `torchelastic.distributed.launch`).
145150
151+
1461521. Submit the training job.
147153
148154 ```
149155 kubectl apply -f config/samples/imagenet.yaml
150156 ```
151157
152158 As you can see, the training pods and headless services have been created.
153- ```yaml
159+ ```
154160 $ kubectl get pods -n elastic-job
155161 NAME READY STATUS RESTARTS AGE
156162 elastic-job-k8s-controller-6d4884c75b-z22cm 1/1 Running 0 11m
157163 imagenet-worker-0 1/1 Running 0 5s
158164 imagenet-worker-1 1/1 Running 0 5s
159-
160- $ kubectl get svc -n elastic-job
161- NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
162- imagenet-worker-0 ClusterIP None <none> 10291/TCP 34s
163- imagenet-worker-1 ClusterIP None <none> 10291/TCP 34s
164165 ```
165166
1661671. You can scale the number of nodes by adjusting
@@ -173,57 +174,25 @@ kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job
173174 increments of `nproc_per_node` trainers. In our case, `--nproc_per_node=1`.
174175 For better performance, consider using an instance with multiple
175176 GPUs and setting `--nproc_per_node=$NUM_CUDA_DEVICES`.
176-
177+
178+ > **WARNING** the name of the job is used as `rdzv_id`, which
179+ uniquely identifies a job run instance. Hence, to run multiple parallel
180+ jobs with the same spec you need to change `.metadata.name` to
181+ give each one a unique run id (e.g. `imagenet-run-0`; Kubernetes object
182+ names may not contain underscores). Otherwise the new nodes
183+ will attempt to join the membership of a different run.
183+
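The sizing rules from the steps above can be sketched in a few lines of Python. This is only an illustration: `validate_replicas` and `total_trainers` are hypothetical helper names, not part of torchelastic or the controller, and the check assumes the controller expects `Worker.replicas` to lie between `minReplicas` and `maxReplicas`.

```python
def validate_replicas(replicas: int, min_replicas: int, max_replicas: int) -> None:
    """Raise ValueError unless min_replicas <= replicas <= max_replicas.

    Assumption: Worker.replicas should stay within the elastic bounds
    declared by minReplicas and maxReplicas in the job spec.
    """
    if not (min_replicas <= replicas <= max_replicas):
        raise ValueError(
            f"replicas={replicas} is outside [{min_replicas}, {max_replicas}]"
        )


def total_trainers(replicas: int, nproc_per_node: int) -> int:
    """Each Kubernetes Worker (a node) runs nproc_per_node trainer processes."""
    return replicas * nproc_per_node


# Two workers with --nproc_per_node=1, as in the walkthrough.
validate_replicas(replicas=2, min_replicas=1, max_replicas=5)
print(total_trainers(replicas=2, nproc_per_node=1))
```

Adding or removing one `Worker` changes the trainer count by exactly `nproc_per_node`, which is why scaling happens in increments of that value.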
184+
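Because the job name doubles as the `rdzv_id`, parallel runs of the same spec each need a distinct name. A minimal sketch of generating one follows; `unique_job_name` is a hypothetical helper, and the random 8-character suffix is an arbitrary choice for illustration. The validity check assumes the name should be a DNS-1123 label (lowercase alphanumerics and `-`, at most 63 characters), which is the stricter of the naming rules Kubernetes applies to object names.

```python
import re
import uuid
from typing import Optional

# DNS-1123 label: lowercase alphanumerics or '-', starting and ending
# with an alphanumeric character, at most 63 characters long.
_DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
_MAX_LABEL_LEN = 63


def unique_job_name(base: str, suffix: Optional[str] = None) -> str:
    """Return '<base>-<suffix>' normalized to a DNS-1123 label.

    If no suffix is given, a random 8-character hex string is used so
    that repeated calls yield different names (and thus different rdzv_ids).
    """
    if suffix is None:
        suffix = uuid.uuid4().hex[:8]
    # Underscores are not allowed in Kubernetes object names.
    name = f"{base}-{suffix}".lower().replace("_", "-")
    if len(name) > _MAX_LABEL_LEN or not _DNS_LABEL.match(name):
        raise ValueError(f"{name!r} is not a valid DNS-1123 label")
    return name


print(unique_job_name("imagenet", "run-0"))
print(unique_job_name("imagenet"))
```

The resulting string would be written to the `name` field under `metadata` in the job spec before running `kubectl apply`.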
177185### Monitoring jobs
178186
179187You can describe the job to check job status and job related events.
180188In the following example, the `imagenet` job is created in the `elastic-job` namespace; substitute your own job name and namespace in the command.
181189
182190```
183- kubectl describe elasticjob imagenet -n elastic-job
191+ $ kubectl describe elasticjob imagenet -n elastic-job
184192
185193Name: imagenet
186194Namespace: elastic-job
187- Labels: <none>
188- Annotations: kubectl.kubernetes.io/last-applied-configuration:
189- {"apiVersion":"elastic.pytorch.org/v1alpha1","kind":"ElasticJob","metadata":{"annotations":{},"name":"imagenet","namespace":"elastic-job"}...
190- API Version: elastic.pytorch.org/v1alpha1
191- Kind: ElasticJob
192- Metadata:
193- Creation Timestamp: 2020-03-19T10:30:55Z
194- Generation: 5
195- Resource Version: 2110451
196- Self Link: /apis/elastic.pytorch.org/v1alpha1/namespaces/elastic-job/elasticjobs/imagenet
197- UID: b6f6b7ae-69cc-11ea-b995-0653198c16be
198- Spec:
199- Run Policy:
200- Max Replicas: 5
201- Min Replicas: 1
202- Rdzv Endpoint: etcd-service:2379
203- Replica Specs:
204- Worker:
205- Replicas: 2
206- Restart Policy: ExitCode
207- Template:
208- Metadata:
209- Creation Timestamp: <nil>
210- Spec:
211- Containers:
212- Args:
213- /workspace/examples/imagenet/main.py
214- --input_path
215- /data/tiny-imagenet-200/train
216- --epochs
217- 10
218- Image: seedjeffwan/examples:0.1.0rc1
219- Image Pull Policy: Always
220- Name: elasticjob-worker
221- Ports:
222- Container Port: 10291
223- Name: elasticjob-port
224- Resources:
225- Limits:
226- nvidia.com/gpu: 1
195+ <... OMITTED ...>
227196Status:
228197 Conditions:
229198 Last Transition Time: 2020-03-19T10:30:55Z
@@ -232,21 +201,36 @@ Status:
232201 Reason: ElasticJobRunning
233202 Status: True
234203 Type: Running
235- Replica Statuses:
236- Worker:
237- Active: 3
204+ <... OMITTED ...>
238205Events:
239206 Type Reason Age From Message
240207 ---- ------ ---- ---- -------
241208 Normal SuccessfulCreatePod 13s elastic-job-controller Created pod: imagenet-worker-0
242- Normal SuccessfulCreatePod 13s elastic-job-controller Created pod: imagenet-worker-1
243- Normal SuccessfulCreatePod 13s elastic-job-controller Created pod: imagenet-worker-2
244- Normal SuccessfulCreateService 13s elastic-job-controller Created service: imagenet-worker-0
245- Normal SuccessfulCreateService 13s elastic-job-controller Created service: imagenet-worker-1
246- Normal SuccessfulCreateService 13s elastic-job-controller Created service: imagenet-worker-2
209+ ```
247210
211+ Tail the logs of a worker:
212+
213+ ```
214+ $ kubectl logs -f -n elastic-job imagenet-worker-0
248215```
249216
217+ ### Next Steps
218+
219+ We have included other sample job specs in the `config/samples` directory
220+ (e.g. `config/samples/classy_vision.yaml`). Try them out by
221+ replacing `imagenet.yaml` with the appropriate spec filename in the
222+ instructions above.
223+
224+ To use your own script, build a docker image containing your script.
225+ You can use `torchelastic/examples` as your base image. Then point your
226+ job specs to use your container by editing `Worker.template.spec.containers.image`.
227+
228+ Our examples save checkpoints and models inside the container, so the trained
229+ model and checkpoints are not accessible after the job is complete. In your
230+ scripts, use a persistent store such as AWS S3 or Azure Blob Storage.
231+
232+
233+
250234### Trouble Shooting
251235
252236Please check [TROUBLESHOOTING.md](./TROUBLESHOOTING.md)