
Commit 181fc83

Akado2009 authored and k8s-ci-robot committed
Repository structure updated (#81)
* repo structure fix + cifar fix
* Typo fix
* typos fixed
* added to ddp

1 parent 114a507 · commit 181fc83

11 files changed: +321 −249 lines

examples/README.md

Lines changed: 30 additions & 0 deletions
## Installation & deployment tips

1. You need to configure your node to utilize the GPU. This can be done the following way:

   * Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
   * Connect to your MasterNode and set nvidia as the default runtime in `/etc/docker/daemon.json`:

   ```
   {
       "default-runtime": "nvidia",
       "runtimes": {
           "nvidia": {
               "path": "/usr/bin/nvidia-container-runtime",
               "runtimeArgs": []
           }
       }
   }
   ```

   * After that, deploy the NVIDIA device plugin DaemonSet to Kubernetes:

   ```bash
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
   ```
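   Once the plugin is running, you can confirm that Kubernetes sees the GPUs. This is a quick sanity check; the node name below is a placeholder for one of your GPU nodes:

   ```bash
   # A GPU node should report a non-zero nvidia.com/gpu capacity
   kubectl describe node <your-node-name> | grep nvidia.com/gpu
   ```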
2. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name `nvidia.com/gpu`:

   ```
   resources:
     limits:
       nvidia.com/gpu: 2 # requesting 2 GPUs
   ```
3. Building an image. Each example has prebuilt images stored on Google Container Registry (GCR). If you want to create your own image, we recommend using Docker Hub. Each example has its own Dockerfile, which we strongly advise using. To build a custom image, follow the instructions on [TechRepublic](https://www.techrepublic.com/article/how-to-create-a-docker-image-and-push-it-to-docker-hub/).

4. To deploy your job, we recommend following the official [kubeflow documentation](https://www.kubeflow.org/docs/guides/components/pytorch/). Each example includes example YAML files for the two versions of the API; feel free to modify them, e.g. the image or the number of GPUs. A minimal sketch of such a job manifest is shown below.
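The following is a minimal sketch of a PyTorchJob manifest, assuming the `kubeflow.org/v1beta2` API; the job name and image are placeholders, so refer to the YAML files shipped with each example for the exact fields:

```yaml
apiVersion: kubeflow.org/v1beta2   # older examples may target v1beta1 instead
kind: PyTorchJob
metadata:
  name: pytorch-dist-example       # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-image>      # e.g. an image built from the example's Dockerfile
              resources:
                limits:
                  nvidia.com/gpu: 1    # drop this for CPU-only runs
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-image>
              resources:
                limits:
                  nvidia.com/gpu: 1
```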

examples/ddp/mnist/cpu/README.md

Lines changed: 10 additions & 0 deletions
# Tips

1. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile (an illustrative Dockerfile is sketched below):

   ```
   mpirun -n <number_of_copies>
   ```

**Note.** Each copy will utilise 1 CPU. You can bind each process to a CPU using `-cpu-slot`. For more reference, visit the [mpirun documentation](https://www.open-mpi.org/doc/v3.0/man1/mpirun.1.php).
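To make the tip concrete, a Dockerfile for such an example might end with an entrypoint like the one below. The base image, script path, and copy count are illustrative rather than the exact contents of this example's Dockerfile, and the base image must have Open MPI installed:

```dockerfile
FROM pytorch/pytorch:latest

ADD . /opt/mnist_ddp

# Spawn 4 copies of the training script via MPI; adjust -n to change the copy count
ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "/opt/mnist_ddp/mnist.py"]
```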

examples/ddp/mnist/gpu/README.md

Lines changed: 18 additions & 0 deletions
# Tips

1. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name `nvidia.com/gpu`:

   ```
   resources:
     limits:
       nvidia.com/gpu: 2 # requesting 2 GPUs
   ```

   **Keep in mind!** The number of GPUs used by the workers and the master should be less than or equal to the number of GPUs available on your cluster/system. If you have fewer, we recommend reducing the number of workers, or using the master only (in case you have 1 GPU). A command for checking the available total is sketched below.

2. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile:

   ```
   mpirun -n <number_of_copies>
   ```

**Note.** Each copy will utilise 1 GPU.
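To see how many GPUs the cluster can actually schedule, you can list the allocatable count per node; this query format follows the NVIDIA device plugin docs, but treat it as a sketch:

```bash
# Allocatable GPUs per node; their sum bounds master + worker GPU requests
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```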

examples/mpi-dist/mnist/cpu/README.md

Lines changed: 10 additions & 0 deletions
# Tips

1. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile (a local smoke test of the built image is sketched below):

   ```
   mpirun -n <number_of_copies>
   ```

**Note.** Each copy will utilise 1 CPU. You can bind each process to a CPU using `-cpu-slot`. For more reference, visit the [mpirun documentation](https://www.open-mpi.org/doc/v3.0/man1/mpirun.1.php).
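If you build this example's image yourself, you can smoke-test the MPI launch locally before deploying to the cluster; the tag below is a placeholder for whatever you name the build:

```bash
# Build the example image and run it once locally
docker build -t <your-tag> .
docker run --rm <your-tag>
```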

examples/mpi-dist/mnist/gpu/README.md

Lines changed: 16 additions & 25 deletions
````diff
@@ -1,27 +1,18 @@
-## Installation tips
-1. You need to configure your node to utilize GPU. In order this can be done the following way:
-* Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
-* Connect to your MasterNode and set nvidia as the default run in `/etc/docker/daymon.json`:
-```
-{
-    "default-runtime": "nvidia",
-    "runtimes": {
-        "nvidia": {
-            "path": "/usr/bin/nvidia-container-runtime",
-            "runtimeArgs": []
-        }
-    }
-}
-```
-* After that deploy nvidia-daemon to kubernetes:
-```bash
-kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
-```
-
-2. NVIDIA GPUs can now be consumed via container level resource requirements using the resource name nvidia.com/gpu:
-```
-
+# Tips
+
+1. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name nvidia.com/gpu:
+```
 resources:
   limits:
-    nvidia.com/gpu: 2 # requesting 2 GPUs
-
+    nvidia.com/gpu: 2 # requesting 2 GPUs
+```
+**Keep in mind!** The number of GPUs used by the workers and the master should be less than or equal to the number of GPUs available on your cluster/system. If you have fewer, we recommend reducing the number of workers, or using the master only (in case you have 1 GPU).
+
+2. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile:
+```
+mpirun -n <number_of_copies>
+```
+
+**Note.** Each copy will utilise 1 GPU.
````
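Concretely, the GPU request above goes into both the Master and the Worker replica specs of the job manifest. A trimmed, illustrative fragment follows (field names follow the PyTorchJob API; the image is a placeholder); it requests 3 GPUs in total, so it needs a cluster with at least that many:

```yaml
pytorchReplicaSpecs:
  Master:
    replicas: 1
    template:
      spec:
        containers:
          - name: pytorch
            image: <your-image>
            resources:
              limits:
                nvidia.com/gpu: 1   # 1 GPU for the master
  Worker:
    replicas: 2
    template:
      spec:
        containers:
          - name: pytorch
            image: <your-image>
            resources:
              limits:
                nvidia.com/gpu: 1   # 1 GPU per worker, 3 GPUs in total with the master
```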

examples/tcp-dist/cifar10/Dockerfile

Lines changed: 4 additions & 0 deletions
```dockerfile
FROM pytorch/pytorch:0.4_cuda9_cudnn7

ADD . /opt/pytorch_dist_cifar
ENTRYPOINT ["python", "/opt/pytorch_dist_cifar/dist_cifar.py"]
```

examples/tcp-dist/cifar10/README.md

Lines changed: 13 additions & 40 deletions
````diff
@@ -1,44 +1,17 @@
-## PyTorch distributed examples
+### Distributed CIFAR model for e2e test
 
-Here are examples of jobs that use the operator.
+This folder contains the Dockerfile and the distributed CIFAR model for the e2e test.
 
-1. An example of distributed CIFAR10 with pytorch on kubernetes:
-```
-kubectl apply -f cifar10/
-```
+**Build Image**
 
-For faster execution, pre-download the dataset to each of your cluster nodes and edit the
-cifar10/pytorchjob_cifar.yaml file to include the below "predownload" entries in the spec containers.
-The extra entries will need to be present for both MASTER and WORKER replica types.
-```
-spec:
-  containers:
-  - image: pytorch/pytorch:latest
-    imagePullPolicy: IfNotPresent
-    name: pytorch
-    volumeMounts:
-    - name: training-result
-      mountPath: /tmp/result
-    - name: entrypoint
-      mountPath: /tmp/entrypoint
-    - name: predownload                              <- Add this line
-      mountpath: /data                               <- Add this line
-    command: [/tmp/entrypoint/dist_train_cifar.py]
-    restartPolicy: OnFailure
-  volumes:
-  - name: training-result
-    emptyDir: {}
-  - name: entrypoint
-    configMap:
-      name: dist-train-cifar
-      defaultMode: 0755
-  - name: predownload                                <- Add this line
-    hostPath:                                        <- Add this line
-      path: [absolute_path_to_predownloaded_data]    <- Add this line and path
-  restartPolicy: OnFailure
-```
+The default image name and tag is `kubeflow/pytorch-dist-cifar-test:1.0`.
 
-2. A simple example of distributed MNIST with pytorch on kubernetes:
-```
-kubectl apply -f mnist/
-```
+```shell
+docker build -f Dockerfile -t kubeflow/pytorch-dist-cifar-test:1.0 ./
+```
+
+**Create the CIFAR PyTorch job**
+
+```
+kubectl create -f ./pytorch_job_cifar.yaml
+```
````
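After creating the job, you can watch it start and follow its logs. This is a sketch that assumes the PyTorchJob CRD is installed; the pod name is a placeholder:

```bash
# Check the job, find its pods, then follow the master's output
kubectl get pytorchjobs
kubectl get pods
kubectl logs -f <master-pod-name>
```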

examples/tcp-dist/cifar10/configmap_cifar.yaml

Lines changed: 0 additions & 184 deletions
This file was deleted.
