
Commit 181fc83

Akado2009 authored and k8s-ci-robot committed
Repository structure updated (#81)
* repo structure fix + cifar fix
* Typo fix
* typos fixed
* added to ddp

1 parent 114a507 · commit 181fc83

11 files changed: +321 −249 lines

examples/README.md

Lines changed: 30 additions & 0 deletions
## Installation & deployment tips

1. You need to configure your node to utilize the GPU. This can be done the following way:

   * Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
   * Connect to your MasterNode and set nvidia as the default runtime in `/etc/docker/daemon.json`:

   ```
   {
       "default-runtime": "nvidia",
       "runtimes": {
           "nvidia": {
               "path": "/usr/bin/nvidia-container-runtime",
               "runtimeArgs": []
           }
       }
   }
   ```

   * After that, deploy the NVIDIA device plugin DaemonSet to Kubernetes:

   ```bash
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
   ```
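   Once the plugin is running, you can confirm that Kubernetes sees the GPUs. This is a quick sanity check; the node name below is a placeholder for one of your GPU nodes:

   ```bash
   # A GPU node should report a non-zero nvidia.com/gpu capacity
   kubectl describe node <your-node-name> | grep nvidia.com/gpu
   ```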
2. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name `nvidia.com/gpu`:

   ```
   resources:
     limits:
       nvidia.com/gpu: 2 # requesting 2 GPUs
   ```
3. Building an image. Each example has prebuilt images stored on Google Container Registry (GCR). If you want to create your own image, we recommend using Docker Hub. Each example has its own Dockerfile, which we strongly advise using. To build a custom image, follow the instructions on [TechRepublic](https://www.techrepublic.com/article/how-to-create-a-docker-image-and-push-it-to-docker-hub/).

4. To deploy your job, we recommend following the official [kubeflow documentation](https://www.kubeflow.org/docs/guides/components/pytorch/). Each example includes example YAML files for the two versions of the API; feel free to modify them, e.g. the image or the number of GPUs. A minimal sketch of such a job manifest is shown below.
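The following is a minimal sketch of a PyTorchJob manifest, assuming the `kubeflow.org/v1beta2` API; the job name and image are placeholders, so refer to the YAML files shipped with each example for the exact fields:

```yaml
apiVersion: kubeflow.org/v1beta2   # older examples may target v1beta1 instead
kind: PyTorchJob
metadata:
  name: pytorch-dist-example       # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-image>      # e.g. an image built from the example's Dockerfile
              resources:
                limits:
                  nvidia.com/gpu: 1    # drop this for CPU-only runs
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-image>
              resources:
                limits:
                  nvidia.com/gpu: 1
```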

examples/ddp/mnist/cpu/README.md

Lines changed: 10 additions & 0 deletions
# Tips

1. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile (an illustrative Dockerfile is sketched below):

   ```
   mpirun -n <number_of_copies>
   ```

**Note.** Each copy will utilise 1 CPU. You can bind each process to a CPU using `-cpu-slot`. For more reference, visit the [mpirun documentation](https://www.open-mpi.org/doc/v3.0/man1/mpirun.1.php).
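To make the tip concrete, a Dockerfile for such an example might end with an entrypoint like the one below. The base image, script path, and copy count are illustrative rather than the exact contents of this example's Dockerfile, and the base image must have Open MPI installed:

```dockerfile
FROM pytorch/pytorch:latest

ADD . /opt/mnist_ddp

# Spawn 4 copies of the training script via MPI; adjust -n to change the copy count
ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "/opt/mnist_ddp/mnist.py"]
```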

examples/ddp/mnist/gpu/README.md

Lines changed: 18 additions & 0 deletions
# Tips

1. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name `nvidia.com/gpu`:

   ```
   resources:
     limits:
       nvidia.com/gpu: 2 # requesting 2 GPUs
   ```

   **Keep in mind!** The number of GPUs used by the workers and the master should be less than or equal to the number of GPUs available on your cluster/system. If you have fewer, we recommend reducing the number of workers, or using the master only (in case you have 1 GPU). A command for checking the available total is sketched below.

2. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile:

   ```
   mpirun -n <number_of_copies>
   ```

**Note.** Each copy will utilise 1 GPU.
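To see how many GPUs the cluster can actually schedule, you can list the allocatable count per node; this query format follows the NVIDIA device plugin docs, but treat it as a sketch:

```bash
# Allocatable GPUs per node; their sum bounds master + worker GPU requests
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```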

examples/mpi-dist/mnist/cpu/README.md

Lines changed: 10 additions & 0 deletions
# Tips

1. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile (a local smoke test of the built image is sketched below):

   ```
   mpirun -n <number_of_copies>
   ```

**Note.** Each copy will utilise 1 CPU. You can bind each process to a CPU using `-cpu-slot`. For more reference, visit the [mpirun documentation](https://www.open-mpi.org/doc/v3.0/man1/mpirun.1.php).
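If you build this example's image yourself, you can smoke-test the MPI launch locally before deploying to the cluster; the tag below is a placeholder for whatever you name the build:

```bash
# Build the example image and run it once locally
docker build -t <your-tag> .
docker run --rm <your-tag>
```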

examples/mpi-dist/mnist/gpu/README.md

Lines changed: 16 additions & 25 deletions
````diff
@@ -1,27 +1,18 @@
-## Installation tips
-1. You need to configure your node to utilize GPU. In order this can be done the following way:
-* Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
-* Connect to your MasterNode and set nvidia as the default run in `/etc/docker/daymon.json`:
-```
-{
-    "default-runtime": "nvidia",
-    "runtimes": {
-        "nvidia": {
-            "path": "/usr/bin/nvidia-container-runtime",
-            "runtimeArgs": []
-        }
-    }
-}
-```
-* After that deploy nvidia-daemon to kubernetes:
-```bash
-kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
-```
-
-2. NVIDIA GPUs can now be consumed via container level resource requirements using the resource name nvidia.com/gpu:
-```
-
+# Tips
+
+1. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name nvidia.com/gpu:
+```
 resources:
   limits:
-    nvidia.com/gpu: 2 # requesting 2 GPUs
-
+    nvidia.com/gpu: 2 # requesting 2 GPUs
+```
+**Keep in mind!** The number of GPUs used by the workers and the master should be less than or equal to the number of GPUs available on your cluster/system. If you have fewer, we recommend reducing the number of workers, or using the master only (in case you have 1 GPU).
+
+2. MPI can spawn a different number of copies. This is controlled by `mpirun -n` inside the Dockerfile:
+```
+mpirun -n <number_of_copies>
+```
+
+**Note.** Each copy will utilise 1 GPU.
````
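Concretely, the GPU request above goes into both the Master and the Worker replica specs of the job manifest. A trimmed, illustrative fragment follows (field names follow the PyTorchJob API; the image is a placeholder); it requests 3 GPUs in total, so it needs a cluster with at least that many:

```yaml
pytorchReplicaSpecs:
  Master:
    replicas: 1
    template:
      spec:
        containers:
          - name: pytorch
            image: <your-image>
            resources:
              limits:
                nvidia.com/gpu: 1   # 1 GPU for the master
  Worker:
    replicas: 2
    template:
      spec:
        containers:
          - name: pytorch
            image: <your-image>
            resources:
              limits:
                nvidia.com/gpu: 1   # 1 GPU per worker, 3 GPUs in total with the master
```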

examples/tcp-dist/cifar10/Dockerfile

Lines changed: 4 additions & 0 deletions
```dockerfile
FROM pytorch/pytorch:0.4_cuda9_cudnn7

ADD . /opt/pytorch_dist_cifar
ENTRYPOINT ["python", "/opt/pytorch_dist_cifar/dist_cifar.py"]
```

examples/tcp-dist/cifar10/README.md

Lines changed: 13 additions & 40 deletions
````diff
@@ -1,44 +1,17 @@
-## PyTorch distributed examples
+### Distributed CIFAR model for e2e test
 
-Here are examples of jobs that use the operator.
+This folder contains the Dockerfile and the distributed CIFAR model for the e2e test.
 
-1. An example of distributed CIFAR10 with pytorch on kubernetes:
-```
-kubectl apply -f cifar10/
-```
+**Build Image**
 
-For faster execution, pre-download the dataset to each of your cluster nodes and edit the
-cifar10/pytorchjob_cifar.yaml file to include the below "predownload" entries in the spec containers.
-The extra entries will need to be present for both MASTER and WORKER replica types.
-```
-spec:
-  containers:
-  - image: pytorch/pytorch:latest
-    imagePullPolicy: IfNotPresent
-    name: pytorch
-    volumeMounts:
-    - name: training-result
-      mountPath: /tmp/result
-    - name: entrypoint
-      mountPath: /tmp/entrypoint
-    - name: predownload                              <- Add this line
-      mountpath: /data                               <- Add this line
-    command: [/tmp/entrypoint/dist_train_cifar.py]
-    restartPolicy: OnFailure
-  volumes:
-  - name: training-result
-    emptyDir: {}
-  - name: entrypoint
-    configMap:
-      name: dist-train-cifar
-      defaultMode: 0755
-  - name: predownload                                <- Add this line
-    hostPath:                                        <- Add this line
-      path: [absolute_path_to_predownloaded_data]    <- Add this line and path
-  restartPolicy: OnFailure
-```
+The default image name and tag is `kubeflow/pytorch-dist-cifar-test:1.0`.
 
-2. A simple example of distributed MNIST with pytorch on kubernetes:
-```
-kubectl apply -f mnist/
-```
+```shell
+docker build -f Dockerfile -t kubeflow/pytorch-dist-cifar-test:1.0 ./
+```
+
+**Create the CIFAR PyTorch job**
+
+```
+kubectl create -f ./pytorch_job_cifar.yaml
+```
````
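After creating the job, you can watch it start and follow its logs. This is a sketch that assumes the PyTorchJob CRD is installed; the pod name is a placeholder:

```bash
# Check the job, find its pods, then follow the master's output
kubectl get pytorchjobs
kubectl get pods
kubectl logs -f <master-pod-name>
```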

examples/tcp-dist/cifar10/configmap_cifar.yaml

Lines changed: 0 additions & 184 deletions
This file was deleted.
