|
1 | 1 | # A minimal elastic agent example |
2 | 2 | In this example, we show how to use the PyTorch Elastic Trainer launcher to start a distributed application in an elastic and fault tolerant manner. The application is intentionally kept "bare bones" since the objective is to show how to create a `torch.distributed.ProcessGroup` instance. Once a `ProcessGroup` is created, you can use any functionality needed from the `torch.distributed` package. |
3 | 3 |
|
4 | | -This application can be run on practically any machine that supports Docker containers and does not require installing additional software or modifying your existing Python environment. |
| 4 | +This application can be run on practically any machine that supports Docker containers and does not require installing additional software or modifying your existing Python environment. |
5 | 5 |
|
6 | | -> The `docker-compose.yml` file is based on the example provided with the [Bitnami ETCD container image](https://hub.docker.com/r/bitnami/etcd/). |
| 6 | +> The `docker-compose.yml` file is based on the example provided with the [Bitnami ETCD container image](https://hub.docker.com/r/bitnami/etcd/). |
7 | 7 |
|
8 | 8 | ## Prerequisites |
9 | 9 | We assume you have a recent version of Docker (version 18.03 or above) and Docker Compose installed on your machine. Verify the version by running |
10 | 10 | ``` |
11 | 11 | docker --version |
12 | | -``` |
| 12 | +``` |
13 | 13 | and |
14 | 14 | ``` |
15 | 15 | docker-compose --version |
16 | 16 | ``` |
17 | 17 | which should print something like |
18 | 18 | ``` |
19 | 19 | Docker version 19.03.8, build afacb8b |
20 | | -``` |
| 20 | +``` |
21 | 21 | and |
22 | 22 | ``` |
23 | 23 | docker-compose version 1.25.4, build 8d51620a |
24 | 24 | ``` |
25 | 25 | respectively. |
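If you would rather script this check, the version comparison can be sketched in Python as below. This is a minimal sketch and not part of the example repo: the helper names `parse_docker_version` and `docker_is_recent_enough` are hypothetical, and the parsing assumes the output has the shape shown above (`... version MAJOR.MINOR...`).

```python
import re

# Minimum Docker version stated in this README: 18.03.
MIN_DOCKER = (18, 3)


def parse_docker_version(output):
    """Extract (major, minor) from text like 'Docker version 19.03.8, build afacb8b'."""
    match = re.search(r"version\s+(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized version string: {output!r}")
    return int(match.group(1)), int(match.group(2))


def docker_is_recent_enough(output, minimum=MIN_DOCKER):
    """Return True when the reported version is at least `minimum`."""
    return parse_docker_version(output) >= minimum


# Sample string taken from the expected output shown above.
print(docker_is_recent_enough("Docker version 19.03.8, build afacb8b"))
```

To check a real installation, pass in the text captured from `docker --version`, for example with `subprocess.run(["docker", "--version"], capture_output=True, text=True).stdout`.
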
26 | 26 | ## Obtaining the example repo |
27 | | -Clone the PyTorch Elastic Trainer Git repo using |
| 27 | +Clone the PyTorch Elastic Trainer Git repo using |
28 | 28 | ``` |
29 | 29 | git clone https://github.com/pytorch/elastic.git |
30 | 30 | ``` |
31 | | -and change directory to the folder containing this example: |
| 31 | +and set an environment variable that points to the cloned `elastic` repo, e.g. |
32 | 32 | ``` |
33 | | -cd elastic/examples/hello_elastic |
| 33 | +export TORCHELASTIC_HOME=~/elastic |
34 | 34 | ``` |
35 | 35 |
|
36 | 36 | # Building the samples Docker container |
37 | 37 | While you can run the rest of this example using a pre-built Docker image, you can also build one for yourself. This is especially useful if you would like to customize the image. To build the image, run: |
38 | 38 | ``` |
39 | | -docker build -t hello_elastic:dev . |
| 39 | +cd $TORCHELASTIC_HOME && docker build -t hello_elastic:dev . |
40 | 40 | ``` |
41 | 41 |
|
42 | | -# Running an existing sample |
43 | | -This example uses `docker-compose` to run two containers: one for the ETCD service and one for the sample application itself. Docker compose takes care of all aspects of establishing the network interfaces so the application container can communicate with the ETCD container. |
| 42 | +# Running an existing sample |
| 43 | +This example uses `docker-compose` to run two containers: one for the ETCD service and one for the sample application itself. Docker compose takes care of all aspects of establishing the network interfaces so the application container can communicate with the ETCD container. |
44 | 44 |
|
45 | 45 | To start the example, run |
46 | 46 | ``` |
47 | | -docker-compose up |
| 47 | +cd $TORCHELASTIC_HOME/examples/multi_container && docker-compose up |
48 | 48 | ``` |
49 | 49 | You should see two sets of outputs, one from ETCD starting up and one from the application itself. The output from the application looks something like this: |
50 | 50 |
|
@@ -113,4 +113,4 @@ In this simple example, we illustrated the following principles when using PyTor |
113 | 113 | 2. How to obtain parameters such as the world size, local rank and the master URL within an application to establish the process group. |
114 | 114 | 3. How to configure parameters for an elastic job such as the number of workers per node and the number of times your application should be restarted in the event of failures. |
115 | 115 |
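Point 2 can be sketched as follows. This is an illustration under stated assumptions, not the code shipped with this example. It assumes the parameters reach the worker through the environment variables used by `torch.distributed`'s `env://` convention (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) plus `LOCAL_RANK`. Whether this example's launcher sets exactly these variables is an assumption, and the helper name `read_process_group_settings` is hypothetical.

```python
import os


def read_process_group_settings(env=os.environ):
    """Collect the values needed to create a process group.

    Assumption: the launcher exports RANK, WORLD_SIZE, LOCAL_RANK,
    MASTER_ADDR and MASTER_PORT to every worker process.
    """
    return {
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        # Master URL in the tcp://host:port form accepted as an init_method.
        "master_url": f"tcp://{env['MASTER_ADDR']}:{env['MASTER_PORT']}",
    }


# Made-up values standing in for what a launcher might export.
example_env = {
    "RANK": "1",
    "WORLD_SIZE": "2",
    "LOCAL_RANK": "0",
    "MASTER_ADDR": "10.0.0.5",
    "MASTER_PORT": "29500",
}
settings = read_process_group_settings(example_env)
print(settings["master_url"])

# Inside a real worker, the values would then be handed to PyTorch:
#   import torch.distributed as dist
#   dist.init_process_group(
#       backend="gloo",
#       init_method=settings["master_url"],
#       rank=settings["rank"],
#       world_size=settings["world_size"],
#   )
```

The `init_process_group` call is left as a comment so the sketch runs without PyTorch installed.
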
|
116 | | -In the next set of samples, we will cover more advanced topics such as checkpointing state in your application and deploying it to an orchestrator such as Kubernetes. |
| 116 | +In the next set of samples, we will cover more advanced topics such as checkpointing state in your application and deploying it to an orchestrator such as Kubernetes. |