This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Commit 3751dcb

aivanou authored and facebook-github-bot committed
Resolve minor readme issues for multi_container example
Summary: Resolve minor readme issues for multi_container example

Reviewed By: drdarshan

Differential Revision: D20924742

fbshipit-source-id: d3f209a03e0b16088f7bc71b7113fe3540b629ae
1 parent f753a98 commit 3751dcb

1 file changed

Lines changed: 12 additions & 12 deletions


examples/multi_container/README.md

Lines changed: 12 additions & 12 deletions
@@ -1,50 +1,50 @@
 # A minimal elastic agent example
 In this example, we show how to use the PyTorch Elastic Trainer launcher to start a distributed application in an elastic and fault tolerant manner. The application is intentionally kept "bare bones" since the objective is to show how to create a `torch.distributed.ProcessGroup` instance. Once a `ProcessGroup` is created, you can use any functionality needed from the `torch.distributed` package.
 
-This application can be run on practically any machine that supports Docker containers and does not require installing additional software or modifying your existing Python environment.
+This application can be run on practically any machine that supports Docker containers and does not require installing additional software or modifying your existing Python environment.
 
-> The `docker-compose.yml` file is based on the example provided with the [Bitnami ETCD container image](https://hub.docker.com/r/bitnami/etcd/).
+> The `docker-compose.yml` file is based on the example provided with the [Bitnami ETCD container image](https://hub.docker.com/r/bitnami/etcd/).
 
 ## Prerequisites
 We assume you have a recent version of Docker (version 18.03 or above) and Docker Compose installed on your machine. Verify the version by running
 ```
 docker --version
-```
+```
 and
 ```
 docker-compose --version
 ```
 which should print something like
 ```
 Docker version 19.03.8, build afacb8b
-```
+```
 and
 ```
 docker-compose version 1.25.4, build 8d51620a
 ```
 respectively.
 ## Obtaining the example repo
-Clone the PyTorch Elastic Trainer Git repo using
+Clone the PyTorch Elastic Trainer Git repo using
 ```
 git clone https://github.com/pytorch/elastic.git
 ```
-and change directory to the folder containing this example:
+make an environment variable that points to the elastic repo, e.g.
 ```
-cd elastic/examples/hello_elastic
+export TORCHELASTIC_HOME=~/elastic
 ```
 
 # Building the samples Docker container
 While you can run the rest of this example using a pre-built Docker image, you can also build one for yourself. This is especially useful if you would like to customize the image. To build the image, run:
 ```
-docker build -t hello_elastic:dev .
+cd $TORCHELASTIC_HOME && docker build -t hello_elastic:dev .
 ```
 
-# Running an existing sample
-This example uses `docker-compose` to run two containers: one for the ETCD service and one for the sample application itself. Docker compose takes care of all aspects of establishing the network interfaces so the application container can communicate with the ETCD container.
+# Running an existing sample
+This example uses `docker-compose` to run two containers: one for the ETCD service and one for the sample application itself. Docker compose takes care of all aspects of establishing the network interfaces so the application container can communicate with the ETCD container.
 
 To start the example, run
 ```
-docker-compose up
+cd $TORCHELASTIC_HOME/examples/multi_container && docker-compose up
 ```
 You should see two sets of outputs, one from ETCD starting up and one from the application itself. The output from the application looks something like this:
 
@@ -113,4 +113,4 @@ In this simple example, we illustrated the following principles when using PyTor
 2. How to obtain parameters such as the world size, local rank and the master URL within an application to establish the process group.
 3. How to configure parameters for an elastic job such as the number of workers per node and the number of times your application should be restarted in the event of failures.
 
-In the next set of samples, we will cover more advanced topics such as checkpointing state in your application and deploying it to an orchestrator such as Kubernetes.
+In the next set of samples, we will cover more advanced topics such as checkpointing state in your application and deploying it to an orchestrator such as Kubernetes.
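
Principle 2 in the README summary (obtaining the world size, local rank and master URL to establish the process group) can be sketched in Python as follows. This is a minimal illustration, not the example's actual application code, which this diff does not show. The `WorkerEnv` class and both helper functions are hypothetical names, and the environment variable names (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) are an assumption about what the launcher exports to each worker. The `gloo` backend is likewise chosen only for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class WorkerEnv:
    """Hypothetical container for the per-worker launch parameters."""

    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int

    @property
    def init_method(self) -> str:
        # Master URL in the tcp:// form accepted by torch.distributed.
        return f"tcp://{self.master_addr}:{self.master_port}"


def read_worker_env(environ: Mapping[str, str] = os.environ) -> WorkerEnv:
    """Read launch parameters from the environment.

    The variable names are assumed, not taken from the README diff.
    A missing variable raises KeyError, which fails fast at startup.
    """
    return WorkerEnv(
        rank=int(environ["RANK"]),
        local_rank=int(environ["LOCAL_RANK"]),
        world_size=int(environ["WORLD_SIZE"]),
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
    )


def create_process_group(env: WorkerEnv, backend: str = "gloo") -> None:
    """Initialize the default process group for this worker."""
    # Imported lazily so the helpers above work without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend=backend,
        init_method=env.init_method,
        rank=env.rank,
        world_size=env.world_size,
    )
```

Inside a worker, calling `create_process_group(read_worker_env())` would then make the rest of the `torch.distributed` package available, as the README describes.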

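
The version requirement from the Prerequisites section (Docker 18.03 or above) could also be checked programmatically. The sketch below parses output in the form shown in the README; `parse_version` and `docker_is_recent_enough` are hypothetical helper names, and the regular expression assumes the version number directly follows the word "version".

```python
import re
from typing import Tuple

# Minimum Docker version stated in the README's Prerequisites section.
MIN_DOCKER = (18, 3)


def parse_version(output: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from `docker --version` style output."""
    match = re.search(r"version\s+v?(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def docker_is_recent_enough(output: str, minimum: Tuple[int, int] = MIN_DOCKER) -> bool:
    """Compare only major and minor against the required minimum."""
    return parse_version(output)[:2] >= minimum
```

To use it against a real installation, the output string can be obtained with `subprocess.run(["docker", "--version"], capture_output=True, text=True).stdout`.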