
Commit cce8a28

author
Gregor Gebhardt
committed
wrote proper README and changed version number.
1 parent 00f0d3b commit cce8a28

File tree: 2 files changed, +207 −3 lines changed

README.md

Lines changed: 206 additions & 2 deletions
# ClusterWork

A framework to easily deploy experiments on a computing cluster with MPI.
ClusterWork is based on the [Python Experiment Suite](https://github.com/rueckstiess/expsuite) by Thomas Rückstiess and uses the [job_stream](https://wwoods.github.io/job_stream/) package to distribute the work.
## Installation

0. Creating a virtual environment with [virtualenv](https://virtualenv.pypa.io/en/stable/) or [conda](https://conda.io/miniconda.html) is recommended. This virtual environment has to be activated before the installation.
1. Install the required packages:
    1. job_stream requires [boost](http://www.boost.org/) (filesystem, mpi, python, regex, serialization, system, thread) and an MPI implementation (e.g., [OpenMPI](http://www.open-mpi.org/)):
        ```sh
        sudo apt-get install libboost-dev libopenmpi-dev
        ```
    2. ClusterWork requires the Python packages PyYAML, job_stream, and pandas:
        ```sh
        pip install PyYAML job_stream pandas
        ```
2. Clone this repository and install it:
    ```sh
    git clone https://github.com/gregorgebhardt/cluster_work cluster_work
    cd cluster_work
    pip install .
    ```
## Get your code on the computing cluster

Running your code on the computing cluster is a simple task that currently requires the following three steps:

1. Write a Python class that inherits from `ClusterWork` and implements at least the methods `reset(self, config: dict, rep: int)` and `iterate(self, config: dict, rep: int, n: int)`.
2. Write a simple YAML file to configure your experiment.
3. Adapt a shell script that starts the experiment on your cluster.
### Subclassing `ClusterWork`

```Python
from cluster_work import ClusterWork


class MyExperiment(ClusterWork):
    # ...

    def reset(self, config=None, rep=0):
        # run code that sets up your experiment for each repetition here
        pass

    def iterate(self, config=None, rep=0, n=0):
        # run your experiment for iteration n
        # return the results as a dictionary, each key becomes one column in a results table
        pass


# to run the experiments, you simply call run on your derived class
if __name__ == '__main__':
    MyExperiment.run()
```
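Conceptually, the framework calls `reset` once per repetition and then `iterate` for each iteration, collecting the returned dictionaries into a results table. The following is a minimal sketch of that sequential loop, with a hypothetical `ToyExperiment` stand-in; it is not ClusterWork's actual implementation:

```python
def run_sequential(experiment, config, repetitions, iterations):
    """Sketch of the repetition/iteration loop that the framework runs."""
    rows = []
    for rep in range(repetitions):
        experiment.reset(config, rep)
        for n in range(iterations):
            result = experiment.iterate(config, rep, n)  # dict: one column per key
            rows.append({'rep': rep, 'iter': n, **result})
    return rows


class ToyExperiment:
    # hypothetical stand-in for a ClusterWork subclass
    def reset(self, config=None, rep=0):
        self.value = rep

    def iterate(self, config=None, rep=0, n=0):
        self.value += 1
        return {'value': self.value}


rows = run_sequential(ToyExperiment(), {}, repetitions=2, iterations=3)
# rows[0] == {'rep': 0, 'iter': 0, 'value': 1}
```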
#### Restarting your experiments

ClusterWork also implements a restart functionality. Since your results are stored after each iteration, your experiment can be restarted if its execution was interrupted for some reason. To obtain this functionality, you additionally need to implement at least the method `restore_state(self, config: dict, rep: int, n: int)`. The method `save_state(self, config: dict, rep: int, n: int)` can be implemented to store additional information that needs to be loaded in the `restore_state` method. Finally, the flag `_restore_supported` must be set to `True`.

```Python
class MyExperiment(ClusterWork):
    _restore_supported = True

    # ...

    def save_state(self, config: dict, rep: int, n: int):
        # save all the necessary information for restoring the state later
        pass

    def restore_state(self, config: dict, rep: int, n: int):
        # load or reconstruct the state at repetition rep and iteration n
        pass
```
#### Default parameters

The parameters for the experiment can be defined in a YAML file that is passed as a command-line argument. Inside the derived class, we can define default parameters as a dictionary in the `_default_params` field:

```Python
class MyExperiment(ClusterWork):
    # ...

    _default_params = {
        # ...
        'num_episodes': 100,
        'num_test_episodes': 10,
        'num_eval_episodes': 10,
        'num_steps': 30,
        'num_observations': 30,

        'optimizer_options': {
            'maxiter': 100
        },
        # ...
    }

    # ...
```
### The Configuration YAML

To configure the execution of the experiment, we need to write a small YAML file. The YAML file consists of several documents, which are separated by a line containing `---`. Optionally, the first document can be made a default by setting the key `name` to `"DEFAULT"`. This default document then forms the basis for all following experiment documents. Besides the optional default document, each document represents an experiment. However, experiments can be expanded by the _list_ __or__ _grid_ feature, which is explained below.

The required keys for each experiment are `name`, `repetitions`, `iterations`, and `path`. The parameters found below the key `params` overwrite the default parameters defined in the experiment class. Since the `config` dictionary that is passed to the methods of the `ClusterWork` subclass is the full configuration generated from the YAML file and the default parameters, additional keys can be used.

```YAML
---
# default document denoted by the name "DEFAULT"
name: "DEFAULT"
repetitions: 20
iterations: 5
# this is the path where the results are stored,
# it can be different from the location of the experiment scripts
path: "path/to/experiment/folder"

params:
    num_episodes: 150
    optimizer_options:
        maxiter: 50
---
# 1. experiment
name: "more_test_episodes"
params:
    num_test_episodes: 50
---
# 2. experiment
name: "more_steps"
params:
    num_steps: 50
```
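The way the YAML parameters overwrite the class defaults can be pictured as a recursive dictionary merge, where nested dictionaries such as `optimizer_options` are merged key by key. The following is a conceptual sketch with hypothetical values, not ClusterWork's actual implementation:

```python
def deep_merge(base: dict, update: dict) -> dict:
    """Recursively merge update into a copy of base (update wins on conflicts)."""
    merged = dict(base)
    for key, value in update.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# hypothetical values following the examples above
default_params = {'num_episodes': 100, 'num_steps': 30,
                  'optimizer_options': {'maxiter': 100}}
yaml_params = {'num_episodes': 150, 'optimizer_options': {'maxiter': 50}}

config_params = deep_merge(default_params, yaml_params)
# num_episodes and maxiter come from the YAML, num_steps stays at its default
```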
#### The list feature

If the key `list` is given in an experiment document, the experiment is expanded for each value in the given list. For example,

```YAML
# ...
---
name: "test_parameter_a"
params:
    b: 5
list:
    a: [5, 10, 20, 30]
```
creates four experiments, one for each value of `a`. It is also possible to define multiple parameters below the `list` key:

```YAML
# ...
---
name: "test_parameter_a_and_c"
params:
    b: 5
list:
    a: [5, 10, 20, 30]
    c: [1, 2, 3, 4, 5]
```

In this case, the experiment is run for the four parameter combinations `{a: 5, c: 1}`, `{a: 10, c: 2}`, ..., `{a: 30, c: 4}`. Since the list for `a` is shorter than the list for `c`, the remaining values for `c` are ignored.
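This joint expansion behaves like Python's `zip`, which also stops at the shortest sequence. A conceptual sketch, not ClusterWork's code:

```python
list_params = {'a': [5, 10, 20, 30], 'c': [1, 2, 3, 4, 5]}

# zip stops at the shortest list, so the fifth value of c is ignored
keys = list(list_params)
combinations = [dict(zip(keys, values)) for values in zip(*list_params.values())]
# combinations == [{'a': 5, 'c': 1}, {'a': 10, 'c': 2},
#                  {'a': 20, 'c': 3}, {'a': 30, 'c': 4}]
```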
#### The grid feature

The `grid` feature is similar to the `list` feature; however, instead of iterating over all lists jointly, it spans a grid with all the given values. For example,

```YAML
# ...
---
name: "test_parameters_foo_and_bar"
params:
    a: 5
grid:
    foo: [5, 10, 20, 30]
    bar: [1, 2, 3, 4, 5]
```

would run an experiment for each combination of `foo` and `bar`, in this case 4x5=20 experiments. Note that with more parameters below the `grid` key, the number of experiments grows exponentially.
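The grid expansion corresponds to a Cartesian product of the value lists, as `itertools.product` computes it. Again a conceptual sketch, not ClusterWork's code:

```python
from itertools import product

grid_params = {'foo': [5, 10, 20, 30], 'bar': [1, 2, 3, 4, 5]}

# the Cartesian product spans the full grid of parameter combinations
keys = list(grid_params)
combinations = [dict(zip(keys, values)) for values in product(*grid_params.values())]
# 4 * 5 = 20 experiments, from {'foo': 5, 'bar': 1} to {'foo': 30, 'bar': 5}
```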
### Run the experiment

To run your experiment, you can simply execute your Python script:

```sh
python YOUR_SCRIPT.py YOUR_CONFIG.yml [arguments]
```

The following arguments are available:

+ `-c, --cluster` runs the experiments using the job_stream scheduler. By default, the experiments are executed sequentially in a loop.
+ `-d, --delete` deletes old results before running the experiments.
+ `-e, --experiments [experiments]` chooses the experiments that should run; by default, all experiments will run.
+ `-v, --verbose` shows more output.
+ `-p, --progress` displays only the progress of running or finished experiments.
#### Hostfile for OpenMPI

Depending on the configuration of your cluster, `job_stream` might not know about the available computing units in the cluster. In this case, it is possible to pass a hostfile that specifies the hostnames and available CPUs to job_stream before starting the experiment script:

```sh
job_stream --hostfile HOSTFILE -- python YOUR_SCRIPT.py YOUR_CONFIG.yml [arguments]
```

On a SLURM-based cluster, this hostfile can be created with

```sh
srun hostname > hostfile.$SLURM_JOB_ID
hostfileconv hostfile.$SLURM_JOB_ID
```

where `hostfileconv` is a tool provided by ClusterWork that makes sure the hostfile has the right format. In this case, the `HOSTFILE` argument for job_stream would be `hostfile.$SLURM_JOB_ID.converted`.

setup.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -15,7 +15,7 @@
 # Versions should comply with PEP440. For a discussion on single-sourcing
 # the version across setup.py and the project code, see
 # https://packaging.python.org/en/latest/single_source_version.html
-version='0.2.1',
+version='0.2.2',

 description='A framework to run experiments on an computing cluster.',
 long_description=long_description,
```
