# ClusterWork

A framework to easily deploy experiments on a computing cluster with MPI.
ClusterWork is based on the [Python Experiment Suite](https://github.com/rueckstiess/expsuite) by Thomas Rückstiess and uses the [job_stream](https://wwoods.github.io/job_stream/) package to distribute the work.

## Installation

0. Creating a virtual environment with [virtualenv](https://virtualenv.pypa.io/en/stable/) or [conda](https://conda.io/miniconda.html) is recommended. For the installation, this virtual environment has to be activated first.
1. Install the required packages:
    1. job_stream requires [boost](http://www.boost.org/) (filesystem, mpi, python, regex, serialization, system, thread) and an MPI implementation (e.g., [OpenMPI](http://www.open-mpi.org/)):
    ```sh
    sudo apt-get install libboost-dev libopenmpi-dev
    ```
    2. ClusterWork requires the Python packages PyYAML, job_stream, and pandas:
    ```sh
    pip install PyYAML job_stream pandas
    ```
2. Clone this repository and install it:
```sh
git clone https://github.com/gregorgebhardt/cluster_work cluster_work
cd cluster_work
pip install .
```

## Get your code on the computing cluster

Running your code on the computing cluster is now a very simple task.
Currently, this requires the following three steps:

1. Write a Python class that inherits from `ClusterWork` and implements at least the methods `reset(self, config: dict, rep: int)` and `iterate(self, config: dict, rep: int, n: int)`.
2. Write a simple YAML file to configure your experiment.
3. Adapt a shell script that starts the experiment on your cluster.

### Subclassing `ClusterWork`

```Python
from cluster_work import ClusterWork


class MyExperiment(ClusterWork):
    # ...

    def reset(self, config=None, rep=0):
        # run code that sets up your experiment for each repetition here
        pass

    def iterate(self, config=None, rep=0, n=0):
        # run your experiment for iteration n and return the results as a
        # dictionary; each key becomes one column in the results table
        pass


# to run the experiments, simply call run() on your derived class
if __name__ == '__main__':
    MyExperiment.run()
```
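
To picture the contract between `reset` and `iterate`, the loop that runs your experiment can be sketched roughly as follows. This is a simplified, self-contained illustration with a toy experiment class, not the actual ClusterWork internals:

```Python
# Simplified sketch of the reset/iterate contract -- illustrative only,
# not the actual ClusterWork implementation.

class ToyExperiment:
    def reset(self, config=None, rep=0):
        # set up the state for this repetition
        self.value = config['start']

    def iterate(self, config=None, rep=0, n=0):
        # one iteration of work; returns a dictionary of results
        self.value += config['step']
        return {'value': self.value}


config = {'start': 0, 'step': 2}
experiment = ToyExperiment()

rows = []
for rep in range(2):            # corresponds to 'repetitions'
    experiment.reset(config, rep)
    for n in range(3):          # corresponds to 'iterations'
        row = experiment.iterate(config, rep, n)
        row.update({'rep': rep, 'iter': n})
        rows.append(row)        # one row per iteration in the results table

# rows now holds 2 repetitions x 3 iterations = 6 result rows
```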

#### Restarting your experiments

ClusterWork also implements a restart functionality. Since your results are stored after each iteration, your experiment can be restarted if its execution was interrupted for some reason. To obtain this functionality, you additionally need to implement at least the method `restore_state(self, config: dict, rep: int, n: int)`. The method `save_state(self, config: dict, rep: int, n: int)` can also be implemented to store additional information that needs to be loaded in the `restore_state` method. Finally, the flag `_restore_supported` must be set to `True`.

```Python
class MyExperiment(ClusterWork):
    _restore_supported = True

    # ...

    def save_state(self, config: dict, rep: int, n: int):
        # save all the necessary information for restoring the state later
        pass

    def restore_state(self, config: dict, rep: int, n: int):
        # load or reconstruct the state at repetition rep and iteration n
        pass
```

#### Default parameters

The parameters for the experiment can be defined in a YAML file that is passed as a command-line argument. Inside the derived class, we can define default parameters as a dictionary in the `_default_params` field:
|
| 79 | +```Python |
| 80 | +class MyExperiment(ClusterWork): |
| 81 | + # ... |
| 82 | +
|
| 83 | + _default_params = { |
| 84 | + # ... |
| 85 | + 'num_episodes': 100, |
| 86 | + 'num_test_episodes': 10, |
| 87 | + 'num_eval_episodes': 10, |
| 88 | + 'num_steps': 30, |
| 89 | + 'num_observations': 30, |
| 90 | +
|
| 91 | + 'optimizer_options': { |
| 92 | + 'maxiter': 100 |
| 93 | + }, |
| 94 | + # ... |
| 95 | + } |
| 96 | +
|
| 97 | + # ... |
| 98 | +``` |
| 99 | +
|
### The Configuration YAML

To configure the execution of the experiment, we need to write a small YAML file. The YAML file consists of several documents, which are separated by a line of `---`. Optionally, the first document can be made a default by setting the key `name` to `"DEFAULT"`. This default document then forms the basis for all following experiment documents. Besides the optional default document, each document represents an experiment. However, experiments can be expanded by the _list_ __or__ _grid_ feature, which is explained below.

The required keys for each experiment are `name`, `repetitions`, `iterations`, and `path`. The parameters found below the key `params` overwrite the default parameters defined in the experiment class. Since the `config` dictionary that is passed to the methods of the ClusterWork subclass is the full configuration generated from the YAML file and the default parameters, additional keys can be used as well.

```YAML
---
# default document denoted by the name "DEFAULT"
name: "DEFAULT"
repetitions: 20
iterations: 5
# this is the path where the results are stored,
# it can be different from the location of the experiment scripts
path: "path/to/experiment/folder"

params:
  num_episodes: 150
  optimizer_options:
    maxiter: 50
---
# 1. experiment
name: "more_test_episodes"
params:
  num_test_episodes: 20
---
# 2. experiment
name: "more_steps"
params:
  num_steps: 50
```
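
The way the DEFAULT document combines with an experiment document can be pictured as a recursive dictionary merge. The sketch below is an assumption about the merge semantics (ClusterWork's exact rules may differ, e.g. for nested keys); `deep_merge` and the two document dictionaries are illustrative:

```Python
# Illustrative sketch of merging a DEFAULT document with an experiment
# document -- an assumption, not ClusterWork's actual merge code.

def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with values from `override` applied recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


default_doc = {
    'repetitions': 20,
    'iterations': 5,
    'params': {'num_episodes': 150, 'optimizer_options': {'maxiter': 50}},
}
experiment_doc = {
    'name': 'more_steps',
    'params': {'num_steps': 50},
}

config = deep_merge(default_doc, experiment_doc)
# config keeps repetitions/iterations from DEFAULT and gains num_steps
```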

#### The list feature

If the key `list` is given in an experiment document, the experiment will be expanded for each value in the given list. For example,

```YAML
# ...
---
name: "test_parameter_a"
params:
  b: 5
list:
  a: [5, 10, 20, 30]
```

creates four experiments, one for each value of `a`. It is also possible to define multiple parameters below the `list` key:

```YAML
# ...
---
name: "test_parameter_a_and_c"
params:
  b: 5
list:
  a: [5, 10, 20, 30]
  c: [1, 2, 3, 4, 5]
```

In this case, the experiment is run for the four parameter combinations `{a: 5, c: 1}`, `{a: 10, c: 2}`, ..., `{a: 30, c: 4}`. Since the list for `a` is shorter than the list for `c`, the remaining values of `c` are ignored.

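This joint expansion behaves like Python's built-in `zip`, which stops at the shortest list. A minimal sketch (the `list_params` dictionary is illustrative, not ClusterWork's internal representation):

```Python
# The `list` expansion iterates the given lists jointly, like zip():
# the shortest list determines the number of experiments.

list_params = {'a': [5, 10, 20, 30], 'c': [1, 2, 3, 4, 5]}

expanded = [dict(zip(list_params.keys(), values))
            for values in zip(*list_params.values())]
# -> [{'a': 5, 'c': 1}, {'a': 10, 'c': 2}, {'a': 20, 'c': 3}, {'a': 30, 'c': 4}]
```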

#### The grid feature

The `grid` feature is similar to the `list` feature; however, instead of iterating over all lists jointly, it spans a grid with all the given values. For example,

```YAML
# ...
---
name: "test_parameters_foo_and_bar"
params:
  a: 5
grid:
  foo: [5, 10, 20, 30]
  bar: [1, 2, 3, 4, 5]
```

would run an experiment for each combination of `foo` and `bar`, in this case 4x5=20 experiments. Note that with more parameters below the `grid` key, the number of experiments explodes exponentially.

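The grid expansion corresponds to the Cartesian product of the value lists, as computed by Python's `itertools.product`. A sketch (the `grid_params` dictionary is illustrative):

```Python
# The `grid` expansion spans the Cartesian product of all value lists.
import itertools

grid_params = {'foo': [5, 10, 20, 30], 'bar': [1, 2, 3, 4, 5]}

expanded = [dict(zip(grid_params.keys(), values))
            for values in itertools.product(*grid_params.values())]
# 4 * 5 = 20 parameter combinations
```
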
### Run the experiment

To run your experiment, you can simply execute your Python script:

```sh
python YOUR_SCRIPT.py YOUR_CONFIG.yml [arguments]
```

The following arguments are available:

+ `-c, --cluster` runs the experiments using the job_stream scheduler. By default, the experiments are executed sequentially in a loop.
+ `-d, --delete` deletes old results before running your experiments.
+ `-e, --experiments [experiments]` chooses the experiments that should run; by default, all experiments will run.
+ `-v, --verbose` shows more output.
+ `-p, --progress` displays only the progress of running or finished experiments.


#### Hostfile for OpenMPI

Depending on the configuration of your cluster, `job_stream` might not know about the available computing units in the cluster. In this case, it is possible to pass a hostfile that specifies the hostnames and available CPUs to job_stream before starting the experiment script.

```sh
job_stream --hostfile HOSTFILE -- python YOUR_SCRIPT.py YOUR_CONFIG.yml [arguments]
```

For a SLURM-based cluster, this hostfile can be created by

```sh
srun hostname > hostfile.$SLURM_JOB_ID
hostfileconv hostfile.$SLURM_JOB_ID
```

|
| 209 | +where `hostfileconv` is a tool provided by ClusterWork that makes sure the hostfile has the right format. In this case the `HOSTFILE` argument for job_stream would be `hostfile.$SLURM_JOB_ID.converted`. |