 1. Convenient high-level interfaces for applications of RL (training an implemented algorithm on a custom environment).
 1. Large scope: online (on- and off-policy) and offline RL, experimental support for multi-agent RL (MARL), experimental support for model-based RL, and more.
 
-
 Unlike other reinforcement learning libraries, which may have complex codebases,
 unfriendly high-level APIs, or are not optimized for speed, Tianshou provides a high-performance, modularized framework
 and user-friendly interfaces for building deep reinforcement learning agents. One more aspect that sets Tianshou apart is its
@@ -183,15 +182,17 @@ Atari and MuJoCo benchmark results can be found in the [examples/atari/](example
 ### Algorithm Abstraction
 
 Reinforcement learning algorithms are built on abstractions for
- * on-policy algorithms (`OnPolicyAlgorithm`),
- * off-policy algorithms (`OffPolicyAlgorithm`), and
- * offline algorithms (`OfflineAlgorithm`),
+
+- on-policy algorithms (`OnPolicyAlgorithm`),
+- off-policy algorithms (`OffPolicyAlgorithm`), and
+- offline algorithms (`OfflineAlgorithm`),
 
 all of which clearly separate the core algorithm from the training process and the respective environment interactions.
 
 In each case, the implementation of an algorithm necessarily involves only the implementation of methods for
- * pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
- * updating model parameters based on an augmented batch of data (`_update_with_batch`).
+
+- pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
+- updating model parameters based on an augmented batch of data (`_update_with_batch`).
 
 The implementation of these methods suffices for a new algorithm to be applicable within Tianshou,
 making experimentation with new approaches particularly straightforward.
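To make the two-method pattern concrete, here is a toy, self-contained sketch. It deliberately does not import Tianshou: the `Batch` stand-in, the base-class skeleton, and all signatures are illustrative assumptions, not the library's actual API; only the method names `_preprocess_batch` and `_update_with_batch` come from the text above.

```python
# Toy sketch of the algorithm abstraction (illustrative; NOT Tianshou's
# actual classes or signatures). The base class fixes the update skeleton;
# a concrete algorithm supplies only the two methods named above.
from dataclasses import dataclass, field


@dataclass
class Batch:  # hypothetical stand-in for a batch of collected transitions
    obs: list
    returns: list = field(default_factory=list)  # filled by pre-processing


class OffPolicyAlgorithmSketch:
    def _preprocess_batch(self, batch: Batch) -> Batch:
        """Augment the batch with sufficient statistics for learning."""
        raise NotImplementedError

    def _update_with_batch(self, batch: Batch) -> float:
        """Update model parameters from an augmented batch; return a loss."""
        raise NotImplementedError

    def update(self, batch: Batch) -> float:
        # The training process only ever calls this fixed skeleton.
        return self._update_with_batch(self._preprocess_batch(batch))


class MyToyAlgorithm(OffPolicyAlgorithmSketch):
    def _preprocess_batch(self, batch: Batch) -> Batch:
        batch.returns = [2.0 * o for o in batch.obs]  # toy "return" estimate
        return batch

    def _update_with_batch(self, batch: Batch) -> float:
        return sum(batch.returns) / len(batch.returns)  # toy "loss"
```

Implementing the two methods is all a new algorithm supplies; the surrounding collection and training loop stays untouched.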
@@ -249,12 +250,12 @@ experiment = (
         ),
         OffPolicyTrainingConfig(
             num_epochs=10,
-            step_per_epoch=10000,
+            epoch_num_steps=10000,
             batch_size=64,
             num_train_envs=10,
             num_test_envs=100,
             buffer_size=20000,
-            step_per_collect=10,
+            collection_step_num_env_steps=10,
             update_per_step=1 / 10,
         ),
     )
@@ -288,21 +289,21 @@ The experiment builder takes three arguments:
 - the training configuration, which controls fundamental training parameters,
   such as the total number of epochs we run the experiment for (`num_epochs=10`)
   and the number of environment steps each epoch shall consist of
-  (`step_per_epoch=10000`).
+  (`epoch_num_steps=10000`).
   Every epoch consists of a series of data collection (rollout) steps and
   training steps.
-  The parameter `step_per_collect` controls the amount of data that is
+  The parameter `collection_step_num_env_steps` controls the amount of data that is
   collected in each collection step; after each collection step, we
   perform a training step, applying a gradient-based update based on a sample
   of data (`batch_size=64`) taken from the buffer of data that has been
   collected. For further details, see the documentation of the configuration class.
 
 We then proceed to configure some of the parameters of the DQN algorithm itself
 and of the neural network model we want to use.
-A DQN-specific detail is the way in which we control the epsilon parameter for
-exploration.
-We want to use random exploration during rollouts for training (`eps_training`),
-but we don't when evaluating the agent's performance in the test environments
+A DQN-specific detail is the way in which we control the epsilon parameter for
+exploration.
+We want to use random exploration during rollouts for training (`eps_training`),
+but we do not want it when evaluating the agent's performance in the test environments
 (`eps_inference`).
 
 Find the script in [examples/discrete/discrete_dqn_hl.py](examples/discrete/discrete_dqn_hl.py).
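As a reminder of what the two epsilon values control, here is a minimal epsilon-greedy sketch in plain Python (not Tianshou code); the concrete values 0.1 and 0.05 mirror the low-level example further below and are otherwise arbitrary.

```python
# Minimal epsilon-greedy action selection (illustrative, not Tianshou code):
# with probability eps pick a uniformly random action, otherwise the greedy one.
import random


def epsilon_greedy(q_values: list, eps: float, rng=random) -> int:
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


eps_training, eps_inference = 0.1, 0.05  # explore more during training rollouts
action = epsilon_greedy([0.2, 0.9, 0.4], eps=eps_training)
```

A larger epsilon during training rollouts keeps the replay buffer diverse, while a small (or zero) epsilon at test time measures the greedy policy's actual performance.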
@@ -340,7 +341,7 @@ train_num, test_num = 10, 100
 gamma, n_step, target_freq = 0.9, 3, 320
 buffer_size = 20000
 eps_train, eps_test = 0.1, 0.05
-step_per_epoch, step_per_collect = 10000, 10
+epoch_num_steps, collection_step_num_env_steps = 10000, 10
 ```
 
 Initialize the logger:
@@ -400,11 +401,11 @@ result = ts.trainer.OffPolicyTrainer(
     train_collector=train_collector,
     test_collector=test_collector,
     max_epoch=epoch,
-    step_per_epoch=step_per_epoch,
-    step_per_collect=step_per_collect,
+    epoch_num_steps=epoch_num_steps,
+    collection_step_num_env_steps=collection_step_num_env_steps,
     episode_per_test=test_num,
     batch_size=batch_size,
-    update_per_step=1 / step_per_collect,
+    update_per_step=1 / collection_step_num_env_steps,
     train_fn=lambda epoch, env_step: policy.set_eps_training(eps_train),
     test_fn=lambda epoch, env_step: policy.set_eps_training(eps_test),
     stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold,
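For intuition about the schedule these settings produce, here is a quick back-of-the-envelope check, assuming (as the `1 / collection_step_num_env_steps` expression suggests) that `update_per_step` counts gradient updates per collected environment step:

```python
# With these values, every collection step gathers 10 env steps and is
# followed by exactly one gradient update (10 steps * 1/10 updates per step).
epoch_num_steps, collection_step_num_env_steps = 10000, 10
update_per_step = 1 / collection_step_num_env_steps

collect_steps_per_epoch = epoch_num_steps // collection_step_num_env_steps
updates_per_epoch = round(epoch_num_steps * update_per_step)

print(collect_steps_per_epoch, updates_per_epoch)  # 1000 1000
```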