Commit 86fba86

Merge remote-tracking branch 'origin/dev-v2' into dev-v2
Conflicts:
  docs/02_notebooks/L0_overview.ipynb
  docs/02_notebooks/L6_Trainer.ipynb
  docs/02_notebooks/L7_Experiment.ipynb

2 parents: e9fe650 + 3ff9423

File tree

117 files changed: +2042, -1963 lines


CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -109,7 +109,9 @@ Developers:
     `LRSchedulerFactory`).
   The parameter `lr_scheduler` has thus been removed from all algorithm constructors.
 * The flag `updating` has been removed (no internal usage, general usefulness questionable).
+* Removed `max_action_num`, instead read it off from `action_space`
 * Parameter changes:
+  * `actor_step_size` -> `trust_region_size` in NP
   * `discount_factor` -> `gamma` (was already used internally almost everywhere)
   * `reward_normalization` -> `return_standardization` or `return_scaling` (more precise naming) or removed (was actually unsupported by Q-learning algorithms)
   * `return_standardization` in `Reinforce` and `DiscreteCRR` (as it applies standardization of returns)
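Downstream code can absorb the renames above with a small migration shim. The helper below is purely illustrative (it is not part of Tianshou) and covers only the parameter renames listed in this changelog entry:

```python
# Hypothetical migration shim -- not part of Tianshou. It maps the old
# constructor parameter names from this changelog entry to their new names.
PARAM_RENAMES = {
    "actor_step_size": "trust_region_size",  # renamed in NP
    "discount_factor": "gamma",
    "reward_normalization": "return_standardization",
}

def migrate_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with outdated parameter names replaced."""
    return {PARAM_RENAMES.get(key, key): value for key, value in kwargs.items()}

print(migrate_kwargs({"discount_factor": 0.99, "actor_step_size": 0.5}))
# {'gamma': 0.99, 'trust_region_size': 0.5}
```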

README.md

Lines changed: 19 additions & 18 deletions

@@ -12,7 +12,6 @@
 1. Convenient high-level interfaces for applications of RL (training an implemented algorithm on a custom environment).
 1. Large scope: online (on- and off-policy) and offline RL, experimental support for multi-agent RL (MARL), experimental support for model-based RL, and more
 
-
 Unlike other reinforcement learning libraries, which may have complex codebases,
 unfriendly high-level APIs, or are not optimized for speed, Tianshou provides a high-performance, modularized framework
 and user-friendly interfaces for building deep reinforcement learning agents. One more aspect that sets Tianshou apart is its
@@ -183,15 +182,17 @@ Atari and MuJoCo benchmark results can be found in the [examples/atari/](example
 ### Algorithm Abstraction
 
 Reinforcement learning algorithms are build on abstractions for
-* on-policy algorithms (`OnPolicyAlgorithm`),
-* off-policy algorithms (`OffPolicyAlgorithm`), and
-* offline algorithms (`OfflineAlgorithm`),
+
+- on-policy algorithms (`OnPolicyAlgorithm`),
+- off-policy algorithms (`OffPolicyAlgorithm`), and
+- offline algorithms (`OfflineAlgorithm`),
 
 all of which clearly separate the core algorithm from the training process and the respective environment interactions.
 
 In each case, the implementation of an algorithm necessarily involves only the implementation of methods for
-* pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
-* updating model parameters based on an augmented batch of data (`_update_with_batch`).
+
+- pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
+- updating model parameters based on an augmented batch of data (`_update_with_batch`).
 
 The implementation of these methods suffices for a new algorithm to be applicable within Tianshou,
 making experimentation with new approaches particularly straightforward.
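The two-method contract described above can be illustrated with a stand-alone sketch. Everything below is a simplified stand-in written in plain Python with no Tianshou imports; the real `OnPolicyAlgorithm`/`OffPolicyAlgorithm`/`OfflineAlgorithm` base classes have a richer interface:

```python
# Simplified stand-in for the algorithm abstraction described above.
# Not Tianshou's actual API -- only the two-method pattern is illustrated.
class Algorithm:
    def update(self, batch: dict) -> dict:
        batch = self._preprocess_batch(batch)
        return self._update_with_batch(batch)

class MonteCarloReturnAlgorithm(Algorithm):  # hypothetical example subclass
    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma

    def _preprocess_batch(self, batch: dict) -> dict:
        # Augment the batch with discounted returns (the "sufficient
        # statistics for learning" in this toy case).
        returns, g = [], 0.0
        for reward in reversed(batch["rewards"]):
            g = reward + self.gamma * g
            returns.append(g)
        batch["returns"] = list(reversed(returns))
        return batch

    def _update_with_batch(self, batch: dict) -> dict:
        # A real algorithm would apply a gradient-based update here; we only
        # report a statistic to keep the sketch self-contained.
        return {"mean_return": sum(batch["returns"]) / len(batch["returns"])}

algo = MonteCarloReturnAlgorithm(gamma=0.5)
print(algo.update({"rewards": [1.0, 1.0]}))
# {'mean_return': 1.25}
```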
@@ -249,12 +250,12 @@ experiment = (
         ),
         OffPolicyTrainingConfig(
             num_epochs=10,
-            step_per_epoch=10000,
+            epoch_num_steps=10000,
             batch_size=64,
             num_train_envs=10,
             num_test_envs=100,
             buffer_size=20000,
-            step_per_collect=10,
+            collection_step_num_env_steps=10,
             update_per_step=1 / 10,
         ),
     )
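As a back-of-envelope check of the renamed configuration values (plain Python, independent of Tianshou):

```python
# Values from the OffPolicyTrainingConfig above, after the rename.
epoch_num_steps = 10000             # env steps per epoch (was step_per_epoch)
collection_step_num_env_steps = 10  # env steps per collection step (was step_per_collect)
update_per_step = 1 / 10            # gradient updates per collected env step

collect_steps_per_epoch = epoch_num_steps // collection_step_num_env_steps
updates_per_epoch = round(epoch_num_steps * update_per_step)
print(collect_steps_per_epoch, updates_per_epoch)  # 1000 1000
```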
@@ -288,21 +289,21 @@ The experiment builder takes three arguments:
 - the training configuration, which controls fundamental training parameters,
   such as the total number of epochs we run the experiment for (`num_epochs=10`)
   and the number of environment steps each epoch shall consist of
-  (`step_per_epoch=10000`).
+  (`epoch_num_steps=10000`).
   Every epoch consists of a series of data collection (rollout) steps and
   training steps.
-  The parameter `step_per_collect` controls the amount of data that is
+  The parameter `collection_step_num_env_steps` controls the amount of data that is
   collected in each collection step and after each collection step, we
   perform a training step, applying a gradient-based update based on a sample
   of data (`batch_size=64`) taken from the buffer of data that has been
   collected. For further details, see the documentation of configuration class.
 
 We then proceed to configure some of the parameters of the DQN algorithm itself
 and of the neural network model we want to use.
-A DQN-specific detail is the way in which we control the epsilon parameter for
-exploration.
-We want to use random exploration during rollouts for training (`eps_training`),
-but we don't when evaluating the agent's performance in the test environments
+A DQN-specific detail is the way in which we control the epsilon parameter for
+exploration.
+We want to use random exploration during rollouts for training (`eps_training`),
+but we don't when evaluating the agent's performance in the test environments
 (`eps_inference`).
 
 Find the script in [examples/discrete/discrete_dqn_hl.py](examples/discrete/discrete_dqn_hl.py).
@@ -340,7 +341,7 @@ train_num, test_num = 10, 100
 gamma, n_step, target_freq = 0.9, 3, 320
 buffer_size = 20000
 eps_train, eps_test = 0.1, 0.05
-step_per_epoch, step_per_collect = 10000, 10
+epoch_num_steps, collection_step_num_env_steps = 10000, 10
 ```
 
 Initialize the logger:
@@ -400,11 +401,11 @@ result = ts.trainer.OffPolicyTrainer(
     train_collector=train_collector,
     test_collector=test_collector,
     max_epoch=epoch,
-    step_per_epoch=step_per_epoch,
-    step_per_collect=step_per_collect,
+    epoch_num_steps=epoch_num_steps,
+    collection_step_num_env_steps=collection_step_num_env_steps,
     episode_per_test=test_num,
     batch_size=batch_size,
-    update_per_step=1 / step_per_collect,
+    update_per_step=1 / collection_step_num_env_steps,
     train_fn=lambda epoch, env_step: policy.set_eps_training(eps_train),
     test_fn=lambda epoch, env_step: policy.set_eps_training(eps_test),
     stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold,

docs/01_tutorials/00_dqn.rst

Lines changed: 3 additions & 3 deletions

@@ -198,7 +198,7 @@ reaches the stop condition ``stop_fn`` on test collector. Since DQN is an off-po
     policy=policy,
     train_collector=train_collector,
     test_collector=test_collector,
-    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
+    max_epoch=10, epoch_num_steps=10000, collection_step_num_env_steps=10,
     update_per_step=0.1, episode_per_test=100, batch_size=64,
     train_fn=lambda epoch, env_step: policy.set_eps(0.1),
     test_fn=lambda epoch, env_step: policy.set_eps(0.05),
@@ -209,8 +209,8 @@ reaches the stop condition ``stop_fn`` on test collector. Since DQN is an off-po
 The meaning of each parameter is as follows (full description can be found at :class:`~tianshou.trainer.OffpolicyTrainer`):
 
 * ``max_epoch``: The maximum of epochs for training. The training process might be finished before reaching the ``max_epoch``;
-* ``step_per_epoch``: The number of environment step (a.k.a. transition) collected per epoch;
-* ``step_per_collect``: The number of transition the collector would collect before the network update. For example, the code above means "collect 10 transitions and do one policy network update";
+* ``epoch_num_steps``: The number of environment step (a.k.a. transition) collected per epoch;
+* ``collection_step_num_env_steps``: The number of transition the collector would collect before the network update. For example, the code above means "collect 10 transitions and do one policy network update";
 * ``episode_per_test``: The number of episodes for one policy evaluation.
 * ``batch_size``: The batch size of sample data, which is going to feed in the policy network.
 * ``train_fn``: A function receives the current number of epoch and step index, and performs some operations at the beginning of training in this epoch. For example, the code above means "reset the epsilon to 0.1 in DQN before training".

docs/01_tutorials/04_tictactoe.rst

Lines changed: 9 additions & 9 deletions

@@ -224,15 +224,15 @@ The explanation of each Tianshou class/function will be deferred to their first
     parser.add_argument('--n-step', type=int, default=3)
     parser.add_argument('--target-update-freq', type=int, default=320)
     parser.add_argument('--epoch', type=int, default=50)
-    parser.add_argument('--step-per-epoch', type=int, default=1000)
-    parser.add_argument('--step-per-collect', type=int, default=10)
+    parser.add_argument('--epoch_num_steps', type=int, default=1000)
+    parser.add_argument('--collection_step_num_env_steps', type=int, default=10)
     parser.add_argument('--update-per-step', type=float, default=0.1)
-    parser.add_argument('--batch-size', type=int, default=64)
+    parser.add_argument('--batch_size', type=int, default=64)
     parser.add_argument(
         '--hidden-sizes', type=int, nargs='*', default=[128, 128, 128, 128]
     )
-    parser.add_argument('--training-num', type=int, default=10)
-    parser.add_argument('--test-num', type=int, default=10)
+    parser.add_argument('--num_train_envs', type=int, default=10)
+    parser.add_argument('--num_test_envs', type=int, default=10)
     parser.add_argument('--logdir', type=str, default='log')
     parser.add_argument('--render', type=float, default=0.1)
     parser.add_argument(
@@ -356,7 +356,7 @@ With the above preparation, we are close to the first learned agent. The followi
     ) -> Tuple[dict, BasePolicy]:
 
     # ======== environment setup =========
-    train_envs = DummyVectorEnv([get_env for _ in range(args.training_num)])
+    train_envs = DummyVectorEnv([get_env for _ in range(args.num_train_envs)])
     test_envs = DummyVectorEnv([get_env for _ in range(args.test_num)])
     # seed
     np.random.seed(args.seed)
@@ -378,7 +378,7 @@ With the above preparation, we are close to the first learned agent. The followi
     )
     test_collector = Collector(policy, test_envs, exploration_noise=True)
     # policy.set_eps(1)
-    train_collector.collect(n_step=args.batch_size * args.training_num)
+    train_collector.collect(n_step=args.batch_size * args.num_train_envs)
 
     # ======== tensorboard logging setup =========
     log_path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn')
@@ -416,8 +416,8 @@ With the above preparation, we are close to the first learned agent. The followi
         train_collector,
         test_collector,
         args.epoch,
-        args.step_per_epoch,
-        args.step_per_collect,
+        args.epoch_num_steps,
+        args.collection_step_num_env_steps,
         args.test_num,
         args.batch_size,
         train_fn=train_fn,
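The renamed CLI flags can be sanity-checked with a stand-alone argparse snippet (a subset of the parser above, with defaults taken from the diff):

```python
import argparse

# Subset of the renamed flags from the diff above.
parser = argparse.ArgumentParser()
parser.add_argument('--epoch_num_steps', type=int, default=1000)
parser.add_argument('--collection_step_num_env_steps', type=int, default=10)
parser.add_argument('--num_train_envs', type=int, default=10)
parser.add_argument('--num_test_envs', type=int, default=10)

args = parser.parse_args([])  # use defaults
print(args.epoch_num_steps, args.num_train_envs)  # 1000 10
```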

examples/atari/README.md

Lines changed: 12 additions & 12 deletions

@@ -24,13 +24,13 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters | time cost |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ | ------------------- |
-| PongNoFrameskip-v4 | 20 | ![](results/dqn/Pong_rew.png) | `python3 atari_dqn.py --task "PongNoFrameskip-v4" --batch-size 64` | ~30 min (~15 epoch) |
-| BreakoutNoFrameskip-v4 | 316 | ![](results/dqn/Breakout_rew.png) | `python3 atari_dqn.py --task "BreakoutNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| EnduroNoFrameskip-v4 | 670 | ![](results/dqn/Enduro_rew.png) | `python3 atari_dqn.py --task "EnduroNoFrameskip-v4 " --test-num 100` | 3~4h (100 epoch) |
-| QbertNoFrameskip-v4 | 7307 | ![](results/dqn/Qbert_rew.png) | `python3 atari_dqn.py --task "QbertNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| MsPacmanNoFrameskip-v4 | 2107 | ![](results/dqn/MsPacman_rew.png) | `python3 atari_dqn.py --task "MsPacmanNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| SeaquestNoFrameskip-v4 | 2088 | ![](results/dqn/Seaquest_rew.png) | `python3 atari_dqn.py --task "SeaquestNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| SpaceInvadersNoFrameskip-v4 | 812.2 | ![](results/dqn/SpaceInvader_rew.png) | `python3 atari_dqn.py --task "SpaceInvadersNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
+| PongNoFrameskip-v4 | 20 | ![](results/dqn/Pong_rew.png) | `python3 atari_dqn.py --task "PongNoFrameskip-v4" --batch_size 64` | ~30 min (~15 epoch) |
+| BreakoutNoFrameskip-v4 | 316 | ![](results/dqn/Breakout_rew.png) | `python3 atari_dqn.py --task "BreakoutNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| EnduroNoFrameskip-v4 | 670 | ![](results/dqn/Enduro_rew.png) | `python3 atari_dqn.py --task "EnduroNoFrameskip-v4 " --num_test_envs 100` | 3~4h (100 epoch) |
+| QbertNoFrameskip-v4 | 7307 | ![](results/dqn/Qbert_rew.png) | `python3 atari_dqn.py --task "QbertNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| MsPacmanNoFrameskip-v4 | 2107 | ![](results/dqn/MsPacman_rew.png) | `python3 atari_dqn.py --task "MsPacmanNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| SeaquestNoFrameskip-v4 | 2088 | ![](results/dqn/Seaquest_rew.png) | `python3 atari_dqn.py --task "SeaquestNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| SpaceInvadersNoFrameskip-v4 | 812.2 | ![](results/dqn/SpaceInvader_rew.png) | `python3 atari_dqn.py --task "SpaceInvadersNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
 
 Note: The `eps_train_final` and `eps_test` in the original DQN paper is 0.1 and 0.01, but [some works](https://github.com/google/dopamine/tree/master/baselines) found that smaller eps helps improve the performance. Also, a large batchsize (say 64 instead of 32) will help faster convergence but will slow down the training speed.
@@ -42,7 +42,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20 | ![](results/c51/Pong_rew.png) | `python3 atari_c51.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20 | ![](results/c51/Pong_rew.png) | `python3 atari_c51.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 536.6 | ![](results/c51/Breakout_rew.png) | `python3 atari_c51.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1032 | ![](results/c51/Enduro_rew.png) | `python3 atari_c51.py --task "EnduroNoFrameskip-v4 " ` |
 | QbertNoFrameskip-v4 | 16245 | ![](results/c51/Qbert_rew.png) | `python3 atari_c51.py --task "QbertNoFrameskip-v4"` |

@@ -58,7 +58,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20 | ![](results/qrdqn/Pong_rew.png) | `python3 atari_qrdqn.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20 | ![](results/qrdqn/Pong_rew.png) | `python3 atari_qrdqn.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 409.2 | ![](results/qrdqn/Breakout_rew.png) | `python3 atari_qrdqn.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1055.9 | ![](results/qrdqn/Enduro_rew.png) | `python3 atari_qrdqn.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 14990 | ![](results/qrdqn/Qbert_rew.png) | `python3 atari_qrdqn.py --task "QbertNoFrameskip-v4"` |

@@ -72,7 +72,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20.3 | ![](results/iqn/Pong_rew.png) | `python3 atari_iqn.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20.3 | ![](results/iqn/Pong_rew.png) | `python3 atari_iqn.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 496.7 | ![](results/iqn/Breakout_rew.png) | `python3 atari_iqn.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1545 | ![](results/iqn/Enduro_rew.png) | `python3 atari_iqn.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 15342.5 | ![](results/iqn/Qbert_rew.png) | `python3 atari_iqn.py --task "QbertNoFrameskip-v4"` |

@@ -86,7 +86,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20.7 | ![](results/fqf/Pong_rew.png) | `python3 atari_fqf.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20.7 | ![](results/fqf/Pong_rew.png) | `python3 atari_fqf.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 517.3 | ![](results/fqf/Breakout_rew.png) | `python3 atari_fqf.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 2240.5 | ![](results/fqf/Enduro_rew.png) | `python3 atari_fqf.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 16172.5 | ![](results/fqf/Qbert_rew.png) | `python3 atari_fqf.py --task "QbertNoFrameskip-v4"` |

@@ -100,7 +100,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 21 | ![](results/rainbow/Pong_rew.png) | `python3 atari_rainbow.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 21 | ![](results/rainbow/Pong_rew.png) | `python3 atari_rainbow.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 684.6 | ![](results/rainbow/Breakout_rew.png) | `python3 atari_rainbow.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1625.9 | ![](results/rainbow/Enduro_rew.png) | `python3 atari_rainbow.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 16192.5 | ![](results/rainbow/Qbert_rew.png) | `python3 atari_rainbow.py --task "QbertNoFrameskip-v4"` |
