Commit 86fba86

Merge remote-tracking branch 'origin/dev-v2' into dev-v2
Conflicts:
  docs/02_notebooks/L0_overview.ipynb
  docs/02_notebooks/L6_Trainer.ipynb
  docs/02_notebooks/L7_Experiment.ipynb

2 parents: e9fe650 + 3ff9423

File tree

117 files changed: +2042, -1963 lines


CHANGELOG.md

Lines changed: 2 additions & 0 deletions

@@ -109,7 +109,9 @@ Developers:
     `LRSchedulerFactory`).
   The parameter `lr_scheduler` has thus been removed from all algorithm constructors.
 * The flag `updating` has been removed (no internal usage, general usefulness questionable).
+* Removed `max_action_num`, instead read it off from `action_space`
 * Parameter changes:
+  * `actor_step_size` -> `trust_region_size` in NP
   * `discount_factor` -> `gamma` (was already used internally almost everywhere)
   * `reward_normalization` -> `return_standardization` or `return_scaling` (more precise naming) or removed (was actually unsupported by Q-learning algorithms)
   * `return_standardization` in `Reinforce` and `DiscreteCRR` (as it applies standardization of returns)
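Downstream code can absorb the renames above with a small migration shim. The helper below is purely illustrative (it is not part of Tianshou) and covers only the parameter renames listed in this changelog entry:

```python
# Hypothetical migration shim -- not part of Tianshou. It maps the old
# constructor parameter names from this changelog entry to their new names.
PARAM_RENAMES = {
    "actor_step_size": "trust_region_size",  # renamed in NP
    "discount_factor": "gamma",
    "reward_normalization": "return_standardization",
}

def migrate_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with outdated parameter names replaced."""
    return {PARAM_RENAMES.get(key, key): value for key, value in kwargs.items()}

print(migrate_kwargs({"discount_factor": 0.99, "actor_step_size": 0.5}))
# {'gamma': 0.99, 'trust_region_size': 0.5}
```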

README.md

Lines changed: 19 additions & 18 deletions

@@ -12,7 +12,6 @@
 1. Convenient high-level interfaces for applications of RL (training an implemented algorithm on a custom environment).
 1. Large scope: online (on- and off-policy) and offline RL, experimental support for multi-agent RL (MARL), experimental support for model-based RL, and more
 
-
 Unlike other reinforcement learning libraries, which may have complex codebases,
 unfriendly high-level APIs, or are not optimized for speed, Tianshou provides a high-performance, modularized framework
 and user-friendly interfaces for building deep reinforcement learning agents. One more aspect that sets Tianshou apart is its
@@ -183,15 +182,17 @@ Atari and MuJoCo benchmark results can be found in the [examples/atari/](example
 ### Algorithm Abstraction
 
 Reinforcement learning algorithms are build on abstractions for
-* on-policy algorithms (`OnPolicyAlgorithm`),
-* off-policy algorithms (`OffPolicyAlgorithm`), and
-* offline algorithms (`OfflineAlgorithm`),
+
+- on-policy algorithms (`OnPolicyAlgorithm`),
+- off-policy algorithms (`OffPolicyAlgorithm`), and
+- offline algorithms (`OfflineAlgorithm`),
 
 all of which clearly separate the core algorithm from the training process and the respective environment interactions.
 
 In each case, the implementation of an algorithm necessarily involves only the implementation of methods for
-* pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
-* updating model parameters based on an augmented batch of data (`_update_with_batch`).
+
+- pre-processing a batch of data, augmenting it with necessary information/sufficient statistics for learning (`_preprocess_batch`),
+- updating model parameters based on an augmented batch of data (`_update_with_batch`).
 
 The implementation of these methods suffices for a new algorithm to be applicable within Tianshou,
 making experimentation with new approaches particularly straightforward.
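The two-method contract described above can be illustrated with a stand-alone sketch. Everything below is a simplified stand-in written in plain Python with no Tianshou imports; the real `OnPolicyAlgorithm`/`OffPolicyAlgorithm`/`OfflineAlgorithm` base classes have a richer interface:

```python
# Simplified stand-in for the algorithm abstraction described above.
# Not Tianshou's actual API -- only the two-method pattern is illustrated.
class Algorithm:
    def update(self, batch: dict) -> dict:
        batch = self._preprocess_batch(batch)
        return self._update_with_batch(batch)

class MonteCarloReturnAlgorithm(Algorithm):  # hypothetical example subclass
    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma

    def _preprocess_batch(self, batch: dict) -> dict:
        # Augment the batch with discounted returns (the "sufficient
        # statistics for learning" in this toy case).
        returns, g = [], 0.0
        for reward in reversed(batch["rewards"]):
            g = reward + self.gamma * g
            returns.append(g)
        batch["returns"] = list(reversed(returns))
        return batch

    def _update_with_batch(self, batch: dict) -> dict:
        # A real algorithm would apply a gradient-based update here; we only
        # report a statistic to keep the sketch self-contained.
        return {"mean_return": sum(batch["returns"]) / len(batch["returns"])}

algo = MonteCarloReturnAlgorithm(gamma=0.5)
print(algo.update({"rewards": [1.0, 1.0]}))
# {'mean_return': 1.25}
```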
@@ -249,12 +250,12 @@ experiment = (
         ),
         OffPolicyTrainingConfig(
             num_epochs=10,
-            step_per_epoch=10000,
+            epoch_num_steps=10000,
             batch_size=64,
             num_train_envs=10,
             num_test_envs=100,
             buffer_size=20000,
-            step_per_collect=10,
+            collection_step_num_env_steps=10,
             update_per_step=1 / 10,
         ),
     )
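As a back-of-envelope check of the renamed configuration values (plain Python, independent of Tianshou):

```python
# Values from the OffPolicyTrainingConfig above, after the rename.
epoch_num_steps = 10000             # env steps per epoch (was step_per_epoch)
collection_step_num_env_steps = 10  # env steps per collection step (was step_per_collect)
update_per_step = 1 / 10            # gradient updates per collected env step

collect_steps_per_epoch = epoch_num_steps // collection_step_num_env_steps
updates_per_epoch = round(epoch_num_steps * update_per_step)
print(collect_steps_per_epoch, updates_per_epoch)  # 1000 1000
```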
@@ -288,21 +289,21 @@ The experiment builder takes three arguments:
 - the training configuration, which controls fundamental training parameters,
   such as the total number of epochs we run the experiment for (`num_epochs=10`)
   and the number of environment steps each epoch shall consist of
-  (`step_per_epoch=10000`).
+  (`epoch_num_steps=10000`).
   Every epoch consists of a series of data collection (rollout) steps and
   training steps.
-  The parameter `step_per_collect` controls the amount of data that is
+  The parameter `collection_step_num_env_steps` controls the amount of data that is
   collected in each collection step and after each collection step, we
   perform a training step, applying a gradient-based update based on a sample
   of data (`batch_size=64`) taken from the buffer of data that has been
   collected. For further details, see the documentation of configuration class.
 
 We then proceed to configure some of the parameters of the DQN algorithm itself
 and of the neural network model we want to use.
-A DQN-specific detail is the way in which we control the epsilon parameter for
-exploration.
-We want to use random exploration during rollouts for training (`eps_training`),
-but we don't when evaluating the agent's performance in the test environments
+A DQN-specific detail is the way in which we control the epsilon parameter for
+exploration.
+We want to use random exploration during rollouts for training (`eps_training`),
+but we don't when evaluating the agent's performance in the test environments
 (`eps_inference`).
 
 Find the script in [examples/discrete/discrete_dqn_hl.py](examples/discrete/discrete_dqn_hl.py).
@@ -340,7 +341,7 @@ train_num, test_num = 10, 100
 gamma, n_step, target_freq = 0.9, 3, 320
 buffer_size = 20000
 eps_train, eps_test = 0.1, 0.05
-step_per_epoch, step_per_collect = 10000, 10
+epoch_num_steps, collection_step_num_env_steps = 10000, 10
 ```
 
 Initialize the logger:
@@ -400,11 +401,11 @@ result = ts.trainer.OffPolicyTrainer(
     train_collector=train_collector,
     test_collector=test_collector,
     max_epoch=epoch,
-    step_per_epoch=step_per_epoch,
-    step_per_collect=step_per_collect,
+    epoch_num_steps=epoch_num_steps,
+    collection_step_num_env_steps=collection_step_num_env_steps,
     episode_per_test=test_num,
     batch_size=batch_size,
-    update_per_step=1 / step_per_collect,
+    update_per_step=1 / collection_step_num_env_steps,
     train_fn=lambda epoch, env_step: policy.set_eps_training(eps_train),
     test_fn=lambda epoch, env_step: policy.set_eps_training(eps_test),
     stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold,

docs/01_tutorials/00_dqn.rst

Lines changed: 3 additions & 3 deletions

@@ -198,7 +198,7 @@ reaches the stop condition ``stop_fn`` on test collector. Since DQN is an off-po
     policy=policy,
     train_collector=train_collector,
     test_collector=test_collector,
-    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
+    max_epoch=10, epoch_num_steps=10000, collection_step_num_env_steps=10,
     update_per_step=0.1, episode_per_test=100, batch_size=64,
     train_fn=lambda epoch, env_step: policy.set_eps(0.1),
     test_fn=lambda epoch, env_step: policy.set_eps(0.05),
@@ -209,8 +209,8 @@ reaches the stop condition ``stop_fn`` on test collector. Since DQN is an off-po
 The meaning of each parameter is as follows (full description can be found at :class:`~tianshou.trainer.OffpolicyTrainer`):
 
 * ``max_epoch``: The maximum of epochs for training. The training process might be finished before reaching the ``max_epoch``;
-* ``step_per_epoch``: The number of environment step (a.k.a. transition) collected per epoch;
-* ``step_per_collect``: The number of transition the collector would collect before the network update. For example, the code above means "collect 10 transitions and do one policy network update";
+* ``epoch_num_steps``: The number of environment step (a.k.a. transition) collected per epoch;
+* ``collection_step_num_env_steps``: The number of transition the collector would collect before the network update. For example, the code above means "collect 10 transitions and do one policy network update";
 * ``episode_per_test``: The number of episodes for one policy evaluation.
 * ``batch_size``: The batch size of sample data, which is going to feed in the policy network.
 * ``train_fn``: A function receives the current number of epoch and step index, and performs some operations at the beginning of training in this epoch. For example, the code above means "reset the epsilon to 0.1 in DQN before training".

docs/01_tutorials/04_tictactoe.rst

Lines changed: 9 additions & 9 deletions

@@ -224,15 +224,15 @@ The explanation of each Tianshou class/function will be deferred to their first
     parser.add_argument('--n-step', type=int, default=3)
     parser.add_argument('--target-update-freq', type=int, default=320)
     parser.add_argument('--epoch', type=int, default=50)
-    parser.add_argument('--step-per-epoch', type=int, default=1000)
-    parser.add_argument('--step-per-collect', type=int, default=10)
+    parser.add_argument('--epoch_num_steps', type=int, default=1000)
+    parser.add_argument('--collection_step_num_env_steps', type=int, default=10)
     parser.add_argument('--update-per-step', type=float, default=0.1)
-    parser.add_argument('--batch-size', type=int, default=64)
+    parser.add_argument('--batch_size', type=int, default=64)
     parser.add_argument(
         '--hidden-sizes', type=int, nargs='*', default=[128, 128, 128, 128]
     )
-    parser.add_argument('--training-num', type=int, default=10)
-    parser.add_argument('--test-num', type=int, default=10)
+    parser.add_argument('--num_train_envs', type=int, default=10)
+    parser.add_argument('--num_test_envs', type=int, default=10)
     parser.add_argument('--logdir', type=str, default='log')
     parser.add_argument('--render', type=float, default=0.1)
     parser.add_argument(
@@ -356,7 +356,7 @@ With the above preparation, we are close to the first learned agent. The followi
     ) -> Tuple[dict, BasePolicy]:
 
     # ======== environment setup =========
-    train_envs = DummyVectorEnv([get_env for _ in range(args.training_num)])
+    train_envs = DummyVectorEnv([get_env for _ in range(args.num_train_envs)])
     test_envs = DummyVectorEnv([get_env for _ in range(args.test_num)])
     # seed
     np.random.seed(args.seed)
@@ -378,7 +378,7 @@ With the above preparation, we are close to the first learned agent. The followi
     )
     test_collector = Collector(policy, test_envs, exploration_noise=True)
     # policy.set_eps(1)
-    train_collector.collect(n_step=args.batch_size * args.training_num)
+    train_collector.collect(n_step=args.batch_size * args.num_train_envs)
 
     # ======== tensorboard logging setup =========
     log_path = os.path.join(args.logdir, 'tic_tac_toe', 'dqn')
@@ -416,8 +416,8 @@ With the above preparation, we are close to the first learned agent. The followi
         train_collector,
         test_collector,
         args.epoch,
-        args.step_per_epoch,
-        args.step_per_collect,
+        args.epoch_num_steps,
+        args.collection_step_num_env_steps,
         args.test_num,
         args.batch_size,
         train_fn=train_fn,
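The renamed CLI flags can be sanity-checked with a stand-alone argparse snippet (a subset of the parser above, with defaults taken from the diff):

```python
import argparse

# Subset of the renamed flags from the diff above.
parser = argparse.ArgumentParser()
parser.add_argument('--epoch_num_steps', type=int, default=1000)
parser.add_argument('--collection_step_num_env_steps', type=int, default=10)
parser.add_argument('--num_train_envs', type=int, default=10)
parser.add_argument('--num_test_envs', type=int, default=10)

args = parser.parse_args([])  # use defaults
print(args.epoch_num_steps, args.num_train_envs)  # 1000 10
```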

examples/atari/README.md

Lines changed: 12 additions & 12 deletions

@@ -24,13 +24,13 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters | time cost |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ | ------------------- |
-| PongNoFrameskip-v4 | 20 | ![](results/dqn/Pong_rew.png) | `python3 atari_dqn.py --task "PongNoFrameskip-v4" --batch-size 64` | ~30 min (~15 epoch) |
-| BreakoutNoFrameskip-v4 | 316 | ![](results/dqn/Breakout_rew.png) | `python3 atari_dqn.py --task "BreakoutNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| EnduroNoFrameskip-v4 | 670 | ![](results/dqn/Enduro_rew.png) | `python3 atari_dqn.py --task "EnduroNoFrameskip-v4 " --test-num 100` | 3~4h (100 epoch) |
-| QbertNoFrameskip-v4 | 7307 | ![](results/dqn/Qbert_rew.png) | `python3 atari_dqn.py --task "QbertNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| MsPacmanNoFrameskip-v4 | 2107 | ![](results/dqn/MsPacman_rew.png) | `python3 atari_dqn.py --task "MsPacmanNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| SeaquestNoFrameskip-v4 | 2088 | ![](results/dqn/Seaquest_rew.png) | `python3 atari_dqn.py --task "SeaquestNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
-| SpaceInvadersNoFrameskip-v4 | 812.2 | ![](results/dqn/SpaceInvader_rew.png) | `python3 atari_dqn.py --task "SpaceInvadersNoFrameskip-v4" --test-num 100` | 3~4h (100 epoch) |
+| PongNoFrameskip-v4 | 20 | ![](results/dqn/Pong_rew.png) | `python3 atari_dqn.py --task "PongNoFrameskip-v4" --batch_size 64` | ~30 min (~15 epoch) |
+| BreakoutNoFrameskip-v4 | 316 | ![](results/dqn/Breakout_rew.png) | `python3 atari_dqn.py --task "BreakoutNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| EnduroNoFrameskip-v4 | 670 | ![](results/dqn/Enduro_rew.png) | `python3 atari_dqn.py --task "EnduroNoFrameskip-v4 " --num_test_envs 100` | 3~4h (100 epoch) |
+| QbertNoFrameskip-v4 | 7307 | ![](results/dqn/Qbert_rew.png) | `python3 atari_dqn.py --task "QbertNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| MsPacmanNoFrameskip-v4 | 2107 | ![](results/dqn/MsPacman_rew.png) | `python3 atari_dqn.py --task "MsPacmanNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| SeaquestNoFrameskip-v4 | 2088 | ![](results/dqn/Seaquest_rew.png) | `python3 atari_dqn.py --task "SeaquestNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
+| SpaceInvadersNoFrameskip-v4 | 812.2 | ![](results/dqn/SpaceInvader_rew.png) | `python3 atari_dqn.py --task "SpaceInvadersNoFrameskip-v4" --num_test_envs 100` | 3~4h (100 epoch) |
 
 Note: The `eps_train_final` and `eps_test` in the original DQN paper is 0.1 and 0.01, but [some works](https://github.com/google/dopamine/tree/master/baselines) found that smaller eps helps improve the performance. Also, a large batchsize (say 64 instead of 32) will help faster convergence but will slow down the training speed.
@@ -42,7 +42,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20 | ![](results/c51/Pong_rew.png) | `python3 atari_c51.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20 | ![](results/c51/Pong_rew.png) | `python3 atari_c51.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 536.6 | ![](results/c51/Breakout_rew.png) | `python3 atari_c51.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1032 | ![](results/c51/Enduro_rew.png) | `python3 atari_c51.py --task "EnduroNoFrameskip-v4 " ` |
 | QbertNoFrameskip-v4 | 16245 | ![](results/c51/Qbert_rew.png) | `python3 atari_c51.py --task "QbertNoFrameskip-v4"` |

@@ -58,7 +58,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20 | ![](results/qrdqn/Pong_rew.png) | `python3 atari_qrdqn.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20 | ![](results/qrdqn/Pong_rew.png) | `python3 atari_qrdqn.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 409.2 | ![](results/qrdqn/Breakout_rew.png) | `python3 atari_qrdqn.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1055.9 | ![](results/qrdqn/Enduro_rew.png) | `python3 atari_qrdqn.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 14990 | ![](results/qrdqn/Qbert_rew.png) | `python3 atari_qrdqn.py --task "QbertNoFrameskip-v4"` |

@@ -72,7 +72,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20.3 | ![](results/iqn/Pong_rew.png) | `python3 atari_iqn.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20.3 | ![](results/iqn/Pong_rew.png) | `python3 atari_iqn.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 496.7 | ![](results/iqn/Breakout_rew.png) | `python3 atari_iqn.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1545 | ![](results/iqn/Enduro_rew.png) | `python3 atari_iqn.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 15342.5 | ![](results/iqn/Qbert_rew.png) | `python3 atari_iqn.py --task "QbertNoFrameskip-v4"` |

@@ -86,7 +86,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 20.7 | ![](results/fqf/Pong_rew.png) | `python3 atari_fqf.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 20.7 | ![](results/fqf/Pong_rew.png) | `python3 atari_fqf.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 517.3 | ![](results/fqf/Breakout_rew.png) | `python3 atari_fqf.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 2240.5 | ![](results/fqf/Enduro_rew.png) | `python3 atari_fqf.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 16172.5 | ![](results/fqf/Qbert_rew.png) | `python3 atari_fqf.py --task "QbertNoFrameskip-v4"` |

@@ -100,7 +100,7 @@ One epoch here is equal to 100,000 env step, 100 epochs stand for 10M.
 
 | task | best reward | reward curve | parameters |
 | --------------------------- | ----------- | ------------------------------------- | ------------------------------------------------------------ |
-| PongNoFrameskip-v4 | 21 | ![](results/rainbow/Pong_rew.png) | `python3 atari_rainbow.py --task "PongNoFrameskip-v4" --batch-size 64` |
+| PongNoFrameskip-v4 | 21 | ![](results/rainbow/Pong_rew.png) | `python3 atari_rainbow.py --task "PongNoFrameskip-v4" --batch_size 64` |
 | BreakoutNoFrameskip-v4 | 684.6 | ![](results/rainbow/Breakout_rew.png) | `python3 atari_rainbow.py --task "BreakoutNoFrameskip-v4" --n-step 1` |
 | EnduroNoFrameskip-v4 | 1625.9 | ![](results/rainbow/Enduro_rew.png) | `python3 atari_rainbow.py --task "EnduroNoFrameskip-v4"` |
 | QbertNoFrameskip-v4 | 16192.5 | ![](results/rainbow/Qbert_rew.png) | `python3 atari_rainbow.py --task "QbertNoFrameskip-v4"` |
