
Commit de5893f

Update Docs for v0.11 release (#1056)
* update run function * update docs * fix naming, update docs * fix random walk example * Add wrapper test * update docs * update docs * fix / update docs * bump trajectories * fix tests * add player type * syntax * migrate tictactoe * fix import * Add RLCore as dependency to RLEnvs * update player * Fix tests * Fix player state in abstract_learner.jl * type annotations * Add PlayerNamedTuple * Fix files * Simplify Player syntax * symbol -> player * Fix tests * fix * Move player struct * Fix tests * Fix typo * Fix * Fix player * Fix test * Fix Poker * Fix wrapper * Fix tests * Fix naming * Fix env tests * Fix KuhnPoker * Fix env * Fix type ambiguity * Fix pigenv * Fix tic tac toe * Fix errors --------- Co-authored-by: Jeremiah Lewis <--get>
1 parent 18b1e1f commit de5893f


46 files changed: +566 −416 lines

Project.toml

+2 −2
@@ -12,9 +12,9 @@ ReinforcementLearningEnvironments = "25e41dd2-4622-11e9-1641-f1adca772921"

 [compat]
 Reexport = "0.2, 1"
-ReinforcementLearningBase = "0.12"
+ReinforcementLearningBase = "0.13"
 ReinforcementLearningCore = "0.15"
-ReinforcementLearningEnvironments = "0.8"
+ReinforcementLearningEnvironments = "0.9"
 julia = "1.6"

 [extras]

docs/Project.toml

+1 −1
@@ -1,6 +1,6 @@
 [deps]
 ArcadeLearningEnvironment = "b7f77d8d-088d-5e02-8ac0-89aab2acc977"
-BSON = "fbb218c0-5317-5bc6-957e-2ee96dd4b1f0"
+JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
 CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
 DemoCards = "311a05b2-6137-4a5a-b473-18580a3d38b5"

docs/homepage/guide/index.md

+1 −1
@@ -85,7 +85,7 @@ Usually a closure or a functional object will be used to store some intermediate
 In most cases, you don't need to write a customized hook. Some generic hooks are provided so that you can inject logic at the appropriate time:

 - [`DoEveryNSteps`](https://juliareinforcementlearning.org/ReinforcementLearning.jl/latest/rl_core/#ReinforcementLearningCore.DoEveryNSteps)
-- [`DoEveryNEpisode`](https://juliareinforcementlearning.org/ReinforcementLearning.jl/latest/rl_core/#ReinforcementLearningCore.DoEveryNEpisode)
+- [`DoEveryNEpisodes`](https://juliareinforcementlearning.org/ReinforcementLearning.jl/latest/rl_core/#ReinforcementLearningCore.DoEveryNEpisodes)

 However, if you do need to write a customized hook, the following methods must be provided:

docs/src/How_to_implement_a_new_algorithm.md

+16 −17
@@ -10,43 +10,42 @@ function _run(policy::AbstractPolicy,
        stop_condition::AbstractStopCondition,
        hook::AbstractHook,
        reset_condition::AbstractResetCondition)
-
     push!(policy, PreExperimentStage(), env)
     is_stop = false
     while !is_stop
         reset!(env)
         push!(policy, PreEpisodeStage(), env)
         optimise!(policy, PreEpisodeStage())

-        while !reset_condition(policy, env) # one episode
+        while !check!(reset_condition, policy, env) # one episode
             push!(policy, PreActStage(), env)
             optimise!(policy, PreActStage())

-            RLBase.plan!(policy, env)
+            action = RLBase.plan!(policy, env)
             act!(env, action)

             push!(policy, PostActStage(), env, action)
             optimise!(policy, PostActStage())

-            if check_stop(stop_condition, policy, env)
+            if check!(stop_condition, policy, env)
                 is_stop = true
                 break
             end
         end # end of an episode

         push!(policy, PostEpisodeStage(), env)
         optimise!(policy, PostEpisodeStage())
+
     end
     push!(policy, PostExperimentStage(), env)
     hook
 end
-
 ```

 Implementing a new algorithm mainly consists of creating your own `AbstractPolicy` (or `AbstractLearner`, see [this section](#using-resources-from-rlcore)) subtype, its action sampling method (by overloading `RLBase.plan!(policy::YourPolicyType, env)`) and implementing its behavior at each stage. However, ReinforcementLearning.jl provides plenty of pre-implemented utilities that you should use to 1) have less code to write, 2) lower the chances of bugs, and 3) make your code more understandable and maintainable (if you intend to contribute your algorithm).
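For orientation, here is a minimal sketch of the kind of hand-rolled policy that paragraph describes. The `RandomSelectionPolicy` name and its fields are hypothetical, and the snippet assumes the default no-op stage methods that ReinforcementLearningCore provides for everything it does not overload.

```julia
using ReinforcementLearning
using Random

# Hypothetical toy policy: acts uniformly at random and counts finished episodes.
Base.@kwdef mutable struct RandomSelectionPolicy <: AbstractPolicy
    rng::AbstractRNG = Random.default_rng()
    n_episodes::Int = 0
end

# Action sampling, called as `action = RLBase.plan!(policy, env)` in the run loop above.
RLBase.plan!(p::RandomSelectionPolicy, env::AbstractEnv) = rand(p.rng, action_space(env))

# Stage behavior: only the stages you care about need a method.
Base.push!(p::RandomSelectionPolicy, ::PostEpisodeStage, ::AbstractEnv) = p.n_episodes += 1
```

A policy like this can be handed straight to `run` together with an environment, a stop condition, and a hook, exactly like the built-in policies.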

 ## Using Agents
-The recommended way is to use the policy wrapper `Agent`. An agent is itself an `AbstractPolicy` that wraps a policy and a trajectory (also called Experience Replay Buffer in RL literature). Agent comes with default implementations of `push!(agent, stage, env)` and `plan!(agent, env)` that will probably fit what you need at most stages so that you don't have to write them again. Looking at the [source code](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/agent.jl/), we can see that the default Agent calls are
+The recommended way is to use the policy wrapper `Agent`. An agent is itself an `AbstractPolicy` that wraps a policy and a trajectory (also called Experience Replay Buffer in reinforcement learning literature). Agent comes with default implementations of `push!(agent, stage, env)` and `plan!(agent, env)` that will probably fit what you need at most stages so that you don't have to write them again. Looking at the [source code](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/agent.jl/), we can see that the default Agent calls are

 ```julia
 function Base.push!(agent::Agent, ::PreEpisodeStage, env::AbstractEnv)
@@ -61,21 +60,21 @@ end

 The function `RLBase.plan!(agent::Agent, env::AbstractEnv)` is called at the `action = RLBase.plan!(policy, env)` line. It simply gets an action from the policy of the agent by calling the `RLBase.plan!(your_new_policy, env)` function. At the `PreEpisodeStage()`, the agent pushes the initial state to the trajectory. At the `PostActStage()`, the agent pushes the transition to the trajectory.

-If you need a different behavior at some stages, then you can overload the `Base.push!(Agent{<:YourPolicyType}, [stage,] env)` or `Base.push!(Agent{<:Any, <: YourTrajectoryType}, [stage,] env)`, or `Base.plan!`, depending on whether you have a custom policy or just a custom trajectory. For example, many algorithms (such as PPO) need to store an additional trace of the logpdf of the sampled actions and thus overload the function at the `PreActStage()`.
+If you need a different behavior at some stages, then you can overload the `Base.push!(Agent{<:YourPolicyType}, [stage,] env)` or `Base.push!(Agent{<:Any, <: YourTrajectoryType}, [stage,] env)`, or `Base.plan!`, depending on whether you have a custom policy or just a custom trajectory. For example, many algorithms (such as PPO) need to store an additional trace of the `logpdf` of the sampled actions and thus overload the function at the `PreActStage()`.

 ## Updating the policy

 Finally, you need to implement the learning function by implementing `RLBase.optimise!(::YourPolicyType, ::Stage, ::Trajectory)`. By default this does nothing at all stages. Overload it on the stage where you wish to optimise (most often, at `PostActStage()` or `PostEpisodeStage()`). This function should loop over the trajectory to sample batches. Inside the loop, put whatever is required. For example:

 ```julia
-function RLBase.optimise!(p::YourPolicyType, ::PostEpisodeStage, traj::Trajectory)
-    for batch in traj
-        optimise!(p, batch)
+function RLBase.optimise!(policy::YourPolicyType, ::PostEpisodeStage, trajectory::Trajectory)
+    for batch in trajectory
+        optimise!(policy, batch)
     end
 end

 ```
-where `optimise!(p, batch)` is a function that will typically compute the gradient and update a neural network, or update a tabular policy. What is inside the loop is free to be whatever you need but it's a good idea to implement a `optimise!(p::YourPolicyType, batch::NamedTuple)` function for clarity instead of coding everything in the loop. This is further discussed in the next section on `Trajectory`s. An example of where this could be different is when you want to update priorities, see [the PER learner](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningZoo/src/algorithms/dqns/prioritized_dqn.jl) for an example.
+where `optimise!(policy, batch)` is a function that will typically compute the gradient and update a neural network, or update a tabular policy. What is inside the loop is free to be whatever you need, but it's a good idea to implement an `optimise!(policy::YourPolicyType, batch::NamedTuple)` function for clarity instead of coding everything in the loop. This is further discussed in the next section on `Trajectory`s.
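To make that split concrete, here is a deliberately simple sketch of what the inner `optimise!(policy, batch)` could do for a tabular Q-learning style policy; `MyTabularPolicy`, its `table`, `γ`, and `α` fields, and the trace names are assumptions made for this sketch, not package API.

```julia
using ReinforcementLearning

# Illustrative only: one Q-learning update per transition in the sampled batch.
# Assumes `policy.table` is a Matrix of Q-values indexed as table[action, state].
function RLBase.optimise!(policy::MyTabularPolicy, batch::NamedTuple)
    s, a, r, t, s′ = batch.state, batch.action, batch.reward, batch.terminal, batch.next_state
    for i in eachindex(r)
        target = r[i] + (t[i] ? 0.0 : policy.γ * maximum(policy.table[:, s′[i]]))
        policy.table[a[i], s[i]] += policy.α * (target - policy.table[a[i], s[i]])
    end
end
```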

 ## ReinforcementLearningTrajectories

@@ -112,13 +111,13 @@ ReinforcementLearningTrajectories' design aims to eventually support distributed

 The sampler is the object that will fetch data from your trajectory to create the `batch` in the optimise for-loop. The simplest one is `BatchSampler{names}(batchsize, rng)`. `batchsize` is the number of elements to sample and `rng` is an optional argument that you may set to a custom rng for reproducibility. `names` is the set of traces the sampler must query. For example, a `BatchSampler{(:state, :action, :next_state)}(32)` will sample a named tuple `(state = [32 states], action = [32 actions], next_state = [the 32 states that immediately follow those in state])`.
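For instance, a sampler for a SARTS-style batch could be built as follows; the trace names must match whatever your trajectory actually stores, and the explicit rng is only needed if you want reproducible sampling.

```julia
using ReinforcementLearningTrajectories
using Random

# Draw 32 transitions per batch, querying the listed traces; the rng argument is optional.
rng = MersenneTwister(123)
sampler = BatchSampler{(:state, :action, :reward, :terminal, :next_state)}(32, rng)
```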

-## Using resources from RLCore
+## Using resources from ReinforcementLearningCore

-RL algorithms typically only differ partially but broadly use the same mechanisms. The subpackage RLCore contains some modules that you can reuse to implement your algorithm.
-These will take care of many aspects of training for you. See the [RLCore manual](./rlcore.md)
+RL algorithms typically differ only in parts but broadly use the same mechanisms. The subpackage ReinforcementLearningCore contains some modules that you can reuse to implement your algorithm.
+These will take care of many aspects of training for you. See the [ReinforcementLearningCore manual](./rlcore.md).

 ### Utils
-In utils/distributions.jl you will find implementations of gaussian log probabilities functions that are both GPU compatible and differentiable and that do not require the overhead of using Distributions.jl structs.
+In `utils/distributions.jl` you will find implementations of Gaussian log-probability functions that are both GPU compatible and differentiable and that do not require the overhead of using `Distributions.jl` structs.
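To give a feel for what such a helper looks like, here is a generic sketch of a broadcastable Gaussian log-density written as a plain function; the name and signature are illustrative, not the exact API shipped in `utils/distributions.jl`.

```julia
# Illustrative sketch: elementwise Gaussian log-density over plain arrays.
# Broadcasting keeps it differentiable and GPU friendly, with no Distributions.jl structs.
gaussian_logpdf(μ, σ, x) = @. -(x - μ)^2 / (2 * σ^2) - log(σ) - log(2π) / 2
```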

 ## Conventions
 Finally, there are a few "conventions" and good practices that you should follow, especially if you intend to contribute to this package (don't worry, we'll be happy to help if needed).
@@ -127,9 +126,9 @@ Finally, there are a few "conventions" and good practices that you should follow
 ReinforcementLearning.jl aims to provide a framework for reproducible experiments. To do so, make sure that your policy type has a `rng` field and that all random operations (e.g. action sampling) use `rand(your_policy.rng, args...)`. For trajectory sampling, you can set the sampler's rng to that of the policy when creating an agent, or simply instantiate its own rng.
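In practice that can look like the sketch below, which reuses the toy `RandomSelectionPolicy` from the earlier sketch and shares its seeded rng with a batch sampler so an experiment can be replayed deterministically (assuming `BatchSampler` from ReinforcementLearningTrajectories is available).

```julia
using ReinforcementLearning
using ReinforcementLearningTrajectories: BatchSampler
using Random

# One seeded rng drives both action sampling and batch sampling.
rng = MersenneTwister(2025)
policy = RandomSelectionPolicy(; rng = rng)
sampler = BatchSampler{(:state, :action, :reward, :terminal, :next_state)}(32, rng)

env = RandomWalk1D()
action = RLBase.plan!(policy, env)   # deterministic given the seed above
```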

 ### GPU compatibility
-Deep RL algorithms are often much faster when the neural nets are updated on a GPU. For now, we only support CUDA.jl as a backend. This means that you will have to think about the transfer of data between the CPU (where the trajectory is) and the GPU memory (where the neural nets are). To do so you will find in utils/device.jl some functions that do most of the work for you. The ones that you need to know are `send_to_device(device, data)` that sends data to the specified device, `send_to_host(data)` which sends data to the CPU memory (it fallbacks to `send_to_device(Val{:cpu}, data)`) and `device(x)` that returns the device on which `x` is.
+Deep RL algorithms are often much faster when the neural nets are updated on a GPU. This means that you will have to think about the transfer of data between the CPU (where the trajectory is) and the GPU memory (where the neural nets are). `Flux.jl` offers `gpu` and `cpu` functions to make it easier to send data back and forth.
 Normally, you should be able to write a single implementation of your algorithm that works on CPU and GPUs thanks to the multiple dispatch offered by Julia.

-GPU friendlyness will also require that your code does not use _scalar indexing_ (see the CUDA.jl documentation for more information), make sure to test your algorithm on the GPU after disallowing scalar indexing by using `CUDA.allowscalar(false)`.
+GPU friendliness will also require that your code does not use _scalar indexing_ (see the `CUDA.jl` or `Metal.jl` documentation for more information); when using `CUDA.jl`, make sure to test your algorithm on the GPU after disallowing scalar indexing by using `CUDA.allowscalar(false)`.

 Finally, it is a good idea to implement the `Flux.gpu(yourpolicy)` and `cpu(yourpolicy)` functions, for user convenience. Be careful that sampling on the GPU requires a specific type of rng; you can generate one with `CUDA.default_rng()`.
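A minimal sketch of that device round trip with Flux (the layer sizes are arbitrary, and `gpu` typically falls back to a no-op when no working GPU backend is loaded):

```julia
using Flux

# Move a policy's network to the GPU for training, then back to the CPU (e.g. before saving).
model = Chain(Dense(4 => 32, relu), Dense(32 => 2))
model_gpu = gpu(model)
model_cpu = cpu(model_gpu)
```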

docs/src/How_to_use_hooks.md

+40 −73
@@ -8,10 +8,12 @@ programming. We write the code in a loop and execute them step by step.

 ```julia
 while true
-    env |> policy |> env
+    action = plan!(policy, env)
+    act!(env, action)
+
     # write your own logic here
     # like saving parameters, recording loss function, evaluating policy, etc.
-    stop_condition(env, policy) && break
+    check!(stop_condition, env, policy) && break
     is_terminated(env) && reset!(env)
 end
 ```
@@ -30,18 +32,19 @@ execution pipeline. However, we believe this is not necessary in Julia. With the
 declarative programming approach, we gain much more flexibility.

 Now the question is how to design the hook. A natural choice is to wrap the
-comments part in the above pseudocode into a function:
+commented part in the above pseudo-code into a function:

 ```julia
 while true
-    env |> policy |> env
-    hook(policy, env)
-    stop_condition(env, policy) && break
+    action = plan!(policy, env)
+    act!(env, action)
+    push!(hook, policy, env)
+    check!(stop_condition, env, policy) && break
     is_terminated(env) && reset!(env)
 end
 ```

-But sometimes, we'd like to have a more fingrained control. So we split the calling
+But sometimes, we'd like to have more fine-grained control. So we split the calling
 of hooks into several different stages:

 - [`PreExperimentStage`](@ref)
@@ -54,20 +57,22 @@ of hooks into several different stages:
 ## How to define a customized hook?

 By default, an instance of [`AbstractHook`](@ref) will do nothing when called
-with `(hook::AbstractHook)(::AbstractStage, policy, env)`. So when writing a
+with `push!(hook::AbstractHook, ::AbstractStage, policy, env)`. So when writing a
 customized hook, you only need to implement the necessary runtime logic.

 For example, assume we want to record the wall time of each episode.

 ```@repl how_to_use_hooks
 using ReinforcementLearning
+import Base.push!
 Base.@kwdef mutable struct TimeCostPerEpisode <: AbstractHook
     t::UInt64 = time_ns()
     time_costs::Vector{UInt64} = []
 end
-(h::TimeCostPerEpisode)(::PreEpisodeStage, policy, env) = h.t = time_ns()
-(h::TimeCostPerEpisode)(::PostEpisodeStage, policy, env) = push!(h.time_costs, time_ns()-h.t)
+Base.push!(h::TimeCostPerEpisode, ::PreEpisodeStage, policy, env) = h.t = time_ns()
+Base.push!(h::TimeCostPerEpisode, ::PostEpisodeStage, policy, env) = push!(h.time_costs, time_ns()-h.t)
 h = TimeCostPerEpisode()
+
 run(RandomPolicy(), CartPoleEnv(), StopAfterNEpisodes(10), h)
 h.time_costs
 ```
@@ -77,14 +82,13 @@ h.time_costs
 - [`StepsPerEpisode`](@ref)
 - [`RewardsPerEpisode`](@ref)
 - [`TotalRewardPerEpisode`](@ref)
-- [`TotalBatchRewardPerEpisode`](@ref)

 ## Periodic jobs

 Sometimes, we'd like to periodically run some functions. Two handy hooks are
 provided for this kind of task:

-- [`DoEveryNEpisode`](@ref)
+- [`DoEveryNEpisodes`](@ref)
 - [`DoEveryNSteps`](@ref)

 Following are some typical usages.
@@ -98,7 +102,7 @@ run(
     policy,
     CartPoleEnv(),
     StopAfterNEpisodes(100),
-    DoEveryNEpisode(;n=10) do t, policy, env
+    DoEveryNEpisodes(;n=10) do t, policy, env
        # In real world cases, the policy is usually wrapped in an Agent,
        # we need to extract the inner policy to run it in the *actor* mode.
        # Here for illustration only, we simply use the original policy.
@@ -117,40 +121,33 @@ run(

 ### Save parameters

-[BSON.jl](https://github.com/JuliaIO/BSON.jl) is recommended to save the parameters of a policy.
+[JLD2.jl](https://github.com/JuliaIO/JLD2.jl) is recommended to save the parameters of a policy.

 ```@repl how_to_use_hooks
-using Flux
-using Flux.Losses: huber_loss
-using BSON
+using ReinforcementLearning
+using JLD2

-env = CartPoleEnv(; T = Float32)
-ns, na = length(state(env)), length(action_space(env))
+env = RandomWalk1D()
+ns, na = length(state_space(env)), length(action_space(env))

 policy = Agent(
-    policy = QBasedPolicy(
-        learner = BasicDQNLearner(
-            approximator = NeuralNetworkApproximator(
-                model = Chain(
-                    Dense(ns, 128, relu; init = glorot_uniform),
-                    Dense(128, 128, relu; init = glorot_uniform),
-                    Dense(128, na; init = glorot_uniform),
-                ) |> cpu,
-                optimizer = Adam(),
-            ),
-            batchsize = 32,
-            min_replay_history = 100,
-            loss_func = huber_loss,
-        ),
-        explorer = EpsilonGreedyExplorer(
-            kind = :exp,
-            ϵ_stable = 0.01,
-            decay_steps = 500,
+    QBasedPolicy(;
+        learner = TDLearner(
+            TabularQApproximator(n_state = ns, n_action = na),
+            :SARS;
         ),
+        explorer = EpsilonGreedyExplorer(ϵ_stable=0.01),
     ),
-    trajectory = CircularArraySARTTrajectory(
-        capacity = 1000,
-        state = Vector{Float32} => (ns,),
+    Trajectory(
+        CircularArraySARTSTraces(;
+            capacity = 1,
+            state = Int64 => (),
+            action = Int64 => (),
+            reward = Float64 => (),
+            terminal = Bool => (),
+        ),
+        DummySampler(),
+        InsertSampleRatioController(),
     ),
 )

@@ -161,40 +158,10 @@ run(
     env,
     StopAfterNSteps(10_000),
     DoEveryNSteps(n=1_000) do t, p, e
-        ps = params(p)
-        f = joinpath(parameters_dir, "parameters_at_step_$t.bson")
-        BSON.@save f ps
+        ps = policy.policy.learner.approximator
+        f = joinpath(parameters_dir, "parameters_at_step_$t.jld2")
+        JLD2.@save f ps
         println("parameters at step $t saved to $f")
     end
 )
 ```
-
-### Logging data
-
-Below we demonstrate how to use
-[TensorBoardLogger.jl](https://github.com/PhilipVinc/TensorBoardLogger.jl) to
-log runtime metrics. But users could also other tools like
-[wandb](https://wandb.ai/site) through
-[PyCall.jl](https://github.com/JuliaPy/PyCall.jl).
-
-
-```@repl how_to_use_hooks
-using TensorBoardLogger
-using Logging
-tf_log_dir = "logs"
-lg = TBLogger(tf_log_dir, min_level = Logging.Info)
-total_reward_per_episode = TotalRewardPerEpisode()
-hook = ComposedHook(
-    total_reward_per_episode,
-    DoEveryNEpisode() do t, agent, env
-        with_logger(lg) do
-            @info "training" reward = total_reward_per_episode.rewards[end]
-        end
-    end
-)
-run(RandomPolicy(), CartPoleEnv(), StopAfterNEpisodes(50), hook)
-readdir(tf_log_dir)
-```
-
-Then run `tensorboard --logdir logs` and open the link on the screen in your
-browser. (Obviously you need to install tensorboard first.)
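As a small follow-up to the new `JLD2.@save` call above: reloading a checkpoint is symmetric. The file name below is illustrative, assuming the same `parameters_dir` and the `parameters_at_step_$t.jld2` naming used in the example.

```julia
using JLD2

# Load one of the saved checkpoints back into `ps`.
f = joinpath(parameters_dir, "parameters_at_step_1000.jld2")
JLD2.@load f ps
```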
