We propose a simple implementation of A2C on a single CPU with an MLP policy.
- When first executing the agent on the workspace, states from `t=0` to `t=n_timesteps-1` are computed.
- When executing the agent a second time, states from `t=n_timesteps` to `t=n_timesteps+n_timesteps-1` are computed.
- There is thus a missing transition between `n_timesteps-1` and `n_timesteps` that never appears in a single workspace.
- To avoid this effect, we:
  - Copy the last state of the workspace to the first position through `workspace.copy_n_last_steps(1)`
  - Then execute the agent from `timestep=1` in the workspace
- The resulting workspace now contains states from `n_timesteps-1` to `n_timesteps+n_timesteps-2`, and the transition is not missing anymore (see the sketch below).
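Concretely, this gives an acquisition loop of the following shape. This is a minimal sketch that reuses the snippet shown further below in the step-by-step walkthrough; `cfg.algorithm.max_epochs` is an assumed hyperparameter name, not necessarily the one used in `main.py`:

```python
# Sketch of the acquisition loop: the first epoch fills the whole workspace,
# subsequent epochs keep the last step and re-acquire from t=1 so that no
# transition is lost between two consecutive workspaces.
for epoch in range(cfg.algorithm.max_epochs):  # `max_epochs` is an assumed name
    if epoch > 0:
        workspace.copy_n_last_steps(1)
        agent(workspace, t=1, n_steps=cfg.algorithm.n_timesteps - 1, stochastic=True)
    else:
        agent(workspace, t=0, n_steps=cfg.algorithm.n_timesteps, stochastic=True)
    # ... compute the loss and update the model here ...
```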
The example can be launched with:

```
PYTHONPATH=salina python salina/salina_examples/rl/a2c/mono_cpu/main.py
```
We first write an `Agent` which will read an observation (the observation will be generated by an `AutoResetGymAgent` that models an environment) and will write `action`, `action_probs` and `critic` at time `t` in the `Workspace`:
```python
import torch
import torch.nn as nn

from salina import TAgent


class A2CAgent(TAgent):
    def __init__(self, observation_size, hidden_size, n_actions):
        super().__init__()
        # Policy head: maps an observation to one score per action
        self.model = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )
        # Critic head: maps an observation to a scalar state value
        self.critic_model = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, t, stochastic, **kwargs):
        # Read the observation written by the environment agent at time t
        observation = self.get(("env/env_obs", t))
        scores = self.model(observation)
        probs = torch.softmax(scores, dim=-1)
        critic = self.critic_model(observation).squeeze(-1)
        if stochastic:
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = probs.argmax(1)
        # Write the outputs at time t in the workspace
        self.set(("action", t), action)
        self.set(("action_probs", t), probs)
        self.set(("critic", t), critic)
```
This agent also has additional `forward` arguments that will allow us to control how to execute it (e.g. `stochastic` or deterministic mode, ...).
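For instance, assuming a workspace that already contains an observation at timestep 0, the agent could be executed in deterministic mode at that single timestep as in the hypothetical call below (not taken from `main.py`):

```python
# Hypothetical single-step call: reads ("env/env_obs", 0) from the workspace
# and writes ("action", 0), ("action_probs", 0) and ("critic", 0).
a2c_agent(workspace, t=0, stochastic=False)
```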
- Environment Agent: The first agent to create is the agent that models the environment. In the A2C case, this agent will automatically reset the environments when reaching a final state:
```python
env_agent = AutoResetGymAgent(
    get_class(cfg.algorithm.env),
    get_arguments(cfg.algorithm.env),
    n_envs=cfg.algorithm.n_envs,
)
```
Note that this agent takes the function/class name + function/class arguments as arguments and will construct the environment by itself (this is needed for parallelization).
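For illustration only, a direct instantiation (without going through the configuration) could look like the sketch below; the `make_env` helper, the CartPole settings, `n_envs=4`, and the `salina.agents.gyma` import path are assumptions, not code from `main.py`:

```python
import gym
from salina.agents.gyma import AutoResetGymAgent  # assumed import path


def make_env(env_name, max_episode_steps):
    # Hypothetical environment constructor passed to the agent
    return gym.wrappers.TimeLimit(gym.make(env_name), max_episode_steps=max_episode_steps)


env_agent = AutoResetGymAgent(
    make_env,
    {"env_name": "CartPole-v0", "max_episode_steps": 100},
    n_envs=4,
)
```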
- Agent at time t: Given the environment agent and the policy agent, we can compose them to obtain an agent that will produce, at time `t`, observations, reward, etc. and also `action`, `action_probs` and `critic`:

```python
agent = Agents(env_agent, a2c_agent)
```
- Complete acquisition agent: While the previous agent acts at time `t`, we can obtain an agent that will act over a full `Workspace`:

```python
agent = TemporalAgent(agent)
```
- Defining a workspace: Now we can define the workspace on which our agents will be applied:

```python
workspace = salina.Workspace()
```

Once it is done, the acquisition of a trajectory can be done by just executing `agent(workspace)`.
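As a quick sanity check (a sketch only; the call with explicit `t`/`n_steps` arguments is detailed in the next step, and the `print` is purely illustrative), one can acquire a workspace and read back a variable written by the agents:

```python
# Acquire one workspace, then read back variables written by the agents.
agent(workspace, t=0, n_steps=cfg.algorithm.n_timesteps, stochastic=True)
critic, done = workspace["critic", "env/done"]
print(critic.size())  # time_size x batch_size
```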
- Executing the agent at each epoch: When executing the agent over a workspace, it will be executed over timesteps `0` to `time_size-1`. Then, at the next execution, the states from `t=time_size` to `t=time_size+time_size-1` will be acquired, etc. It means that some transitions (here the one between `time_size-1` and `time_size`) will not appear in the workspace since they are split between two different workspaces. To avoid this border effect, `salina` allows one to do the following:
```python
if epoch > 0:
    workspace.copy_n_last_steps(1)
    agent(workspace, t=1, n_steps=cfg.algorithm.n_timesteps - 1, stochastic=True)
else:
    agent(workspace, t=0, n_steps=cfg.algorithm.n_timesteps, stochastic=True)
```
- Loss computation: To compute the loss, one can get from a workspace the tensors generated by the agents:
```python
critic, done, action_probs, reward, action = workspace[
    "critic", "env/done", "action_probs", "env/reward", "action"
]
```
Each tensor is of size `time_size x batch_size x ...`. This makes the loss computation, and thus the implementation of any RL algorithm, quite easy (see the sketch below).
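As an illustration, the A2C losses could be computed from these tensors roughly as follows. This is a hedged sketch: the discount factor and loss coefficients (`cfg.algorithm.discount_factor`, `a2c_coef`, `critic_coef`, `entropy_coef`) are assumed hyperparameter names that may differ from the actual configuration of `main.py`:

```python
# Sketch only: temporal-difference critic loss, policy-gradient loss and an
# entropy bonus, all computed from the time x batch tensors read above.
target = reward[1:] + cfg.algorithm.discount_factor * critic[1:].detach() * (
    1 - done[1:].float()
)
td = target - critic[:-1]
critic_loss = (td ** 2).mean()

# Probability of the actions that were actually taken (time x batch)
taken_probs = action_probs[:-1].gather(-1, action[:-1].unsqueeze(-1)).squeeze(-1)
a2c_loss = -(torch.log(taken_probs) * td.detach()).mean()

entropy_loss = torch.distributions.Categorical(action_probs).entropy().mean()

loss = (
    cfg.algorithm.critic_coef * critic_loss
    + cfg.algorithm.a2c_coef * a2c_loss
    - cfg.algorithm.entropy_coef * entropy_loss
)
```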
- In the next example, we show how modularity can be used to easily define complex agents without rewriting the base learning algorithm
- In (here), we show that any agent can be parallelized over multiple processes
- In (here), we show how to use a GPU to speed up computation
- ...