rl_to_learn

Learning is a reinforcement-learning problem.

We measure the state from sampled minibatches: the current weights, the current gradients, and possibly their histories.
We evaluate the policy network on that state. Its output is \Delta w, a proposed update to the value network's weights.
We sample the reward (i.e. the negative of the loss, so that a lower loss means a higher reward) with w + \Delta w as the value network's weights.
We do gradient ascent on the policy network's weights to increase that reward.
We apply \Delta w to the value network's weights.
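
The five steps above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the code in this repository: the "value network" is assumed to be a linear regression model, the "policy network" is assumed to be a Gaussian policy with a single parameter `theta` (a log step size applied to the gradient), and the policy update is a plain REINFORCE step that uses the loss at the current weights as a baseline. All names (`loss_and_grad`, `train`, `sigma`, `policy_lr`) and the clipping range for `theta` are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the value network: linear regression with weights w.
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)


def loss_and_grad(w, xb, yb):
    """Mean squared error and its gradient with respect to w on one minibatch."""
    err = xb @ w - yb
    return np.mean(err ** 2), 2.0 * xb.T @ err / len(yb)


def train(steps=300, batch=32, sigma=0.02, policy_lr=0.01):
    w = np.zeros(3)          # value network weights
    theta = np.log(0.05)     # policy parameter: log of a step size (assumption)
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        xb, yb = X[idx], y[idx]

        # 1. Measure the state: here just the current minibatch gradient.
        base_loss, grad = loss_and_grad(w, xb, yb)

        # 2. Evaluate the policy: sample delta_w ~ N(mu, sigma^2 I).
        mu = -np.exp(theta) * grad
        delta_w = mu + sigma * rng.normal(size=w.shape)

        # 3. Sample the reward (negative loss) at w + delta_w.
        #    Subtracting the loss at w acts as a state-dependent baseline.
        new_loss, _ = loss_and_grad(w + delta_w, xb, yb)
        advantage = base_loss - new_loss

        # 4. Gradient ascent on the policy parameter (REINFORCE).
        #    Since d mu / d theta = mu, d log pi / d theta = (delta_w - mu) . mu / sigma^2.
        score = (delta_w - mu) @ mu / sigma ** 2
        theta = np.clip(theta + policy_lr * advantage * score,
                        np.log(0.01), np.log(0.3))  # keep the toy numerically stable

        # 5. Apply delta_w to the value network weights.
        w = w + delta_w
    return w, theta


initial_loss, _ = loss_and_grad(np.zeros(3), X, y)
w, theta = train()
final_loss, _ = loss_and_grad(w, X, y)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, learned step size: {np.exp(theta):.3f}")
```

The single-sample REINFORCE estimate is noisy, so this sketch should not be read as a claim about how quickly such a policy learns a good update rule. A real policy network would take a richer state (weights, gradient histories) and output \Delta w directly.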
