Learning is a reinforcement-learning problem.
- We measure our states: current weights, current gradients, maybe histories, from sampled minibatches.
- We evaluate our policy based on the states: the policy network. We get \Delta w for the value network.
- We sample the reward (i.e. loss) while using w + \Delta w as the weights of the value network.
- We do gradient ascent on the policy network weights.
- We apply the \Delta w to the value network weights.