How is the stochastic iteration $q_{k+1} = q_k - \alpha_k e(q_k)$ derived? How is it applied to TD-learning? How are the validity conditions on $e(q)$ satisfied? Ref: questions on Moodle [1](https://moodle-app2.let.ethz.ch/mod/forum/discuss.php?d=127852), [2](https://moodle-app2.let.ethz.ch/mod/forum/discuss.php?d=127886)
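To make the question concrete, here is a minimal sketch of the iteration applied to tabular TD(0) policy evaluation. It rests on assumptions not stated above: that $e(q_k)$ is a noisy sample of the residual $q - T(q)$ (the current estimate minus a sampled one-step Bellman target), and that the step sizes $\alpha_k$ satisfy the usual Robbins-Monro conditions $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$. The two-state chain, the function name `td0_policy_evaluation`, and the step-size exponent are all hypothetical choices for illustration, not the course's setup.

```python
import random


def td0_policy_evaluation(P, r, gamma, num_steps, seed=0):
    """Run q <- q - alpha * e(q) along one sampled trajectory (tabular TD(0)).

    P[s] is the transition distribution out of state s, r[s] the reward
    collected in state s. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_states = len(r)
    q = [0.0] * n_states          # value estimate, one entry per state
    visits = [0] * n_states       # per-state visit counts drive the step size
    s = 0
    for _ in range(num_steps):
        # Sample the next state from row s of the transition matrix.
        s_next = rng.choices(range(n_states), weights=P[s])[0]
        visits[s] += 1
        # alpha = n^(-0.7): the sum diverges, the sum of squares converges.
        alpha = visits[s] ** -0.7
        # Noisy residual: current estimate minus the sampled Bellman target.
        e = q[s] - (r[s] + gamma * q[s_next])
        q[s] -= alpha * e
        s = s_next
    return q


# Hypothetical two-state Markov reward process.
P = [[0.9, 0.1], [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9
q_hat = td0_policy_evaluation(P, r, gamma, num_steps=200_000)
print(q_hat)
```

For this chain the exact values solve $(I - \gamma P)\,v = r$, so the printed estimate can be compared against that linear solve. Only the visited entry of `q` is updated at each step, which is what makes this the asynchronous (sample-based) form of the iteration.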