[^1]: Watkins, C. J. C. H. (1989). _Learning from delayed rewards_. PhD thesis, King's College, Cambridge. — The original proposal of the Q-Learning algorithm; first defined the action-value function $Q(s,a)$ and gave an algorithmic framework for learning Q-values through trial and error.

[^5]: Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. _Machine Learning_, 8(3), 279–292. https://doi.org/10.1007/BF00992698 — The convergence proof for Q-Learning, showing that under certain conditions Q-Learning is guaranteed to converge to the optimal action values $Q^*$.

[^2]: Rummery, G. A., & Niranjan, M. (1994). _On-line Q-learning using connectionist systems_. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department. — The original SARSA paper, proposing an on-policy variant of Q-Learning.

[^3]: Sutton, R. S., & Barto, A. G. (2018). _Reinforcement Learning: An Introduction_ (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book.html — The classic reinforcement learning textbook; systematically introduces the GridWorld environment and gives detailed derivations of Q-Learning and SARSA.