The Fixed Points of Off-Policy TD
Course URL: http://videolectures.net/nips2011_kolter_fixedpoints/
Lecturer: J. Zico Kolter
Institution: Carnegie Mellon University
Date: 2012-09-06
Language: English
Course description: Off-policy learning, the ability of an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees on the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods.
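For context on the "TD fixed point" and the on-policy guarantee the description refers to, the following is a brief sketch using the standard linear function approximation setup from the TD literature; the notation (features Φ, weighting matrix D, Bellman operator T^π) is a common convention assumed here, not taken from the lecture itself.

With linear value estimates V_θ = Φθ and a state-weighting distribution d (D = diag(d)), the TD(0) fixed point θ* satisfies

    \Phi^\top D \,(R + \gamma P^\pi \Phi \theta^* - \Phi \theta^*) = 0,
    \qquad \text{equivalently} \qquad
    \Phi \theta^* = \Pi_D T^\pi \Phi \theta^*,

where \Pi_D is the D-weighted projection onto the span of Φ and T^\pi is the Bellman operator of the evaluated policy π. In the on-policy case (d equal to the stationary distribution of π), \Pi_D T^\pi is a contraction in \|\cdot\|_D and the classical bound

    \| \Phi \theta^* - V^\pi \|_D \;\le\; \frac{1}{\sqrt{1 - \gamma^2}} \, \| \Pi_D V^\pi - V^\pi \|_D

holds. For an arbitrary off-policy d, \Pi_D T^\pi need not be a contraction, so the fixed point may fail to exist or may have arbitrarily large error; restricting d to a convex subset of distributions that preserves a contraction-like property is what allows the talk to recover a bound of this form under off-policy sampling.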
Keywords: Computer Science; Machine Learning; Reinforcement Learning
Source: VideoLectures.NET
Last reviewed: 2020-04-09 (cjy)
Views: 43