

On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Course URL: http://videolectures.net/nips2010_tang_cbi/
Lecturer: Jie Tang
Institution: University of California, Berkeley
Date: 2011-03-25
Language: English
Course description: Likelihood ratio policy gradient methods have been among the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience: (a) they use past experience only to estimate the gradient of the expected return at the current policy parameterization, rather than to obtain a more complete estimate of the expected return, and (b) they use only past experience gathered under the current policy, rather than all past experience, to improve these estimates. We present a new policy search method that leverages both of these observations as well as generalized baselines, a new technique that generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.
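The following is a brief sketch, in standard policy gradient notation (not taken from the lecture itself), of the importance sampling view the description refers to. Here \(\tau\) denotes a trajectory, \(R(\tau)\) its return, \(p_\theta(\tau)\) the trajectory distribution induced by policy parameters \(\theta\), and \(\theta_{\mathrm{old}}\) the parameters under which the data were collected:

\[
U(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]
\;\approx\; \frac{1}{m}\sum_{i=1}^{m} \frac{p_\theta(\tau^{(i)})}{p_{\theta_{\mathrm{old}}}(\tau^{(i)})}\, R(\tau^{(i)}),
\qquad \tau^{(i)} \sim p_{\theta_{\mathrm{old}}}.
\]

Differentiating this importance-sampled estimate and evaluating at \(\theta = \theta_{\mathrm{old}}\) gives

\[
\nabla_\theta U(\theta)\big|_{\theta=\theta_{\mathrm{old}}}
\;\approx\; \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log p_\theta(\tau^{(i)})\big|_{\theta=\theta_{\mathrm{old}}}\, R(\tau^{(i)})
= \frac{1}{m}\sum_{i=1}^{m} \Big(\sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)}\mid s_t^{(i)}\big)\Big) R(\tau^{(i)}),
\]

since the dynamics terms in \(\log p_\theta(\tau)\) do not depend on \(\theta\). This is the standard likelihood ratio (REINFORCE) gradient; a baseline \(b\) may be subtracted from \(R(\tau)\) without introducing bias because \(\mathbb{E}[\nabla_\theta \log p_\theta(\tau)] = 0\). Using the importance-sampled estimate of \(U(\theta)\) at other parameter values, rather than only its gradient at \(\theta_{\mathrm{old}}\), is the sense in which likelihood ratio methods are said to under-use past experience.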
Keywords: likelihood ratio policy gradient methods; expected return; baseline techniques
Source: VideoLectures.NET
Last reviewed: 2021-01-28 (nkq)
Views: 4