Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation
Course URL: http://videolectures.net/icml09_sutton_fgdm/
Lecturer: Richard S. Sutton
Institution: University of Alberta
Date: 2009-09-17
Language: English
Course description: Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
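
The description above characterizes GTD2 and TDC only in words. As a rough illustration, a minimal NumPy sketch of the two per-step updates from the paper might look as follows; the function names, step-size defaults, and parameter values here are illustrative assumptions, not taken from the lecture.

    import numpy as np

    def gtd2_update(theta, w, phi, phi_next, reward,
                    gamma=0.99, alpha=0.01, beta=0.05):
        """One GTD2 step: gradient descent on a projected-error objective."""
        # Conventional linear-TD error.
        delta = reward + gamma * theta @ phi_next - theta @ phi
        theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
        # Secondary weights w track the expected TD error per feature.
        w = w + beta * (delta - phi @ w) * phi
        return theta, w

    def tdc_update(theta, w, phi, phi_next, reward,
                   gamma=0.99, alpha=0.01, beta=0.05):
        """One TDC step: conventional TD plus a gradient-correction term."""
        delta = reward + gamma * theta @ phi_next - theta @ phi
        # Identical to conventional linear TD except for the correction
        # term, which is zero as long as w is zero (w starts at zero).
        theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
        w = w + beta * (delta - phi @ w) * phi
        return theta, w

Both updates touch each feature only a constant number of times per step, which is consistent with the description's claim that TDC roughly doubles the computational requirements of conventional TD.
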
Keywords: linear function approximation; temporal-difference learning algorithms; convergence rate
Source: VideoLectures.NET
Last reviewed: 2019-04-24:lxf
Views: 83