Sample-Based Learning and Search with Permanent and Transient Memories
Course URL: http://videolectures.net/icml08_silver_sbl/
Lecturer: David Silver
Institution: University College London
Date: 2008-08-12
Language: English
Course description: We present a reinforcement learning architecture, Dyna-2, that encompasses both sample-based learning and sample-based search, and that generalises across states during both learning and search. We apply Dyna-2 to high-performance Computer Go. In this domain the most successful planning methods are based on sample-based search algorithms, such as UCT, in which states are treated individually, and the most successful learning methods are based on temporal-difference learning algorithms, such as Sarsa, in which linear function approximation is used. In both cases, an estimate of the value function is formed, but in the first case it is transient, computed and then discarded after each move, whereas in the second case it is more permanent, slowly accumulating over many moves and games. The idea of Dyna-2 is for the transient planning memory and the permanent learning memory to remain separate, but for both to be based on linear function approximation and both to be updated by Sarsa. To apply Dyna-2 to 9x9 Computer Go, we use a million binary features in the function approximator, based on templates matching small fragments of the board. Using only the transient memory, Dyna-2 performed at least as well as UCT. Using both memories combined, it significantly outperformed UCT. Our program based on Dyna-2 achieved a higher rating on the Computer Go Online Server than any handcrafted or traditional search-based program.
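The two-memory idea in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes sparse binary features given as lists of active indices, a plain Sarsa(0) update with a fixed step size, and a combined estimate formed by adding the transient weights to the permanent ones. The class name `TwoMemoryValue`, its method names, and the parameters `alpha` and `gamma` are hypothetical, and when the transient memory is cleared is left to the caller.

```python
from typing import List


class TwoMemoryValue:
    """Linear value estimate split into permanent and transient weights (illustrative)."""

    def __init__(self, n_features: int, alpha: float = 0.1, gamma: float = 1.0):
        self.theta_perm = [0.0] * n_features   # slowly accumulated over real games
        self.theta_trans = [0.0] * n_features  # local correction learned during search
        self.alpha = alpha
        self.gamma = gamma

    def q_perm(self, active: List[int]) -> float:
        # Binary features: the value is the sum of weights of the active features.
        return sum(self.theta_perm[i] for i in active)

    def q_combined(self, active: List[int]) -> float:
        # Assumption: search uses permanent plus transient weights.
        return self.q_perm(active) + sum(self.theta_trans[i] for i in active)

    def update_permanent(self, active, reward, next_active, terminal):
        # Sarsa(0) step on real experience; only the permanent weights change.
        nxt = 0.0 if terminal else self.gamma * self.q_perm(next_active)
        delta = reward + nxt - self.q_perm(active)
        for i in active:
            self.theta_perm[i] += self.alpha * delta

    def update_transient(self, active, reward, next_active, terminal):
        # Sarsa(0) step on simulated experience; only the transient weights change.
        nxt = 0.0 if terminal else self.gamma * self.q_combined(next_active)
        delta = reward + nxt - self.q_combined(active)
        for i in active:
            self.theta_trans[i] += self.alpha * delta

    def clear_transient(self):
        # Discard the search-time estimate; the permanent memory is kept.
        self.theta_trans = [0.0] * len(self.theta_trans)


if __name__ == "__main__":
    v = TwoMemoryValue(n_features=4)
    v.update_permanent([0, 1], reward=1.0, next_active=[], terminal=True)
    v.update_transient([0, 1], reward=1.0, next_active=[], terminal=True)
    print(v.q_perm([0, 1]), v.q_combined([0, 1]))
```

With a very large feature set such as the one described, a practical version would likely scale the step size by the number of active features; that refinement is omitted here to keep the sketch short.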
Keywords: reinforcement learning architecture; computer; search algorithms
Source: VideoLectures.NET
Last reviewed: 2019-04-21 (lxf)
Views: 57