


Learning About Sensorimotor Data
Course URL: http://videolectures.net/nips2011_sutton_data/
Lecturer: Richard S. Sutton
Institution: University of Alberta
Date: 2012-01-25
Language: English
Course description: Temporal-difference (TD) learning of reward predictions underlies both reinforcement-learning algorithms and the standard dopamine model of reward-based learning in the brain. This confluence of computational and neuroscientific ideas is perhaps the most successful since the Hebb synapse. Can it be extended beyond reward? The brain certainly predicts many things other than reward, such as in a forward model of the consequences of various ways of behaving, and TD methods can be used to make these predictions. The idea and advantages of using TD methods to learn large numbers of predictions about many states and stimuli, in parallel, have been apparent since the 1990s, but technical issues have prevented this vision from being practically implemented...until now. A key breakthrough was the development of a new family of gradient-TD methods, introduced at NIPS in 2008 (by Maei, Szepesvari, and myself). Using these methods, and other ideas, we are now able to learn thousands of non-reward predictions in real time at 10 Hz from a single sensorimotor data stream from a physical robot. These predictions are temporally extended (ranging up to tens of seconds of anticipation), goal-oriented, and policy-contingent. The new algorithms enable learning to be off-policy and in parallel, resulting in dramatic increases in the amount that can be learned in a given amount of time. Our effective learning rate scales linearly with computational resources. On a consumer laptop we can learn thousands of predictions in real time. On a larger computer, or on a comparable laptop in a few years, the same methods could learn millions of meaningful predictions about different alternate ways of behaving. These predictions in aggregate constitute a rich, detailed model of the world that can support planning methods such as approximate dynamic programming.
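The gradient-TD methods referred to above keep a second set of weights that corrects the gradient, which is what allows TD prediction to remain stable when learning off-policy with linear function approximation. The following is a minimal sketch, assuming a TDC-form gradient-TD update and linear features; the class name, step sizes, and the random features and cumulants standing in for the robot's sensorimotor stream are illustrative assumptions, not the configuration used in the lecture.

```python
import numpy as np

# Minimal sketch, assuming a TDC-style gradient-TD update with linear features.
# All names and parameter values are illustrative; the random features and
# cumulants merely stand in for a robot's sensorimotor stream.

class GradientTDPredictor:
    """One linear prediction learned off-policy with a gradient-TD (TDC-form) update."""

    def __init__(self, n_features, alpha=0.1, beta=0.01, gamma=0.95):
        self.theta = np.zeros(n_features)  # primary weights: the prediction itself
        self.w = np.zeros(n_features)      # secondary weights: gradient correction
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def update(self, phi, cumulant, phi_next, rho):
        # phi, phi_next: feature vectors at successive time steps
        # cumulant: the signal being predicted (a pseudo-reward, e.g. a sensor reading)
        # rho: importance-sampling ratio target/behavior, enabling off-policy learning
        delta = cumulant + self.gamma * (self.theta @ phi_next) - self.theta @ phi
        self.theta += self.alpha * rho * (delta * phi - self.gamma * (self.w @ phi) * phi_next)
        self.w += self.beta * rho * (delta - self.w @ phi) * phi

    def predict(self, phi):
        return self.theta @ phi


# Many predictors share one feature stream; each update is O(n_features),
# so total computation grows linearly with the number of predictions.
n_features, n_predictions = 200, 500
predictors = [GradientTDPredictor(n_features) for _ in range(n_predictions)]

rng = np.random.default_rng(0)
phi = (rng.random(n_features) < 0.05).astype(float)   # sparse binary features
for step in range(50):                                 # stand-in for the 10 Hz robot loop
    phi_next = (rng.random(n_features) < 0.05).astype(float)
    for p in predictors:
        cumulant = rng.random()   # placeholder sensor value to be predicted
        rho = 1.0                 # on-policy here; off-policy would use pi(a|s)/b(a|s)
        p.update(phi, cumulant, phi_next, rho)
    phi = phi_next
```

Because each update costs a fixed number of vector operations per prediction, the total work grows linearly with the number of predictions, which is the sense in which the amount learned per unit time scales with available computation.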
Keywords: temporal difference; neuroscience; linear scaling
Course source: VideoLectures.NET
Last reviewed: 2020-06-02 by 毛岱琦 (volunteer course editor)
Views: 31