Future Information Minimization as PAC-Bayes Regularization in Reinforcement Learning
Course URL: | http://videolectures.net/nipsworkshops2011_tishby_refinement/
Lecturer: | Naftali Tishby
Institution: | The Hebrew University of Jerusalem
Date: | 2012-01-25
Language: | English
Abstract: | Interactions between an organism and its environment are commonly treated in the framework of Markov Decision Processes (MDPs). While the standard MDP is aimed at maximizing expected future rewards (value), the circular flow of information between the agent and its environment is generally ignored. In particular, the information gained from the environment by means of perception and the information involved in the process of action selection are not treated in the standard MDP setting. In this talk, we focus on the control information and show how it can be combined with the reward measure in a unified way. Both of these measures satisfy the familiar Bellman recursive equations, and their linear combination (the free energy) provides an interesting new optimization criterion. The trade-off between value and information, explored using our INFO-RL algorithm, provides a principled justification for stochastic (soft) policies. These optimal policies are also shown to be robust to uncertainties in the reward values by applying the PAC-Bayes generalization bound. The same PAC-Bayesian bounding term thus plays the dual roles of information gain in the Information-RL formalism and of a model-order regularization term in the learning of the process.
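The value–information trade-off described in the abstract can be illustrated with a KL-regularized ("soft") value iteration, where a free-energy backup mixes expected reward with an information cost relative to a prior policy. This is a minimal sketch under illustrative assumptions, not the talk's actual INFO-RL algorithm: the two-state deterministic MDP, the uniform prior policy, and the trade-off parameter `beta` are all made up for demonstration.

```python
import numpy as np

def soft_value_iteration(R, P, beta=1.0, gamma=0.9, iters=500):
    """KL-regularized (free-energy) value iteration sketch.

    R[s, a] : immediate reward for action a in state s
    P[s, a] : next state index (deterministic transitions, for simplicity)
    beta    : reward/information trade-off; beta -> infinity recovers
              the greedy (deterministic) MDP solution, beta -> 0 keeps
              the policy close to the uniform prior.
    Returns the free-energy values F and the soft (stochastic) policy pi.
    """
    n_s, n_a = R.shape
    F = np.zeros(n_s)
    for _ in range(iters):
        Q = R + gamma * F[P]  # soft Q-values under deterministic P
        # Log-sum-exp under a uniform prior pi0 is the free-energy backup:
        # F(s) = (1/beta) log sum_a pi0(a) exp(beta * Q(s, a))
        F = (1.0 / beta) * np.log(np.exp(beta * Q).mean(axis=1))
    pi = np.exp(beta * (Q - F[:, None]))  # Boltzmann policy over Q
    pi /= pi.sum(axis=1, keepdims=True)
    return F, pi

# Toy chain: action 1 pays off in state 0, action 0 pays off in state 1.
R = np.array([[0.0, 1.0], [1.0, 0.0]])
P = np.array([[0, 1], [0, 1]])
F_sharp, pi_sharp = soft_value_iteration(R, P, beta=5.0)
F_soft, pi_soft = soft_value_iteration(R, P, beta=0.1)
```

Raising `beta` concentrates the policy on the greedy action, while a small `beta` keeps it near-uniform, which is the sense in which the information term justifies stochastic (soft) policies.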
Keywords: | Markov decision process; optimization criterion; Bayesian regularization
Source: | VideoLectures.NET
Last reviewed: | 2020-06-01: 吴雨秋 (volunteer course editor)
Views: | 69