Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs




Course URL:	https://videolectures.net/videos/icml09_szita_oigl
Lecturer:	Istvan Szita
Offered by:	Conference
Date:	2009-08-26
Language:	English
Course description:	In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that, with suitable initialization, (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps in which the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; and (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.
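The core idea in the description (keep an empirical model, start it optimistically, then act purely greedily) can be illustrated with a minimal Python sketch. This is not the FOIM algorithm itself: it uses a flat tabular MDP rather than a factored one, plain value iteration rather than AVI, and it realizes the optimism with a fictitious, highly rewarding absorbing "Eden" state, which is an assumption of this sketch. The toy chain environment and all names (`make_optimistic_model`, `plan`, `chain_step`, `run_greedy`, `eden_reward`) are hypothetical.

```python
def make_optimistic_model(n_states, n_actions):
    """Empirical model whose counts start with one imagined jump to 'Eden'."""
    eden = n_states  # index of the fictitious absorbing state (sketch assumption)
    counts = [[[0] * (n_states + 1) for _ in range(n_actions)]
              for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            counts[s][a][eden] = 1  # optimistic prior: every pair "led to Eden" once
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    return counts, reward_sum


def plan(counts, reward_sum, eden_value, gamma, sweeps=200):
    """Value iteration on the current empirical model; returns the Q-table."""
    n_states, n_actions = len(counts), len(counts[0])
    eden = n_states
    v = [0.0] * n_states + [eden_value]  # Eden's value is held fixed
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for a in range(n_actions):
                total = sum(counts[s][a])           # real visits + imagined one
                real = total - counts[s][a][eden]   # real visits only
                r_hat = reward_sum[s][a] / real if real else 0.0
                q[s][a] = r_hat + gamma * sum(
                    c / total * v[s2] for s2, c in enumerate(counts[s][a]))
            v[s] = max(q[s])
    return q


def chain_step(s, a, n_states):
    """Toy deterministic chain: action 0 resets (reward 0.1), action 1 advances."""
    if a == 0:
        return 0, 0.1
    return min(s + 1, n_states - 1), (1.0 if s == n_states - 1 else 0.0)


def run_greedy(n_states=3, n_actions=2, steps=50, gamma=0.9, eden_reward=2.0):
    """Act greedily on the optimistic model; there is no explicit exploration."""
    counts, reward_sum = make_optimistic_model(n_states, n_actions)
    eden_value = eden_reward / (1.0 - gamma)  # value of staying in Eden forever
    s, total_reward = 0, 0.0
    for _ in range(steps):
        q = plan(counts, reward_sum, eden_value, gamma)
        a = max(range(n_actions), key=lambda act: q[s][act])  # ties: lowest index
        s2, r = chain_step(s, a, n_states)
        counts[s][a][s2] += 1       # conventional count-based model update
        reward_sum[s][a] += r
        total_reward += r
        s = s2
    # Real visit counts per (state, action), excluding the imagined Eden visit.
    visits = [[sum(row[:-1]) for row in state] for state in counts]
    return visits, total_reward


if __name__ == "__main__":
    visits, total_reward = run_greedy()
    print("visits per (state, action):", visits)
    print("total reward:", total_reward)
```

In this sketch, an untried state-action pair looks as if it leads straight to Eden, so the greedy policy is drawn toward it. Each real observation dilutes the imagined transition, and the optimism for that pair fades as its visit count grows.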
Keywords:	initial model; decision process; optimal decision
Source:	VideoLectures.NET
Data collected:	2025-04-25: liyq
Last reviewed:	2025-04-25: liyq
Views:	213
