Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs |
|
Course URL: | https://videolectures.net/videos/icml09_szita_oigl |
Lecturer: | Istvan Szita |
Host institution: | Conference |
Date: | 2009-08-26 |
Language: | English |
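Summary (translated from Chinese): | This paper proposes an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in the conventional way and always follows a greedy policy with respect to that model. The algorithm's only trick is that the model is initialized optimistically. We prove that, with suitable initialization, (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps in which the agent makes non-near-optimal decisions (relative to the AVI solution) is polynomial in all relevant quantities; and (iii) the per-step cost of the algorithm is also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties. |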
Course description: | In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties. |
Keywords: | initial model; decision process; optimal decision |
Source: | VideoLectures.NET |
Data collected: | 2025-04-25:liyq |
Last reviewed: | 2025-04-25:liyq |
View count: | 10 |
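
The description above names three ingredients: an empirical model, a greedy policy with respect to it, and optimistic initialization. The following is a minimal sketch of how these can fit together. It is a deliberately simplified tabular (non-factored) illustration and not the FOIM algorithm from the talk: it uses plain value iteration rather than approximate value iteration, and the fictitious high-reward absorbing state is only one possible way to encode optimism. All function and variable names (`make_optimistic_model`, `greedy_q`, `update`, `eden`) are hypothetical.

```python
def make_optimistic_model(n_states, n_actions, r_max):
    """Build count tables with one extra fictitious absorbing state.

    Assumption of this sketch: optimism is encoded by giving every
    (state, action) pair one pseudo-visit that pays r_max and leads to a
    fictitious state (index n_states) that pays r_max forever.
    """
    eden = n_states                      # index of the fictitious state
    n = n_states + 1
    counts = [[[0.0] * n for _ in range(n_actions)] for _ in range(n)]
    reward_sum = [[r_max] * n_actions for _ in range(n)]
    for s in range(n):
        for a in range(n_actions):
            counts[s][a][eden] = 1.0     # optimistic pseudo-transition
    return counts, reward_sum


def greedy_q(counts, reward_sum, gamma, n_iter=300):
    """Run value iteration on the empirical model and return Q-values."""
    n, n_actions = len(counts), len(counts[0])
    q = [[0.0] * n_actions for _ in range(n)]
    for _ in range(n_iter):
        v = [max(row) for row in q]      # greedy state values
        new_q = []
        for s in range(n):
            row = []
            for a in range(n_actions):
                visits = sum(counts[s][a])           # always >= 1
                r_hat = reward_sum[s][a] / visits    # empirical mean reward
                exp_v = sum(c * v[t] for t, c in enumerate(counts[s][a])) / visits
                row.append(r_hat + gamma * exp_v)
            new_q.append(row)
        q = new_q
    return q


def update(counts, reward_sum, s, a, r, s_next):
    """Record one real transition in the empirical model."""
    counts[s][a][s_next] += 1.0
    reward_sum[s][a] += r


if __name__ == "__main__":
    counts, reward_sum = make_optimistic_model(n_states=2, n_actions=2, r_max=1.0)
    update(counts, reward_sum, s=0, a=0, r=0.0, s_next=0)  # a disappointing step
    q = greedy_q(counts, reward_sum, gamma=0.9)
    best = max(range(2), key=lambda a: q[0][a])
    print("greedy action in state 0:", best)
```

In this sketch no separate exploration rule is needed: an action that has been tried and paid less than `r_max` looks worse than an untried one, so the greedy choice itself moves toward unvisited state-action pairs.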