

Bootstrapping Skills
Course URL: http://videolectures.net/rldm2015_mankowitz_bootstrapping_skills/
Lecturer: Daniel Mankowitz
Institution: Technion (Israel Institute of Technology)
Date: 2015-07-28
Language: English
Course abstract: The monolithic approach to policy representation in Markov Decision Processes (MDPs) looks for a single policy that can be represented as a function from states to actions. For the monolithic approach to succeed (and this is not always possible), a complex feature representation is often necessary, since the policy is a complex object that has to prescribe what actions to take all over the state space. This is especially true in large-state MDP domains with complicated dynamics. It is also computationally inefficient to both learn and plan in MDPs using a complex monolithic approach. We present a different approach where we restrict the policy space to policies that can be represented as combinations of simpler, parameterized skills: a type of temporally extended action with a simple policy representation. We introduce Learning Skills via Bootstrapping (LSB), which can use a broad family of Reinforcement Learning (RL) algorithms as a "black box" to iteratively learn parameterized skills. Initially, the learned skills are short-sighted, but each iteration of the algorithm allows the skills to bootstrap off one another, improving each skill in the process. We prove that this bootstrapping process returns a near-optimal policy. Furthermore, our experiments demonstrate that LSB can solve MDPs that, given the same representational power, could not be solved by a monolithic approach. Thus, planning with learned skills results in better policies without requiring complex policy representations.
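
The following is a minimal, self-contained sketch of the bootstrapping idea described in the abstract: skills assigned to regions of a toy chain MDP are re-trained in rounds, and each skill's sub-problem bootstraps its terminal values off the current value estimates of the neighbouring skills. The chain environment, the fixed region partition, and the tabular value-iteration "solver" are illustrative assumptions, not the authors' LSB algorithm or representation.

import numpy as np

N_STATES = 12          # chain states 0..11; reward only for reaching the last state
N_REGIONS = 3          # one skill per contiguous block of 4 states
GAMMA = 0.95
ACTIONS = (-1, +1)     # move left / move right along the chain

def region_of(s):
    # Which skill is responsible for state s (a fixed, hand-made partition).
    return min(s // (N_STATES // N_REGIONS), N_REGIONS - 1)

def step(s, a):
    # Deterministic chain dynamics; reward 1 and termination on reaching the goal.
    s2 = min(max(s + a, 0), N_STATES - 1)
    reached_goal = (s2 == N_STATES - 1)
    return s2, (1.0 if reached_goal else 0.0), reached_goal

def train_skill(region, V):
    # "Black-box" solver for one skill: tabular value iteration restricted to the
    # skill's region, bootstrapping terminal values from the current global
    # estimate V whenever an action leaves the region.
    states = [s for s in range(N_STATES) if region_of(s) == region]
    Q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(100):
        for s in states:
            for i, a in enumerate(ACTIONS):
                s2, r, done = step(s, a)
                if done:
                    Q[s, i] = r
                elif region_of(s2) != region:
                    Q[s, i] = r + GAMMA * V[s2]        # bootstrap off a neighbouring skill
                else:
                    Q[s, i] = r + GAMMA * Q[s2].max()  # ordinary backup inside the region
    return Q

# Outer loop: each round re-trains every skill against the values induced by the
# others, so value from the goal propagates backwards one region at a time and
# the initially short-sighted skills improve ("bootstrap off one another").
V = np.zeros(N_STATES)
skills = [None] * N_REGIONS
for it in range(5):
    for k in range(N_REGIONS):
        skills[k] = train_skill(k, V)
    V = np.array([skills[region_of(s)][s].max() for s in range(N_STATES)])
    print(f"round {it}: V(start) = {V[0]:.3f}")

In this toy run, V(start) stays at zero for the first two rounds and only becomes non-zero once value has propagated across every region, mirroring the "initially short-sighted skills" behaviour described in the abstract.
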
Keywords: algorithms; reinforcement learning; policy
Source: VideoLectures.NET
Data collected: 2020-12-14: yxd
Last reviewed: 2020-12-14: yxd
Views: 39