

Escaping Groundhog Day
Course URL: http://videolectures.net/rldm2015_macglashan_groundhog_day/
Lecturer: James MacGlashan
Institution: Brown University
Date: 2015-07-28
Language: English
Course description: The dominant approaches to reinforcement learning rely on a fixed state-action space and reward function that the agent is trying to maximize. During training, the agent is repeatedly reset to a predefined initial state or set of initial states. For example, in the classic RL Mountain Car domain, the agent starts at some point in the valley, continues until it reaches the top of the valley and then resets to somewhere else in the same valley. Learning in this regime is akin to the learning problem faced by Bill Murray in the 1993 movie Groundhog Day in which he repeatedly relives the same day, until he discovers the optimal policy and escapes to the next day. In a more realistic formulation for an RL agent, every day is a new day that may have similarities to the previous day, but the agent never encounters the same state twice. This formulation is a natural fit for robotics problems in which a robot is placed in a room in which it has never previously been, but has seen similar rooms with similar objects in the past. We formalize this problem as optimizing a learning or planning algorithm for a set of environments drawn from a distribution and present two sets of results for learning under these settings. First, we present goal-based action priors for learning how to accelerate planning in environments drawn from the distribution from a training set of environments drawn from the same distribution. Second, we present sample-optimized Rademacher complexity, which is a formal mechanism for assessing the risk in choosing a learning algorithm tuned on a training set drawn from the distribution for use on the entire distribution.
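The core formulation in the description — tune a learner on a sampled training set of environments, then deploy it across the whole distribution — can be illustrated with a toy sketch. Everything below is an assumption for exposition, not material from the lecture: the CorridorEnv class, the single right_prob policy parameter, and the grid-search tuning are hypothetical. The talk's goal-based action priors and sample-optimized Rademacher complexity are not implemented; the train-versus-held-out gap printed at the end is simply the kind of risk such a complexity measure is meant to bound.

```python
import random
import statistics

# Toy environment drawn from a distribution: a 1-D corridor with the goal
# placed at a random cell. The agent's "policy" is one shared parameter:
# the probability of stepping right. (Hypothetical example, not from the talk.)

class CorridorEnv:
    def __init__(self, rng, length=20):
        self.length = length
        self.goal = rng.randint(length // 2, length - 1)  # goals skew rightward

    def episode_return(self, right_prob, rng, max_steps=50):
        """Roll out the stochastic policy; return is -(steps to reach the goal)."""
        pos, steps = 0, 0
        while pos != self.goal and steps < max_steps:
            pos += 1 if rng.random() < right_prob else -1
            pos = max(0, min(self.length - 1, pos))
            steps += 1
        return -steps

def mean_return(right_prob, envs, rng, episodes=5):
    return statistics.mean(
        env.episode_return(right_prob, rng)
        for env in envs for _ in range(episodes)
    )

rng = random.Random(0)

# "Every day is a new day": sample a training SET of environments from the
# distribution and tune the policy parameter on that set only.
train_envs = [CorridorEnv(rng) for _ in range(10)]
candidates = [i / 10 for i in range(1, 10)]
best = max(candidates, key=lambda p: mean_return(p, train_envs, rng))

# Assess the tuned choice on the distribution itself: evaluate on a fresh
# held-out sample and compare with the training-set estimate. The gap between
# the two numbers is the quantity a Rademacher-style bound is designed to control.
test_envs = [CorridorEnv(rng) for _ in range(100)]
print("tuned right_prob :", best)
print("train estimate   :", mean_return(best, train_envs, rng))
print("held-out estimate:", mean_return(best, test_envs, rng))
```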
Keywords: reinforcement learning; robotics; optimized learning algorithms
Source: VideoLectures.NET
Data collected: 2021-11-27:zkj
Last reviewed: 2021-11-27:zkj
Views: 40