Online Markov Decision Processes under Bandit Feedback
Course URL: http://videolectures.net/nips2010_neu_omd/
Lecturer: Gergely Neu
Institution: INRIA Lille - Nord Europe
Date: 2011-03-25
Language: English
Course description: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved convergence-rate result for this problem.
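The regret being bounded above is the usual notion for adversarial MDPs. The following formalization, with r_t denoting the reward function chosen by the adversary at step t, is a standard rendering of "compete with the best stationary policy", not notation quoted from the paper:

    R_T = max_π E[ sum_{t=1}^T r_t(x_t^π, a_t^π) ] - E[ sum_{t=1}^T r_t(x_t, a_t) ]

where (x_t, a_t) is the state-action trajectory of the learner and (x_t^π, a_t^π) the trajectory the fixed stationary policy π would generate. A minimal Python sketch of the interaction protocol described in the abstract may also help; the environment sizes, the reward tables, and all names below are illustrative assumptions, not code from the talk:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, T = 4, 2, 1000

    # The transition kernel P[s, a] is a distribution over next states;
    # the abstract assumes the agent knows these probabilities.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

    state, total_reward = 0, 0.0
    for t in range(T):
        # Oblivious adversary: r_t is chosen without looking at the agent's
        # past actions (modeled here by an independent draw per step).
        r_t = rng.uniform(size=(n_states, n_actions))

        action = int(rng.integers(n_actions))  # placeholder for the learner's policy
        reward = r_t[state, action]            # bandit feedback: only this scalar is revealed
        total_reward += reward                 # r_t at other state-action pairs stays hidden
        state = int(rng.choice(n_states, p=P[state, action]))

Against this protocol, the O(T^{2/3} (ln T)^{1/3}) bound says the per-step gap to the best stationary policy vanishes at rate roughly T^{-1/3}.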
Keywords: Markovian environments
Source: VideoLectures.NET
Last reviewed: 2019-07-26 (cwx)
Views: 49