Online Markov Decision Processes under Bandit Feedback
Course URL: http://videolectures.net/nips2010_neu_omd/
Lecturer: Gergely Neu
Institution: INRIA
Date: 2011-03-25
Language: English

Course description: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved convergence rate result for the problem.
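
The following is a minimal sketch of the interaction protocol described in the abstract: a finite MDP with a known transition kernel, an oblivious adversary that fixes one reward function per time step in advance, and a learner that observes only the reward of the transition it actually took (bandit feedback). The `Learner` class, the `run_protocol` function, and the uniform-random stand-in policy are illustrative assumptions only, not the algorithm presented in the lecture.

```python
# Sketch of the online-MDP-with-bandit-feedback protocol from the abstract.
# Assumptions: small finite MDP, rewards in [0, 1], transitions known to the
# learner, reward functions fixed in advance by an oblivious adversary.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 1000

# Known transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Oblivious adversary: the whole reward sequence r_t(s, a) is fixed before
# the interaction starts and does not depend on the learner's actions.
rewards = rng.uniform(size=(T, n_states, n_actions))


class Learner:
    """Placeholder learner; the lecture's algorithm would go here."""

    def act(self, state):
        return rng.integers(n_actions)       # stand-in uniform-random policy

    def update(self, state, action, reward):
        pass                                 # sees only the bandit feedback


def run_protocol(learner, x0=0):
    total, x = 0.0, x0
    for t in range(T):
        a = learner.act(x)                   # observe current state, act
        r = rewards[t, x, a]                 # only this entry is revealed;
        learner.update(x, a, r)              # the full r_t(., .) stays hidden
        total += r
        x = rng.choice(n_states, p=P[x, a])  # stochastic transition, known P
    return total


print("total reward of the learner:", run_protocol(Learner()))
```

The regret discussed in the abstract compares this cumulative reward with the expected total reward of the best stationary policy; under the uniform-mixing assumption, the stated bound is that this gap grows as O(T^{2/3} (ln T)^{1/3}) in expectation.
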
Keywords: convergence rate; learning algorithm
Source: VideoLectures.NET
Data collected: 2021-03-07:zyk
Last reviewed: 2021-03-27:yumf
Views: 45