

Use of variance estimation in the multi-armed bandit problem
Course URL: http://videolectures.net/otee06_audibert_uvema/
Lecturer: Jean-Yves Audibert
Institution: École des Mines de Paris
Date: 2007-02-25
Language: English
Description: An important aspect of most decision-making problems concerns the appropriate balance between exploitation (acting optimally according to the partial knowledge acquired so far) and exploration of the environment (acting sub-optimally in order to refine the current knowledge and improve future decisions). A typical example of this so-called exploration versus exploitation dilemma is the multi-armed bandit problem, for which many strategies have been developed. Here we investigate policies based on the choice of the arm having the highest upper-confidence bound, where the bound takes into account the empirical variance of the different arms. Such an algorithm was found earlier to outperform its peers in a series of numerical experiments. The main contribution of this paper is the theoretical investigation of this algorithm. Our contribution here is twofold. First, we prove that with probability at least 1 − β, the regret after n plays of a variant of the UCB algorithm (called β-UCB) is upper-bounded by a constant that scales linearly with log(1/β) but is independent of n. We also analyse a variant which is closer to the algorithm suggested earlier. We prove a logarithmic bound on the expected regret of this algorithm and argue that the bound scales favourably with the variance of the suboptimal arms.
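
To make the idea concrete, the Python sketch below (not taken from the lecture) illustrates a variance-aware upper-confidence rule in the spirit of UCB-V: each arm's index is its empirical mean plus a Bernstein-type confidence term built from the empirical variance, and the arm with the highest index is played. The exploration constant zeta, the reward-range bound b, and all function names are illustrative assumptions.

    import math
    import random

    def ucbv_index(mean, var, n_pulls, t, b=1.0, zeta=1.2):
        """Variance-aware upper-confidence index in the spirit of UCB-V.

        mean, var : empirical mean and variance of the arm's rewards
        n_pulls   : number of times this arm has been played
        t         : current round (total plays so far)
        b         : assumed upper bound on the reward range (illustrative)
        zeta      : exploration constant (illustrative)
        """
        e = zeta * math.log(t)
        return mean + math.sqrt(2.0 * var * e / n_pulls) + 3.0 * b * e / n_pulls

    def play(arms, horizon):
        """Pull each arm once, then always pull the arm with the highest index."""
        stats = [{"n": 0, "mean": 0.0, "m2": 0.0} for _ in arms]  # Welford accumulators
        for t in range(1, horizon + 1):
            if t <= len(arms):
                k = t - 1                                  # initial round-robin
            else:
                k = max(range(len(arms)),
                        key=lambda i: ucbv_index(stats[i]["mean"],
                                                 stats[i]["m2"] / stats[i]["n"],
                                                 stats[i]["n"], t))
            r = arms[k]()                                  # observe a reward in [0, 1]
            s = stats[k]
            s["n"] += 1
            d = r - s["mean"]
            s["mean"] += d / s["n"]
            s["m2"] += d * (r - s["mean"])                 # running sum of squared deviations
        return stats

    # Example: two Bernoulli arms with success probabilities 0.5 and 0.6.
    random.seed(0)
    arms = [lambda: float(random.random() < 0.5), lambda: float(random.random() < 0.6)]
    print(play(arms, 2000))

In this sketch the suboptimal arm's confidence term shrinks faster when its empirical variance is small, which is the intuition behind the favourable variance-dependent regret bound discussed in the abstract.
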
Keywords: UCB algorithm; decision making; algorithms
Source: VideoLectures.NET (视频讲座网)
Last reviewed: 2020-10-22 (chenxin)
Views: 298