

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Course URL: https://videolectures.net/videos/colt2011_garivier_bandits
Lecturer: Aurélien Garivier
Venue: COLT 2011, Budapest
Date: 2011-08-02
Language: English
Abstract: This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions. A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviation results of independent interest, which are stated and proved in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB algorithm.
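
The abstract describes the KL-UCB index only in words; as a concrete illustration, the following is a minimal sketch of the policy for Bernoulli rewards, where the index of arm a at time t is the largest q ≥ μ̂_a with N_a(t) · d(μ̂_a, q) ≤ log t + c · log log t (d being the Bernoulli KL divergence), solved here by bisection. This is not code from the paper or the lecture: the arm means, the seed, the bisection tolerance, and the practical choice c = 0 are illustrative assumptions.

```python
import math
import random

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Largest q in [mean, 1] with pulls * d(mean, q) <= log t + c * log log t,
    found by bisection (d(mean, .) is increasing on [mean, 1])."""
    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) > budget:
            hi = mid
        else:
            lo = mid
    return lo

def kl_ucb(arm_means, horizon, seed=0):
    """Run KL-UCB on simulated Bernoulli arms; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls, sums = [0] * k, [0.0] * k
    for a in range(k):  # pull each arm once to initialize
        sums[a] += float(rng.random() < arm_means[a])
        pulls[a] += 1
    for t in range(k + 1, horizon + 1):
        a = max(range(k),
                key=lambda i: kl_ucb_index(sums[i] / pulls[i], pulls[i], t))
        sums[a] += float(rng.random() < arm_means[a])
        pulls[a] += 1
    return pulls

# Illustrative arm means; most pulls should concentrate on the 0.6 arm.
print(kl_ucb([0.1, 0.5, 0.6], horizon=2000))
```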
Keywords: KL-UCB algorithm; bounded stochastic bandits; index policy
Source: VideoLectures.NET
Data collected: 2024-10-22: liyq
Last reviewed: 2024-10-22: liyq
Views: 7