

Contextual Bandit Algorithms with Supervised Learning Guarantees, incl. discussion by Brendan McMahan
Course URL: http://videolectures.net/aistats2011_reyzin_bandit/
Lecturer: Brendan McMahan
Institution: Google
Date: 2011-03-06
Language: English

Course description: We address the problem of competing with any large set of $N$ policies in the non-stochastic bandit setting, where the learner must repeatedly select among $K$ actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of Auer et al., called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT\ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically on a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC-dimension $d$ while incurring regret at most $O(\sqrt{Td\ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.
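
To make the mechanics behind the abstract concrete, below is a minimal Python sketch of an Exp4.P-style update: expert advice is mixed by exponential weights, a uniform exploration floor is added, rewards are estimated by importance weighting, and each expert's estimated gain receives a confidence correction before the weight update. The function name, signature, and the specific constants (`p_min`, `c`) are illustrative assumptions, not the paper's exact parameter choices.

```python
import numpy as np

def exp4p(expert_advice, rewards, delta=0.05, rng=None):
    """Sketch of an Exp4.P-style contextual bandit learner.

    expert_advice: array of shape (T, N, K); expert_advice[t, i] is expert i's
        probability distribution over the K actions at round t.
    rewards: array of shape (T, K) with rewards in [0, 1]; only the reward of
        the chosen action is revealed to the learner.
    Returns the list of chosen actions.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, N, K = expert_advice.shape
    p_min = min(np.sqrt(np.log(N) / (K * T)), 1.0 / K)  # exploration floor
    c = np.sqrt(np.log(N / delta) / (K * T))            # confidence correction
    log_w = np.zeros(N)                                  # expert weights (log-space)
    actions = []
    for t in range(T):
        q = np.exp(log_w - log_w.max())
        q /= q.sum()                                     # distribution over experts
        p = q @ expert_advice[t]                         # mixture over actions
        p = (1 - K * p_min) * p + p_min                  # mix in uniform exploration
        a = rng.choice(K, p=p)
        actions.append(a)
        # Importance-weighted reward estimate: unbiased but high-variance,
        # which is why Exp4 alone lacks a high-probability regret bound.
        r_hat = np.zeros(K)
        r_hat[a] = rewards[t, a] / p[a]
        y_hat = expert_advice[t] @ r_hat                 # estimated gain per expert
        v_hat = expert_advice[t] @ (1.0 / p)             # variance proxy per expert
        log_w += (p_min / 2) * (y_hat + v_hat * c)       # Exp4.P-style update
    return actions
```

A quick smoke test under these assumptions could draw random advice and rewards, e.g. `exp4p(rng.dirichlet(np.ones(5), size=(1000, 20)), rng.random((1000, 5)))` for $T=1000$, $N=20$, $K=5$.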
Keywords: supervised learning; competing with policies
Source: VideoLectures.NET
Data collected: 2020-12-29 by zyk
Last reviewed: 2020-12-29 by zyk
Views: 64