Opportunistic Strategies for Generalized No-Regret Problems
Course URL: http://videolectures.net/colt2013_bernstein_problems/
Lecturer: Andrey Bernstein
Institution: Technion – Israel Institute of Technology
Date: 2013-08-09
Language: English
Course description: No-regret algorithms have played a key role in online learning and prediction problems. In this paper, we focus on a generalized no-regret problem with vector-valued rewards, defined in terms of a desired reward set of the agent. For each mixed action q of the opponent, the agent has a set R(q) in which the average reward should reside. In addition, the agent has a response mixed action p which brings the expected reward under these two actions, r(p, q), into R(q). If a strategy of the agent ensures that the average reward converges to R(q̄_n), where q̄_n is the empirical distribution of the opponent's actions, for any strategy of the opponent, we say that it is a no-regret strategy with respect to R(q). The standard no-regret problem is obtained as a special case for scalar rewards and R(q) = {r ∈ R : r ≥ r*(q)}, where r*(q) = max_p r(p, q). For this problem, the multifunction R(q) is convex, and no-regret strategies can be devised. Our main interest in this paper is in cases where this convexity property does not hold. The best that can be guaranteed in general is then the convergence of the average reward to R^c(q̄_n), the convex hull of R(q̄_n). However, as the game unfolds, it may turn out that the opponent's choices of actions are limited in some way. If these restrictions were known in advance, the agent could possibly ensure convergence of the average reward to some desired subset of R^c(q̄_n), or even approach R(q̄_n) itself. We formulate appropriate goals for opportunistic no-regret strategies, in the sense that they may exploit such limitations on the opponent's action sequence in an online manner, without knowing them beforehand. As the main technical tool, we propose a class of approachability algorithms that rely on a calibrated forecast of the opponent's actions and that are opportunistic in the above sense. As an application, we consider the online no-regret problem with average cost constraints, introduced in Mannor, Tsitsiklis, and Yu (2009), where the best response in hindsight is not generally attainable, but only its convex relaxation. Our proposed algorithm, applied to this problem, does attain the best response in hindsight if the opponent's play happens to be stationary (either in terms of its mixed actions or of the empirical frequencies of its pure actions).
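
To make the standard scalar special case concrete, the following is a minimal sketch (not from the lecture; the payoff matrix, action counts, and helper names are illustrative assumptions) of how the best-response-in-hindsight value r*(q̄_n) = max_p r(p, q̄_n) can be computed against the empirical distribution of the opponent's pure actions, and how the resulting regret of a fixed play sequence is measured:

import numpy as np

# Illustrative 3x3 payoff matrix: reward[i, j] is the agent's scalar reward
# when the agent plays pure action i and the opponent plays pure action j.
# (Hypothetical values; the lecture considers a general finite game.)
reward = np.array([[1.0, 0.0, 0.2],
                   [0.3, 0.8, 0.1],
                   [0.5, 0.4, 0.9]])

def best_response_in_hindsight(reward, q_bar):
    # r(p, q_bar) is linear in p, so max over mixed actions is attained
    # at a pure action, i.e. at the best row of the payoff matrix.
    expected_per_action = reward @ q_bar   # r(i, q_bar) for each pure action i
    return expected_per_action.max()

# Example play over n rounds (pure actions, chosen arbitrarily here).
opponent_actions = np.array([0, 2, 2, 1, 2, 0, 2, 2])
agent_actions    = np.array([1, 0, 2, 1, 2, 0, 0, 2])
n = len(opponent_actions)

# Empirical distribution q̄_n of the opponent's pure actions.
q_bar = np.bincount(opponent_actions, minlength=reward.shape[1]) / n

# Average reward actually obtained vs. the best-response-in-hindsight target.
avg_reward = reward[agent_actions, opponent_actions].mean()
regret = best_response_in_hindsight(reward, q_bar) - avg_reward
print(f"average reward = {avg_reward:.3f}, "
      f"r*(q_bar_n) = {best_response_in_hindsight(reward, q_bar):.3f}, "
      f"regret = {regret:.3f}")

A no-regret strategy drives the regret computed this way to zero as n grows; the generalized problem treated in the lecture replaces the scalar target r*(q̄_n) with the set R(q̄_n) for vector-valued rewards.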
Keywords: no-regret algorithms; online learning; prediction problems; vector-valued rewards; desired reward set; scalar rewards; approachability algorithms
Source: VideoLectures.NET open courses
Last reviewed: 2019-05-26: cwx
Views: 86