

Preference-based policy iteration: Leveraging preference learning for reinforcement learning
Course URL: http://videolectures.net/ecmlpkdd2011_furnkranz_iteration/
Lecturer: Johannes Fürnkranz
Institution: Technische Universität Darmstadt
Date: 2011-11-30
Language: English
Course description: This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a "preference-based" approach to reinforcement learning is a possible extension of the type of feedback an agent may learn from. In particular, while conventional RL methods are essentially confined to dealing with numerical rewards, there are many applications in which this type of information is not naturally available, and in which only qualitative reward signals are provided instead. Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow for sorting its available actions from most to least promising, as well as algorithms for learning such models from qualitative feedback. Concretely, in this paper, we build on an existing method for approximate policy iteration based on roll-outs. While that approach is based on the use of classification methods for generalization and policy learning, we make use of a specific type of preference learning method called label ranking. Advantages of our preference-based policy iteration method are illustrated by means of two case studies.
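
The sketch below illustrates the general idea in the description: rollout estimates of action values are turned into pairwise action preferences, a ranking model over actions is trained from those preferences, and the top-ranked action defines the next policy. It is a minimal illustration only; the toy chain MDP, the linear per-action scoring model (a crude stand-in for a proper label-ranking learner), and all hyperparameters are assumptions, not the setup used in the lecture or paper.

```python
# Minimal sketch of rollout-based, preference-based policy iteration.
# Assumptions (not from the paper): a deterministic 6-state chain MDP,
# a linear per-action utility as a stand-in for a label ranker.

import random

N_STATES = 6          # states 0..5; reaching state 5 ends the episode with reward 1
ACTIONS = [-1, +1]    # move left / move right along the chain
GAMMA = 0.9
N_ROLLOUTS = 20
HORIZON = 15


def step(state, action):
    """Deterministic chain dynamics: reward 1 only when the last state is reached."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1


def rollout_return(state, first_action, policy):
    """Discounted return of taking `first_action` in `state`, then following `policy`."""
    total, discount, s, a = 0.0, 1.0, state, first_action
    for _ in range(HORIZON):
        s, r, done = step(s, a)
        total += discount * r
        discount *= GAMMA
        if done:
            break
        a = policy(s)
    return total


def preferences_from_rollouts(policy):
    """Estimate Q(s, a) by Monte-Carlo rollouts and emit pairwise preferences
    (state, preferred_action, dominated_action)."""
    prefs = []
    for s in range(N_STATES - 1):
        q = {a: sum(rollout_return(s, a, policy) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS
             for a in ACTIONS}
        for a in ACTIONS:
            for other in ACTIONS:
                if a != other and q[a] > q[other]:
                    prefs.append((s, a, other))
    return prefs


def train_ranker(prefs, epochs=50, lr=0.1):
    """Fit a tiny linear utility score(s, a) = w[a]*s + b[a] from pairwise
    preferences (standing in for a label-ranking learner)."""
    w = {a: 0.0 for a in ACTIONS}
    b = {a: 0.0 for a in ACTIONS}
    for _ in range(epochs):
        for s, good, bad in prefs:
            margin = (w[good] * s + b[good]) - (w[bad] * s + b[bad])
            if margin < 1.0:  # perceptron-style update when the preference is violated
                w[good] += lr * s; b[good] += lr
                w[bad] -= lr * s;  b[bad] -= lr
    # The greedy policy picks the top-ranked action in each state.
    return lambda s: max(ACTIONS, key=lambda a: w[a] * s + b[a])


random.seed(0)
policy = lambda s: random.choice(ACTIONS)   # start from a uniformly random policy
for _ in range(3):                          # a few preference-based policy-iteration sweeps
    policy = train_ranker(preferences_from_rollouts(policy))
print([policy(s) for s in range(N_STATES - 1)])  # with enough rollouts, +1 (move right) everywhere
```

As the description emphasizes, only qualitative comparisons between actions are consumed by the learner here; the numeric rollout returns are used solely to decide which of two actions is preferred, so the same loop would work if the environment supplied comparisons directly instead of numerical rewards.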
Keywords: preference learning; reinforcement learning; machine learning
Source: VideoLectures.NET
Last reviewed: 2019-04-02 (cwx)
Views: 88