Separated Trust Regions Policy Optimization Method |
Course URL: | http://videolectures.net/kdd2019_zou_zhuang_cheng/
Lecturer: | Luobao Zou
Institution: | Shanghai Jiao Tong University
Date: | 2020-03-02
Language: | English
Abstract: | In this work, we propose a moderate policy update method for reinforcement learning, which encourages the agent to explore more boldly in early episodes but updates the policy more cautiously. Based on the maximum entropy framework, we propose a softer objective with more conservative constraints and build separated trust regions for optimization. To reduce the variance of the expected entropy return, a calculated state policy entropy of the Gaussian distribution is preferred over collecting log probabilities by sampling. This new method, which we call separated trust regions for policy mean and variance (STRMV), can be viewed as an extension of proximal policy optimization (PPO), but it is gentler in policy updates and more active in exploration. We test our approach on a wide variety of continuous control benchmark tasks in the MuJoCo environment. The experiments demonstrate that STRMV outperforms previous state-of-the-art on-policy methods, not only achieving higher rewards but also improving sample efficiency. (A short sketch illustrating the closed-form Gaussian entropy point follows this entry.)
Keywords: | maximum entropy framework; trust region; Gaussian distribution
Source: | VideoLectures.NET
Data collected: | 2020-11-29:cjy
Last reviewed: | 2020-11-29:cjy
Views: | 55
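
The abstract's point about computing the policy entropy in closed form rather than estimating it from sampled log probabilities can be made concrete with a small sketch. The snippet below is not from the lecture or the STRMV paper; it assumes a hypothetical diagonal Gaussian policy (the values of `mu`, `sigma`, and the sample count are made up for illustration) and contrasts the exact differential entropy, 0.5·log(2πe·σ²) summed over action dimensions, with a Monte Carlo estimate of E[−log π(a|s)]. Only the sampled estimate fluctuates from run to run, which is the extra variance the abstract says the closed-form computation avoids.

```python
# Illustrative sketch only (not the STRMV implementation): closed-form
# entropy of a diagonal Gaussian policy vs. a sampled estimate.
# The policy parameters and sample size below are assumptions for this example.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical diagonal Gaussian policy pi(a|s) = N(mu, diag(sigma^2))
mu = np.array([0.3, -1.2, 0.7])
sigma = np.array([0.5, 1.0, 0.2])

# Closed-form differential entropy: sum_i 0.5 * log(2 * pi * e * sigma_i^2)
closed_form = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma**2))

# Monte Carlo estimate: E[-log pi(a|s)] from sampled actions
actions = mu + sigma * rng.standard_normal((1024, mu.size))
log_prob = -0.5 * np.sum(
    ((actions - mu) / sigma) ** 2 + np.log(2.0 * np.pi * sigma**2), axis=1
)
sampled = -log_prob.mean()

print(f"closed-form entropy : {closed_form:.4f}")
print(f"sampled estimate    : {sampled:.4f}  (changes with the random seed)")
```

Both numbers agree in expectation, but only the closed-form value is deterministic; how this entropy term enters the separated trust-region objective is detailed in the lecture itself.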