

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Course URL: http://videolectures.net/nipsworkshops2011_shamir_convex/
Lecturer: Ohad Shamir
Institution: Microsoft
Date: 2012-01-25
Language: English
Course description: Stochastic gradient descent (SGD) is a simple and popular method for solving the stochastic optimization problems that arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), obtained by running SGD for T iterations and returning the average of the iterates. However, recent results showed that, using a different algorithm, one can obtain the optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and perhaps should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change to the algorithm is necessary. We also present experimental results that support our findings, and point out open problems.
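As a rough illustration of the two averaging schemes contrasted in the description, the following minimal sketch runs SGD with step size η_t = 1/(λt) on a toy strongly convex, non-smooth objective and compares the uniform average of all iterates with a suffix average (the mean of the last α·T iterates), which is one way to modify the averaging step along the lines described above. The objective, dimension, and the value of α are illustrative assumptions and are not taken from the lecture or the paper.

```python
# Minimal sketch (not from the lecture): SGD with step size eta_t = 1/(lam*t)
# on a toy strongly convex, non-smooth objective, comparing uniform averaging
# of all iterates with alpha-suffix averaging (mean of the last alpha*T iterates).
# The objective, data distribution, and alpha are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

lam = 1.0        # strong-convexity parameter (assumed)
alpha = 0.5      # suffix fraction (assumed)
T = 10_000       # number of SGD iterations
d = 5            # dimension (assumed)

# Toy objective: F(w) = (lam/2)*||w||^2 + E_x[ ||w - x||_1 ], x ~ N(0, I).
# The L1 term makes F non-smooth; the quadratic term makes it lam-strongly convex.
def stochastic_subgradient(w):
    x = rng.standard_normal(d)
    return lam * w + np.sign(w - x)

w = np.zeros(d)
running_sum = np.zeros(d)            # accumulates all iterates (uniform average)
suffix_sum = np.zeros(d)             # accumulates only the last alpha*T iterates
suffix_start = int((1 - alpha) * T)

for t in range(1, T + 1):
    g = stochastic_subgradient(w)
    w = w - g / (lam * t)            # eta_t = 1 / (lam * t)
    running_sum += w
    if t > suffix_start:
        suffix_sum += w

w_avg = running_sum / T
w_suffix = suffix_sum / (T - suffix_start)
print("uniform average :", np.round(w_avg, 4))
print("alpha-suffix avg:", np.round(w_suffix, 4))
```

For this toy objective the minimizer is the origin, so both averaged points should land near zero; the sketch is only meant to show how the suffix average is formed from the same SGD iterates, not to reproduce the paper's rate comparison.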
Keywords: stochastic gradient descent; optimization problems; convergence rate
Source: VideoLectures.NET
Data collected: 2022-12-16 (chenjy)
Last reviewed: 2022-12-16 (chenjy)
Views: 20