

Beyond stochastic gradient descent for large-scale machine learning
Course URL: http://videolectures.net/sahd2014_bach_stochastic_gradient/
Lecturer: Francis R. Bach
Institution: INRIA (French Institute for Research in Computer Science and Automation)
Date: 2014-10-29
Language: English
Course description: Many machine learning and signal processing problems are traditionally cast as convex optimization problems. A common difficulty in solving these problems is the size of the data, where there are many observations ("large n") and each of these is large ("large p"). In this setting, online algorithms such as stochastic gradient descent, which pass over the data only once, are usually preferred over batch algorithms, which require multiple passes over the data. Given n observations/iterations, the optimal convergence rates of these algorithms are O(1/sqrt{n}) for general convex functions and O(1/n) for strongly convex functions. In this talk, I will show how the smoothness of loss functions may be used to design novel algorithms with improved behavior, both in theory and practice: in the ideal infinite-data setting, an efficient novel Newton-based stochastic approximation algorithm leads to a convergence rate of O(1/n) without strong convexity assumptions, while in the practical finite-data setting, an appropriate combination of batch and online algorithms leads to unexpected behaviors, such as a linear convergence rate for strongly convex problems, with an iteration cost similar to stochastic gradient descent. (Joint work with Nicolas Le Roux, Eric Moulines, and Mark Schmidt.)
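
The abstract contrasts plain stochastic gradient descent with methods that combine batch and online updates in the finite-data setting (the SAG-style approach of the joint work with Le Roux and Schmidt). The sketch below is a minimal NumPy illustration rather than code from the talk: it compares the two update rules on a synthetic least-squares problem, and the problem sizes, step-size choice, and function names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the speaker's code): plain SGD versus a
# SAG-style update on synthetic least squares with n observations in dimension p.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 20                              # "large n" observations, dimension p
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# A safe constant step size: inverse of the largest per-example Lipschitz
# constant for the squared loss (an illustrative choice, not from the talk).
step = 1.0 / np.max(np.sum(X ** 2, axis=1))

def sgd(n_passes=10):
    """Stochastic gradient descent: one randomly drawn observation per update."""
    w = np.zeros(p)
    for _ in range(n_passes * n):
        i = rng.integers(n)
        grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of 0.5 * (x_i' w - y_i)^2
        w -= step * grad_i
    return w

def sag(n_passes=10):
    """SAG-style update: remember the last gradient seen for each observation
    and step along the average of that table, at O(p) cost per iteration."""
    w = np.zeros(p)
    grads = np.zeros((n, p))                 # gradient memory, one row per observation
    avg = np.zeros(p)
    for _ in range(n_passes * n):
        i = rng.integers(n)
        new_grad = (X[i] @ w - y[i]) * X[i]
        avg += (new_grad - grads[i]) / n     # keep the running average up to date
        grads[i] = new_grad
        w -= step * avg
    return w

def objective(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

print("SGD objective:", objective(sgd()))
print("SAG objective:", objective(sag()))
```

The SAG-style loop stores one past gradient per observation, which is what keeps its per-iteration cost comparable to stochastic gradient descent while allowing a constant step size; this is the mechanism behind the linear convergence rate for strongly convex problems mentioned in the abstract.
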
Keywords: machine learning; algorithms; observations
Course source: VideoLectures.NET (视频讲座网)
Data collected: 2020-10-09:zyk
Last reviewed: 2021-01-15:yumf
Views: 38