KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial
Course URL: http://videolectures.net/kdd2017_li_distributed_learning/
Lecturer: Xiaolong Li
Institution: Ant Financial
Date: 2017-10-09
Language: English
Course description: In recent years, the emergence of Big Data (terabytes or petabytes) and Big Models (tens of billions of parameters) has created an ever-increasing need to parallelize machine learning (ML) algorithms in both academia and industry. Although existing distributed computing systems such as Hadoop and Spark can parallelize ML algorithms, they provide only synchronous, coarse-grained operators (e.g., Map, Reduce, and Join), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform, termed KunPeng, that combines distributed systems and parallel optimization algorithms to deal with the complexities that arise in large-scale ML. Specifically, KunPeng not only encapsulates data/model parallelism, load balancing, model sync-up, sparse representation, and industrial fault tolerance, but also provides an easy-to-use interface that lets users focus on the core ML logic. Empirical results on terabytes of real-world data with billions of samples and features demonstrate that such a design brings compelling performance improvements to ML programs ranging from the Follow-the-Regularized-Leader Proximal algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng’s encouraging performance is also shown in several real-world applications, including Alibaba’s Double 11 Online Shopping Festival.
Keywords: machine learning; computing systems; KunPeng platform
Source: VideoLectures.NET
Data collected: 2023-03-13:chenxin01
Last reviewed: 2023-05-17:chenxin01
Views: 21