Foundations for Scaling ML in Apache Spark
Course URL: https://videolectures.net/videos/kdd2016_bradley_apache_spark
Lecturer: Joseph K. Bradley
Host: KDD 2016 Workshop
Date: 2016-10-12
Language: English
Course description: Apache Spark has become the most active open source Big Data project, and its Machine Learning library, MLlib, has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to hundreds or thousands of machines. This talk describes ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. The talk discusses the goals, the challenges, and the benefits for MLlib users and developers. More generally, it reflects on the importance of integrating ML with the many other aspects of big data analysis. About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, Streaming, GraphX, Core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Keywords: big data projects; machine learning libraries; model tuning
Source: VideoLectures.NET
Data collected: 2024-12-30: liyq
Last reviewed: 2024-12-30: liyq
Views: 8