Scale-out Beyond MapReduce
Course URL: http://videolectures.net/kdd2013_ramakrishnan_map_reduce/
Lecturer: Raghu Ramakrishnan
Institution: Microsoft
Date: 2013-09-27
Language: English

Abstract: The amount of data being collected is growing at a staggering pace. The default is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated by instrumenting key customer and systems touchpoints. Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous. These differences in data scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of data to be stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools for graph analytics and machine learning. These new systems use scale-out architectures for both data storage and computation. Hadoop has become a key building block in the new generation of scale-out systems. Early versions of analytic tools over Hadoop, such as Hive and Pig for SQL-like queries, were implemented by translation into Map-Reduce computations. This approach has inherent limitations, and the emergence of resource managers such as YARN and Mesos has opened the door for newer analytic tools to bypass the Map-Reduce layer. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit. In this talk, I will examine this architectural trend, and argue that resource managers are a first step in re-factoring the early implementations of Map-Reduce, and that more work is needed if we wish to support a variety of analytic tools on a common scale-out computational fabric. I will then present REEF, which runs on top of resource managers like YARN and provides support for task monitoring and restart, data movement and communications, and distributed state management. Finally, I will illustrate the value of using REEF to implement iterative algorithms for graph analytics and machine learning.
Keywords: data mining; knowledge extraction; Hadoop
Source: VideoLectures.NET
Data collected: 2020-06-11, 吴淑曼
Last reviewed: 2020-06-15, cxin