Sampling for Big Data
Course URL: http://videolectures.net/kdd2014_cormode_duffield_sampling_data/  
Lecturers: Nick Duffield; Graham Cormode
Institution: Texas A&M University
Date: 2014-10-08
Language: English
Abstract: One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault-tolerant storage architectures and parallel and graph computation models such as MapReduce, Pregel and Giraph. However, not all environments can support this scale of resources, and not all queries need an exact response. This motivates the use of sampling to generate summary datasets that support rapid queries and prolong the useful life of the data in storage. To be effective, sampling must mediate the tensions between resource constraints, data characteristics, and the required query accuracy. The state of the art in sampling goes far beyond simple uniform selection of elements, in order to maximize the usefulness of the resulting sample. This tutorial reviews progress in sample design for large datasets, including streaming and graph-structured data. Applications to sampling network traffic and social networks are discussed.
Keywords: big data; datasets; data characteristics
Source: VideoLectures.NET
Record collected: 2020-11-04:zyk
Last reviewed: 2020-11-04:zyk
Views: 41