0


流子模块最大化:海量数据动态摘要

Streaming Submodular Maximization: Massive Data Summarization on the Fly
课程网址: http://videolectures.net/kdd2014_mirzasoleiman_submodular_maximiz...  
主讲教师: Baharan Mirzasoleiman
开课单位: 苏黎世理工学院
开课时间: 2014-10-07
课程语种: 英语
中文简介:

一个人怎么能“动态”地总结一个庞大的数据集,甚至还没有完整地看到它呢?在本文中,我们解决了从大量数据中提取代表性元素的问题。即,我们希望根据某个目标函数从流中选择最具代表性的k个数据点的子集。许多自然的“代表性”概念都满足亚模量,这是收益递减的直观概念。因此,可以减少这样的问题,以使受基数约束的子模集函数最大化。次模块最大化的经典方法需要完全访问数据集。我们开发了第一个有效的流算法,该算法具有恒定系数1/2ε的近似保证,可提供最佳解决方案,只需要对数据进行一次遍历,而内存与数据大小无关。在我们的实验中,我们在数百万个数据点上广泛评估了我们的方法在几种应用中的有效性,包括训练大规模内核方法和基于示例的聚类。我们观察到,我们的流传输方法在实现几乎相同的实用价值的同时,其运行速度比以前的工作快约100倍。

课程简介: How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. I.e., we would like to select a subset of say k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with constant factor 1/2-ε approximation guarantee to the optimum solution, requiring only a single pass through the data, and memory independent of data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
关 键 词: 子模集函数; 数据集
课程来源: 视频讲座网
数据采集: 2020-12-29:zyk
最后编审: 2020-12-29:zyk
阅读次数: 113