0


HyperLogLog超扩展:凹次线性频率统计草图

HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics
课程网址: http://videolectures.net/kdd2017_cohen_hyperloglog/  
主讲教师: Edith Cohen
开课单位: 视频讲座网
开课时间: 2017-10-09
课程语种: 英语
中文简介:
对数据元素计算的最常见的统计数据之一是不同键的数量。Flajolet和Martin在30年前开创了一个研究思路,最终设计出了最优近似计数草图,这种草图的大小在不同键的数量上是双对数的,并提供了一个相对误差很小的估计。此外,草图是可组合的,因此适合于流计算、并行计算或分布式计算。 我们在这里考虑键的频率分布的所有统计数据,其中键对集合的贡献是凹的,并随其频率(次)线性增长。这些基本聚合在文本、图和日志分析中非常常见,包括对数、低频矩和封顶统计信息。 我们为所有凹次线性统计量设计了双对数尺寸的可组合草图。我们的设计结合了理论的优化和实际的简洁。简而言之,我们指定了数据元素到输出元素的定制映射函数,以便我们在数据元素上的目标统计数据由输出元素的(最大)不同统计数据近似化,输出元素可以使用现成的草图近似化。我们的关键见解是将这些目标统计数据与输入频率的{\em补拉普拉斯}变换联系起来。
课程简介: One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation. We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and capping statistics. We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the {em complement Laplace} transform of the input frequencies.
关 键 词: 统计数据; 计数草图; 相对误差; 并行计算
课程来源: 视频讲座网
数据采集: 2022-11-30:chenxin01
最后编审: 2022-11-30:chenxin01
阅读次数: 21