首页概率论
0


挖掘频繁项的前K在一个灵活的滑动窗口的数据流

Mining Top-K Frequent Items in a Data Stream with Flexible Sliding Windows
课程网址: http://videolectures.net/kdd2010_thanh_lam_mtf/  
主讲教师: Hoang Thanh Lam
开课单位: 埃因霍温工业大学
开课时间: 2010-10-01
课程语种: 英语
中文简介:
我们研究了在最近提出的最大频率测量的项目流中找到k个最频繁项目的问题。根据项目的属性,项目的最大频率在滑动窗口上计算,其长度动态变化。除了无参数之外,这种测量项目支持的方式被证明具有更快检测流中突发的优点,特别是如果项目集是异构的。然而,建议用于维护所有频繁项目的算法在项目数量变大时缩小。因此,在本文中,我们建议,不是报告所有频繁项目,而是仅挖掘最常见的$ k $。首先我们证明,为了准确地解决这个问题,我们仍然需要大量的内存(至少是项目数量的线性)。然而,在一些合理的条件下,我们在理论上和经验上都表明存在一种记忆效率高的算法。实现了该算法的原型,并且我们展示了它的性能w.r.t.实际数据和合成数据控制实验的记忆效率。
课程简介: We study the problem of finding the k most frequent items in a stream of items for the recently proposed max-frequency measure. Based on the properties of an item, the max-frequency of an item is counted over a sliding window of which the length changes dynamically. Besides being parameterless, this way of measuring the support of items was shown to have the advantage of a faster detection of bursts in a stream, especially if the set of items is heterogeneous. The algorithm that was proposed for maintaining all frequent items, however, scales poorly when the number of items becomes large. Therefore, in this paper we propose, instead of reporting all frequent items, to only mine the top-$k$ most frequent ones. First we prove that in order to solve this problem exactly, we still need a prohibitive amount of memory (at least linear in the number of items). Yet, under some reasonable conditions, we show both theoretically and empirically that a memory-efficient algorithm exists. A prototype of this algorithm is implemented and we present its performance w.r.t. memory-efficiency on real-life data and in controlled experiments with synthetic data.
关 键 词: 测量项目; 突发检测; 内存算法
课程来源: 视频讲座网
最后编审: 2020-06-29:zyk
阅读次数: 34