Mining Statistically Important Equivalence Classes
Course URL: http://videolectures.net/kdd07_li_msie/
Lecturer: Jinyan Li
Institution: Institute for Infocomm Research
Date: 2007-08-14
Language: English
Course description: The support-confidence framework is the most common measure used in itemset mining algorithms, because its anti-monotonicity effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results, as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, and odds ratio. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance, regardless of which test statistic is used. Because an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine only closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare it to LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare it to epMiner, which is the most recent algorithm for mining a type of relative risk pattern known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes by multiple orders of magnitude. These statistically ranked patterns and the efficiency of the algorithm have high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
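To make the notions of equivalence class, closed pattern, and generator concrete, here is a minimal Python sketch. It is not the simultaneous depth-first algorithm from the lecture: it enumerates every itemset by brute force, which is only feasible for tiny inputs. The toy transactions and the names `tidset` and `equivalence_classes` are assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"c", "d"},
]


def tidset(itemset, transactions):
    """Return the indices of the transactions that contain every item of itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def equivalence_classes(transactions, min_support=1):
    """Group frequent itemsets by the set of transactions they occur in.

    Brute-force enumeration (exponential in the number of items); this is
    NOT the depth-first mining scheme described in the abstract.
    """
    items = sorted(set().union(*transactions))
    groups = defaultdict(list)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = tidset(itemset, transactions)
            if len(tids) >= min_support:
                groups[tids].append(itemset)

    classes = {}
    for tids, members in groups.items():
        # The closed pattern is the unique maximal member: the union of all members.
        closed = frozenset().union(*members)
        # Generators are the minimal members: no proper subset is in the class.
        generators = [m for m in members if not any(o < m for o in members)]
        classes[tids] = {"closed": closed, "generators": generators}
    return classes


if __name__ == "__main__":
    for tids, cls in equivalence_classes(TRANSACTIONS).items():
        print(sorted(tids), sorted(cls["closed"]), [sorted(g) for g in cls["generators"]])
```

Every itemset in a class occurs in exactly the same transactions, so any statistic computed from those occurrences and the class labels (chi-square, risk ratio, odds ratio) is identical across the class. That is why reporting the closed pattern and its generators is enough to describe the whole class.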
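Keywords: itemset mining algorithms; search lattice; simultaneous depth-first search scheme; minimal emerging patterns
Source: VideoLectures.NET
Last reviewed: 2019-05-09:cjy
Views: 20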