Comparing Apples and Oranges - Measuring Differences between Data Mining Results
Course URL: http://videolectures.net/ecmlpkdd2011_vreeken_differences/
Lecturer: Jilles Vreeken
Institution: University of Antwerp
Date: 2011-11-30
Language: English
Course description: Deciding whether the results of two different mining algorithms provide significantly different information is an important open problem in exploratory data mining. Whether the goal is to select the most informative result for analysis, or to decide which mining approach will likely provide the most novel insight, it is essential that we can tell how different the information is that two results provide. In this paper we take a first step towards comparing exploratory results on binary data. We propose to meaningfully convert results into sets of noisy tiles, and to compare these sets by Maximum Entropy modelling and Kullback-Leibler divergence. The measure we construct this way is flexible, and allows us to naturally include background knowledge, such that differences in results can be measured from the perspective of what a user already knows. Furthermore, adding to its interpretability, it coincides with Jaccard dissimilarity when we only consider exact tiles. Our approach provides a means to study and tell differences between results of different data mining methods. As an application, we show that it can also be used to identify which parts of results best redescribe other results. Experimental evaluation shows our measure gives meaningful results, correctly identifies methods that are similar in nature, and automatically provides sound redescriptions of results.
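The description notes that the measure coincides with Jaccard dissimilarity when only exact tiles are considered. The sketch below illustrates that special case only; it is not the Maximum Entropy and Kullback-Leibler measure from the lecture. It assumes that an exact tile is a (rows, columns) pair whose cells are all ones, and that the dissimilarity is taken over the sets of data cells each result covers, a detail the description does not spell out. The function names and the toy tiles are hypothetical.

```python
from itertools import product


def tile_cells(rows, cols):
    """Return the set of (row, column) cells covered by one exact tile."""
    return set(product(rows, cols))


def covered_cells(tiles):
    """Union of the cells covered by a collection of (rows, cols) tiles."""
    cells = set()
    for rows, cols in tiles:
        cells |= tile_cells(rows, cols)
    return cells


def jaccard_dissimilarity(tiles_a, tiles_b):
    """Compute 1 - |A ∩ B| / |A ∪ B| over the cells covered by each result."""
    a, b = covered_cells(tiles_a), covered_cells(tiles_b)
    union = a | b
    if not union:
        # Two empty results are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two toy results, each a single 2x2 tile; they share exactly one cell.
result_a = [((0, 1), (0, 1))]
result_b = [((1, 2), (1, 2))]
print(jaccard_dissimilarity(result_a, result_b))
```

In this toy case the two results cover seven distinct cells and share one, so the dissimilarity is 6/7. Identical results give 0 and results with no shared cells give 1.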
Keywords: computer science; data mining; difference measurement
Source: VideoLectures.NET
Last reviewed: 2020-06-13, 邬启凡 (volunteer course editor)