

Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary?
Course URL: http://videolectures.net/icml09_vinh_itm/
Lecturer: Nguyen Xuan Vinh
Institution: University of New South Wales
Date: 2009-08-26
Language: English
Abstract: Information-theoretic measures form a fundamental class of similarity measures for comparing clusterings, besides the classes of pair-counting based and set-matching based measures. In this paper, we discuss the necessity of a correction for chance for information-theoretic clustering comparison measures. We observe that the baseline for such measures, i.e. their average value under random partitioning of a data set, does not take on a constant value, and tends to vary more when the ratio of the number of data points to the number of clusters is small. A similar effect occurs in some non-information-theoretic measures such as the well-known Rand Index. Assuming a hypergeometric model of randomness, we derive the analytical formula for the expected mutual information value between a pair of clusterings, and then propose adjusted versions of several popular information-theoretic measures. Some examples are given to demonstrate the need for and usefulness of the adjusted measures.
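The correction the abstract describes can be sketched as follows: compute the mutual information (MI) of two clusterings, subtract its expectation under the hypergeometric model of random partitioning, and normalize so that a chance-level agreement scores near 0 and identical clusterings score 1. This is a minimal self-contained sketch, not code from the lecture; function names are illustrative, and log-factorials are taken via `math.lgamma` to avoid overflow:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (nats) of a clustering given as a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(u, v):
    """MI (nats) between two clusterings of the same n points."""
    n = len(u)
    cu, cv = Counter(u), Counter(v)
    mi = 0.0
    for (i, j), nij in Counter(zip(u, v)).items():
        mi += (nij / n) * math.log(n * nij / (cu[i] * cv[j]))
    return mi

def expected_mi(u, v):
    """Expected MI under the hypergeometric model: sum over all cluster
    pairs (i, j) and all feasible contingency counts nij, weighting each
    MI contribution by the hypergeometric probability of nij."""
    n = len(u)
    lg = lambda k: math.lgamma(k + 1)  # log k!
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (lg(ai) + lg(bj) + lg(n - ai) + lg(n - bj)
                         - lg(n) - lg(nij) - lg(ai - nij)
                         - lg(bj - nij) - lg(n - ai - bj + nij))
                emi += math.exp(log_p) * (nij / n) * math.log(n * nij / (ai * bj))
    return emi

def adjusted_mi(u, v):
    """Chance-corrected MI: (MI - E[MI]) / (max(H(U), H(V)) - E[MI])."""
    mi, emi = mutual_information(u, v), expected_mi(u, v)
    denom = max(entropy(u), entropy(v)) - emi
    return (mi - emi) / denom if denom else 0.0
```

With this normalization, identical clusterings score exactly 1 regardless of the number of clusters, while random partitions score around 0, which is precisely the baseline behavior the lecture argues plain MI lacks.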
Keywords: machine learning; clustering; analytical model
Source: VideoLectures.NET
Last reviewed: 2019-12-04 (lxf)
Views: 56