0


使用灰色代码进行聚类的最小代码长度

The Minimum Code Length for Clustering Using the Gray Code
课程网址: http://videolectures.net/ecmlpkdd2011_sugiyama_minimum/  
主讲教师: Mahito Sugiyama
开课单位: 京都大学
开课时间: 2011-10-03
课程语种: 英语
中文简介:
我们提出了利用压缩算法来聚类数值数据的新方法。我们的第一个贡献是设计一种度量,该度量可以根据固定编码方案对给定聚类结果的质量进行评分。我们将此度量称为最小代码长度(MCL)。我们的第二个贡献是提出一种将任何编码方法转换为聚类算法的一般策略,我们将其称为COOL(COding Oriented cLustering)。 COOL具有较低的计算成本,因为它与数据集大小成线性比例。 COOL的聚类结果也显示出最小化MCL。为了进一步说明这种方法,我们将格雷码作为呈现G COOL的编码方案。 G COOL可以找到任意形状的簇并去除噪声。而且,输入参数的变化很稳健;它只需要两个下限来确定簇的数量和每个簇的大小,而大多数用于查找任意形状簇的算法只有在适当调整所有参数的情况下才能正常工作。从理论上讲,G COOL可以实现内部内聚和外部隔离,实验证明它对合成数据集和实际数据集都有效。
课程简介: We propose new approaches to exploit compression algorithms for clustering numerical data. Our first contribution is to design a measure that can score the quality of a given clustering result under the light of a fixed encoding scheme. We call this measure the Minimum Code Length (MCL). Our second contribution is to propose a general strategy to translate any encoding method into a cluster algorithm, which we call COOL (COding-Oriented cLustering). COOL has a low computational cost since it scales linearly with the data set size. The clustering results of COOL is also shown to minimize MCL. To illustrate further this approach, we consider the Gray Code as the encoding scheme to present G-COOL. G-COOL can find clusters of arbitrary shapes and remove noise. Moreover, it is robust to change in the input parameters; it requires only two lower bounds for the number of clusters and the size of each cluster, whereas most algorithms for finding arbitrarily shaped clusters work well only if all parameters are tuned appropriately. G-COOL is theoretically shown to achieve internal cohesion and external isolation and is experimentally shown to work well for both synthetic and real data sets.
关 键 词: 压缩算法; 固定编码; 聚类算法
课程来源: 视频讲座网
最后编审: 2019-04-03:lxf
阅读次数: 81