
S-means: similarity driven clustering and its application in gravitational-wave astronomy data mining
Course URL: http://videolectures.net/ecml07_tang_msd/
Lecturer: Lappoon R. Tang
Institution: University of Texas
Date: 2008-01-29
Language: English
Course description: Clustering is the task of classifying unlabeled data into groups. It has been well researched for decades in many disciplines. Clustering the massive amounts of astronomical data generated by multi-sensor networks has become an emerging challenge; the assumptions made by many existing clustering algorithms are often violated in these domains. For example, K-means implicitly assumes that the underlying distribution of the data is Gaussian. Such an assumption does not necessarily hold for astronomical data. Another problem is the determination of K, which is hard to decide when prior knowledge is lacking. While there has been work on discovering a proper value for K given only the data, most existing methods, such as X-means, G-means and PG-means, assume that the model is a mixture of Gaussians in one way or another. In this paper, we present a similarity-driven clustering approach for tackling large-scale clustering problems. A similarity threshold T is used to constrain the search space of possible clustering models such that only those satisfying the threshold are accepted. This forces the search to: 1) explicitly avoid getting stuck in local minima, so that the quality of the models learned has a meaningful lower bound, and 2) discover a proper value for K, since new clusters have to be formed whenever merging points into existing ones would violate the constraint given by the threshold. Experimental results on the UCI KDD archive and on realistic simulated data generated for the Laser Interferometer Gravitational Wave Observatory (LIGO) suggest that such an approach is promising.
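The threshold idea in the description can be illustrated with a small sketch. This is not the authors' S-means implementation; it is a minimal illustration written from the abstract alone. The choice of cosine similarity, the function names (`cosine_similarity`, `mean_vector`, `threshold_clustering`), the seeding from the first point, and the `max_iter` parameter are all assumptions made for this example. A point joins its most similar centroid only if the similarity reaches the threshold T; otherwise it starts a new cluster, so K is discovered rather than fixed in advance.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def threshold_clustering(points, threshold, max_iter=20):
    """Hypothetical threshold-constrained clustering sketch.

    Returns (labels, centroids). The number of clusters is not given
    up front; it grows whenever a point fails the similarity threshold
    against every existing centroid.
    """
    centroids = [list(points[0])]  # assumption: seed with the first point
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = []
        for p in points:
            sims = [cosine_similarity(p, c) for c in centroids]
            best = max(range(len(centroids)), key=sims.__getitem__)
            if sims[best] >= threshold:
                new_labels.append(best)
            else:
                # No existing cluster is similar enough: open a new one.
                centroids.append(list(p))
                new_labels.append(len(centroids) - 1)
        # Recompute centroids as cluster means and drop empty clusters.
        members = {}
        for p, lab in zip(points, new_labels):
            members.setdefault(lab, []).append(p)
        kept = sorted(members)
        remap = {old: new for new, old in enumerate(kept)}
        centroids = [mean_vector(members[old]) for old in kept]
        new_labels = [remap[lab] for lab in new_labels]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids


# Toy data: two directions in the plane.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels, centroids = threshold_clustering(points, threshold=0.9)
```

With a strict threshold the toy points split into two groups, while a very permissive threshold keeps them in a single cluster, which is the sense in which T, rather than a user-supplied K, controls the model.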
Keywords: Computer Science; Machine Learning; Clustering
Source: VideoLectures.NET
Last reviewed: 2020-07-29:yumf
Views: 32