0


约束驱动的聚类

Constraint-Driven Clustering
课程网址: http://videolectures.net/kdd07_ge_cdc/  
主讲教师: Rong Ge
开课单位: 西蒙弗雷泽大学
开课时间: 2007-09-14
课程语种: 英语
中文简介:
聚类方法可以是数据驱动的,也可以是需要驱动的。数据驱动方法旨在发现底层数据的真实结构,同时需要驱动方法旨在组织真实结构以满足某些应用程序要求。因此,需求驱动(例如,受约束的)聚类能够在诸如能量感知传感器网络,隐私保护和市场分割的应用中找到更有用和可操作的集群。但是,现有的约束聚类方法要求用户提供聚类数量,这些聚类通常是事先未知的,但对聚类结果有重要影响。在本文中,我们认为生成可操作集群的更自然的方法是让特定于应用程序的约束决定集群的数量。为此,我们引入了一种新的聚类模型,约束驱动聚类(CDC),它找到了一个先验未指定数量的紧凑聚类,满足所有用户提供的约束。考虑两种一般类型的约束,即最小重要性约束和最小方差约束,以及这两种类型的组合。我们用不同的约束证明了CDC问题的NP硬度。我们提出了一种新颖的动态数据结构,即CD树,它在叶节点中组织数据点,使得每个叶节点近似满足CDC约束并最小化目标函数。基于CD树,我们开发了一种有效的算法来解决新的聚类问题。我们对合成和真实数据集的实验评估证明了生成的集群的质量和算法的可扩展性。
课程简介: Clustering methods can be either data-driven or need-driven. Data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. Thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. However, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. In this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. For this purpose, we introduce a novel cluster model, Constraint-Driven Clustering (CDC), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. Two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. We prove the NP-hardness of the CDC problem with different constraints. We propose a novel dynamic data structure, the CD-Tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the CDC constraints and minimizes the objective function. Based on CD-Trees, we develop an efficient algorithm to solve the new clustering problem. Our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
关 键 词: 聚类方法; 数据驱动; 底层数据
课程来源: 视频讲座网
最后编审: 2019-05-08:cwx
阅读次数: 48