Redefining Class Definitions using Constraint-Based Clustering: An Application to Remote Sensing of the Earth's Surface |
|
Course URL: | http://videolectures.net/kdd2010_preston_rcdu/ |
Lecturer: | Preston Dan R |
Institution: | Tufts University |
Course date: | 2010-10-01 |
Language: | English |
Course description: | Two aspects are crucial when constructing any real-world supervised classification task: the set of classes whose distinction might be useful for the domain expert, and the set of classifications that can actually be distinguished by the data. Often a set of labels is defined with some initial intuition, but these labels are not the best match for the task. For example, labels have been assigned for land cover classification of the Earth, but it has been suspected that these labels are not ideal: some classes may be best split into subclasses, whereas others should be merged. This paper formalizes this problem using three ingredients: the existing class labels, the underlying separability in the data, and a special type of input from the domain expert. We require a domain expert to specify an $L \times L$ matrix of pairwise probabilistic constraints expressing their beliefs as to whether the $L$ classes should be kept separate, merged, or split. This type of input is intuitive and easy for experts to supply. We then show that the problem can be solved by casting it as an instance of penalized probabilistic clustering (PPC). Our method, Class-Level PPC (CPPC), extends PPC, showing how its time complexity can be reduced from $O(N^2)$ to $O(NL)$ for the problem of class re-definition. We further extend the algorithm by presenting a heuristic to measure adherence to constraints, and by providing a criterion for determining the model complexity (number of classes) for constraint-based clustering. We demonstrate and evaluate CPPC on artificial data and on our motivating domain of land cover classification. For the latter, an evaluation by domain experts shows that the algorithm discovers novel class definitions that are better suited to land cover classification than the original set of labels. |
Keywords: | labels; underlying data; model determination; land cover classification |
Source: | VideoLectures.NET |
Last reviewed: | 2019-12-21:lxf |
Views: | 59 |
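
The description mentions an $L \times L$ matrix of class-level pairwise constraints and a cost reduction relative to $O(N^2)$ instance-level constraints. The sketch below is not the CPPC algorithm from the paper; it only illustrates, under stated assumptions, why class-level constraints avoid enumerating all instance pairs. The assumptions are: entry `p[a][b]` is a symmetric belief in [0, 1] that points labeled `a` and `b` belong in the same cluster, cluster assignments are hard, and the "agreement" score is a hypothetical measure invented for this illustration. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations


def agreement_naive(labels, clusters, p):
    """Score every instance pair explicitly: O(N^2) pairs."""
    total = 0.0
    for i, j in combinations(range(len(labels)), 2):
        belief = p[labels[i]][labels[j]]  # class-level belief for this pair
        # Reward same-cluster pairs by the belief, split pairs by its complement.
        total += belief if clusters[i] == clusters[j] else 1.0 - belief
    return total


def agreement_by_class(labels, clusters, p):
    """Same score from (label, cluster) counts, without listing instance pairs."""
    num_classes = len(p)
    counts = Counter(zip(labels, clusters))  # (label, cluster) -> count
    class_sizes = Counter(labels)
    cluster_ids = set(clusters)
    total = 0.0
    for a in range(num_classes):
        for b in range(a, num_classes):
            if a == b:
                # Pairs drawn from within one class.
                same = sum(counts[(a, k)] * (counts[(a, k)] - 1) // 2
                           for k in cluster_ids)
                pairs = class_sizes[a] * (class_sizes[a] - 1) // 2
            else:
                # Pairs with one point from each of two classes.
                same = sum(counts[(a, k)] * counts[(b, k)] for k in cluster_ids)
                pairs = class_sizes[a] * class_sizes[b]
            total += p[a][b] * same + (1.0 - p[a][b]) * (pairs - same)
    return total


# Hypothetical example: 3 original classes, 8 points, 3 clusters.
# Assumed reading: a low diagonal (0.5) hints that class 0 may be split,
# a high off-diagonal (0.8) hints that classes 0 and 1 may be merged.
p = [[0.5, 0.8, 0.1],
     [0.8, 0.9, 0.1],
     [0.1, 0.1, 0.9]]
labels = [0, 0, 0, 1, 1, 2, 2, 2]
clusters = [0, 0, 1, 1, 1, 2, 2, 2]
```

Both functions return the same score up to floating-point rounding. The second needs one pass over the data to build the counts and then loops only over class pairs and clusters, which is the kind of saving that sharing constraints at the class level makes possible.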