Data Clustering: 50 Years Beyond K-means
| Course URL: | http://videolectures.net/ecmlpkdd08_jain_dcyb/ |
| Lecturer: | Anil K. Jain |
| Institution: | Michigan State University |
| Date: | 2008-10-10 |
| Language: | English |
| Description: | The practice of classifying objects according to perceived similarities is the basis for much of science. Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes cluster analysis (unsupervised learning) from discriminant analysis (supervised learning). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories. The development of clustering methodology has been a truly interdisciplinary endeavor. Taxonomists, social scientists, psychologists, biologists, statisticians, engineers, computer scientists, medical researchers, and others who collect and process real data have all contributed to clustering methodology. According to JSTOR, data clustering first appeared in the title of a 1954 article dealing with anthropological data. One of the best-known, simplest, and most popular clustering algorithms is K-means. It was independently discovered by Steinhaus (1956), Lloyd (1957), Ball and Hall (1965), and MacQueen (1967)! A search via Google Scholar found 22,000 entries with the word clustering and 1,560 entries with the words data clustering in 2007 alone. Among all the papers presented at CVPR, ECML, ICDM, ICML, NIPS, and SDM in 2006 and 2007, 150 dealt with clustering. This vast literature speaks to the importance of clustering in machine learning, data mining, and pattern recognition. A cluster comprises a number of similar objects grouped together. While it is easy to give a functional definition of a cluster, it is very difficult to give an operational definition of one. This is because objects can be grouped into clusters with different purposes in mind. Data can reveal clusters of different shapes and sizes. Thus the crucial problem in identifying clusters in data is to specify or learn a similarity measure. In spite of the thousands of clustering algorithms that have been published, a user still faces a dilemma regarding the choice of algorithm, distance metric, data normalization, number of clusters, and validation criteria. Familiarity with the application domain and the clustering goals will certainly help in making an intelligent choice. This talk will provide background, discuss major challenges and key issues in designing clustering algorithms, summarize well-known clustering methods, and point out some of the emerging research directions, including semi-supervised clustering that exploits pairwise constraints, ensemble clustering that combines the results of multiple clusterings, learning distance metrics from side information, and simultaneous feature selection and clustering. |
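The K-means procedure named in the abstract alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal sketch of Lloyd's iteration in plain Python may help make this concrete. It is not code from the lecture; the function name, the simple first-k initialization, and the two-blob toy data are illustrative assumptions (real implementations prefer k-means++ seeding and multiple restarts):

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm for K-means clustering.

    points: list of equal-length numeric tuples; k: number of clusters.
    Initialization simply takes the first k points as centroids, a
    deliberately naive choice kept here for determinism.
    """
    def sqdist(p, q):
        # Squared Euclidean distance: the similarity measure K-means assumes,
        # echoing the abstract's point that choosing the measure is crucial.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sqdist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments cannot change
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated blobs; the centroids converge to the blob means,
# (0.5, 0.5) and (10.5, 10.5).
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(data, k=2)
```

On well-separated data like this toy example the iteration recovers the group means, but, as the abstract's dilemma list suggests, the outcome in general depends on the initialization, the distance metric, and the chosen k.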
| Keywords: | cluster analysis; k-means; distance metrics |
| Source: | VideoLectures.NET |
| Last reviewed: | 2020-06-13:zyk |
| Views: | 163 |
