Big Data Clustering |
|
Course URL: | http://videolectures.net/single_jain_bigdata/ |
Lecturer: | Anil K. Jain |
Institution: | Michigan State University |
Date: | 2013-01-28 |
Language: | English |
Course description: | The goal of data clustering is to organize a set of n objects into k clusters such that objects in the same cluster are more similar to each other than to objects in different clusters. Clustering is one of the most popular tools for data exploration and data organization, and it has been widely used in almost every scientific discipline that collects data. Given the exponential growth in data generation (estimated to be over 35 trillion gigabytes by the year 2020), clustering is receiving renewed interest and is used in applications such as social networks, image retrieval, web search and gene expression analysis. In this talk I will introduce the data clustering problem and discuss the challenges and opportunities in research on large-scale clustering, with a focus on two main issues: (i) how to define pairwise similarity between objects, and (ii) how to efficiently cluster hundreds of millions of objects. I will present our recent work on approximating the well-known kernel k-means clustering algorithm. I show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with significantly lower run-time complexity and memory requirements. |
Keywords: | data clustering; set organization; data exploration |
Source: | VideoLectures.NET |
Last reviewed: | 2019-09-21:cwx |
Views: | 42 |
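
The two questions raised in the abstract (how to define pairwise similarity, and how to cluster with it) can be illustrated with a minimal sketch of standard kernel k-means. This is not the approximate algorithm presented in the talk. The RBF kernel, the `gamma` value, the balanced random initialization, and the function names `rbf_kernel` and `kernel_kmeans` are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Pairwise similarities K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Lloyd-style kernel k-means on a precomputed n x n kernel matrix K.

    Illustrative sketch only: it stores the full kernel matrix, so memory
    grows as O(n^2).
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    # Assumed initialization: balanced random labels, so no cluster starts empty.
    labels = rng.permutation(np.arange(n) % k)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster keeps infinite distance
            # Squared feature-space distance to the mean of cluster c:
            # K_ii - (2/|c|) * sum_j K_ij + (1/|c|^2) * sum_{j,l} K_jl
            within = K[members][:, members].sum() / size ** 2
            dist[:, c] = diag - 2.0 * K[:, members].sum(axis=1) / size + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels
```

Calling `kernel_kmeans(rbf_kernel(X), k=2)` on an `(n, d)` array `X` returns one cluster label per row. Because the full n x n kernel matrix is held in memory, this exact form becomes impractical at the scale the abstract describes, which is the cost the approximate variant is said to reduce.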