0


大数据集快速准确的k均值

Fast and Accurate k-means For Large Datasets
课程网址: http://videolectures.net/nips2011_shindler_largedatasets/  
主讲教师: Michael Shindler
开课单位: 俄勒冈州立大学
开课时间: 2012-01-25
课程语种: 英语
中文简介:
群集是许多应用程序的常见问题。我们认为$ k $意味着在数据太大而无法存储在主存储器中并且必须按顺序访问的情况下,例如从磁盘访问,并且我们必须使用尽可能少的内存。我们的算法基于最近的理论结果,并进行了重大改进以使其实用。我们的方法极大地简化了最近开发的设计和分析算法,并消除了近似保证,存储器要求和运行时间中的大常数因素。然后我们合并近似最近邻搜索以$ o(nk)$计算$ k $均值(其中$ n $是数据点的数量;请注意,在给定解决方案的情况下计算成本需要$ \ Theta(nk)$时间)。我们证明了我们的算法在理论上和实验上都优于现有算法,从而在理论和实践上都提供了最先进的性能。
课程简介: Clustering is a popular problem with many applications. We consider the $k$-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute $k$-means in $o(nk)$ (where $n$ is the number of data points; note that computing the cost, given a solution, takes $\Theta(nk)$ time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
关 键 词: 群集; 磁盘访问; 分析算法
课程来源: 视频讲座网
最后编审: 2019-07-26:cwx
阅读次数: 68