

Fast approximate text document clustering using Compressive Sampling
Course URL: http://videolectures.net/ecmlpkdd2011_park_fast/
Lecturer: Laurence A. F. Park
Institution: University of Melbourne
Date: 2011-10-03
Language: English
Course description: Document clustering involves repeated scanning of a document set; as the set grows, the time required for clustering increases, and the task may even become infeasible due to computational constraints. Compressive sampling is a feature sampling technique that allows a vector to be reconstructed perfectly from a small number of samples, provided that the vector is sparse in some known domain. In this article, we apply the theory behind compressive sampling to the document clustering problem using k-means clustering. We provide a method of computing high-accuracy clusters in a fraction of the time that directly clustering the documents would take. The method uses the Discrete Fourier Transform and the Discrete Cosine Transform. We provide empirical results showing that compressive sampling gives a 14-fold increase in speed with little reduction in accuracy on 7,095 documents, and a very accurate clustering of a 231,219-document set with a 20-fold increase in speed compared to performing k-means clustering on the document set. This shows that compressive clustering is a useful tool for quickly computing approximate clusters.
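The idea in the description can be illustrated with a minimal sketch. It is not the lecture's actual procedure, only an assumption-laden toy: each document is a term-count vector, a random subset of `m` orthonormal DCT-II coefficients is kept for every document, and k-means then runs on those short vectors. The function names (`dct_samples`, `kmeans`, `compressive_kmeans`), the farthest-first initialisation, and the choice of random coefficient indices are all hypothetical choices made for illustration.

```python
import math
import random


def dct_samples(vec, indices):
    """Orthonormal DCT-II coefficients of `vec` at the given frequency indices."""
    n = len(vec)
    out = []
    for k in indices:
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                               for t, x in enumerate(vec)))
    return out


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic farthest-first initialisation
    (an assumption made here for reproducibility)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def compressive_kmeans(docs, k, m, seed=0):
    """Cluster documents using only m randomly chosen DCT coefficients each."""
    rng = random.Random(seed)
    n = len(docs[0])
    indices = sorted(rng.sample(range(n), m))  # same indices for all documents
    compressed = [dct_samples(d, indices) for d in docs]
    return kmeans(compressed, k)
```

Calling `compressive_kmeans(docs, k=2, m=4)` on a list of equal-length term-count vectors returns one cluster label per document. With `m` equal to the vocabulary size the transform is orthogonal, so distances and therefore the clustering match ordinary k-means; smaller `m` trades accuracy for speed, which is the trade-off the description reports.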
Keywords: document clustering; computational constraints; compressive sampling
Source: VideoLectures.NET
Last reviewed: 2019-04-03:lxf