Efficient Methods for Topic Model Inference on Streaming Document Collections
Course URL: http://videolectures.net/kdd09_mimno_emft/
Lecturer: David Mimno
Institution: University of Massachusetts
Date: 2009-12-14
Language: English
Course description: Topic models provide a powerful tool for analyzing large text collections by representing high dimensional data in a low dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
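The abstract notes that the classification-based inference method needs only a single matrix multiplication to estimate a new document's topic distribution. A minimal sketch of that idea is shown below; the variable names (`phi`, `doc_counts`, `theta`), the shapes, and the random training values are illustrative assumptions, not the paper's actual trained model or exact scoring rule.

```python
import numpy as np

rng = np.random.default_rng(0)

num_topics, vocab_size = 4, 10

# Stand-in for a trained topic-word matrix: each row is p(word | topic).
phi = rng.random((num_topics, vocab_size))
phi /= phi.sum(axis=1, keepdims=True)

# A previously unseen document as a bag-of-words count vector.
doc_counts = np.array([3, 0, 1, 0, 0, 2, 0, 0, 1, 0], dtype=float)

# Single matrix multiplication: score each topic by how strongly it
# explains the document's word counts, then normalize the scores
# into an inferred topic distribution.
scores = phi @ doc_counts
theta = scores / scores.sum()

print(theta)
```

No iterative sampling or variational updates are involved, which is why the abstract highlights this method's speed: inference cost is one matrix-vector product per document.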
Keywords: high-dimensional data in a low-dimensional subspace; topic models; approximate inference techniques; algorithms and data structures for evaluating sampling distributions
Source: VideoLectures.NET
Last reviewed: 2020-06-29 (yumf)
Views: 127