A Spectral Algorithm for Latent Dirichlet Allocation
Course URL: http://videolectures.net/machine_hsu_algorithm/
Lecturer: Daniel Hsu
Institution: Microsoft
Date: 2013-06-14
Language: English
Course description: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem: estimating the topic-word distributions when only the words are observed and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated from documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k×k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.
Keywords: topic modeling; clustering; representational power
Source: VideoLectures.NET
Last reviewed: 2020-04-26 (chenxin)
Views: 48
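
The course description mentions a spectral decomposition of low-order moments via two SVDs on k×k matrices. The following is a minimal sketch of that idea, not the lecturer's implementation. It assumes the adjusted second- and third-order moments are already available in the factored form pairs = O diag(w) Oᵀ and triples(η) = O diag(w · (Oᵀη)) Oᵀ, where O is the d×k topic-word matrix (columns sum to 1) and w holds positive weights. It works with exact synthetic moments rather than estimates from documents, and the names `recover_topics`, `demo_moments`, and `max_column_error` are hypothetical.

```python
import numpy as np


def recover_topics(pairs, triples, k, seed=0):
    """Recover the d x k topic-word matrix O, up to column order.

    Assumption (not estimated here): pairs = O diag(w) O^T and
    triples(eta) = O diag(w * (O^T eta)) O^T, with w > 0 and O of rank k.
    """
    rng = np.random.default_rng(seed)
    d = pairs.shape[0]

    # Random projection to k dimensions, so both SVDs act on k x k matrices.
    theta = rng.standard_normal((d, k))

    # SVD 1: whiten the projected second-moment matrix.
    a, s, _ = np.linalg.svd(theta.T @ pairs @ theta)
    w_mat = theta @ (a / np.sqrt(s))          # d x k, w_mat.T @ pairs @ w_mat = I_k

    # SVD 2: whitened third moment along a random direction; its singular
    # vectors are (generically) the whitened topics, up to sign.
    eta = w_mat @ rng.standard_normal(k)
    v, _, _ = np.linalg.svd(w_mat.T @ triples(eta) @ w_mat)

    # Undo the whitening; each column is then a scaled topic vector.
    o_hat = pairs @ w_mat @ v
    # Rescale columns to sum to 1, which also removes the sign ambiguity.
    return o_hat / o_hat.sum(axis=0)


def demo_moments(d=8, k=3, seed=1):
    """Build synthetic moments in the assumed factored form."""
    rng = np.random.default_rng(seed)
    o = rng.dirichlet(np.ones(d), size=k).T   # d x k, columns sum to 1
    w = rng.uniform(0.5, 2.0, size=k)         # arbitrary positive weights
    pairs = (o * w) @ o.T

    def triples(eta):
        return (o * (w * (o.T @ eta))) @ o.T

    return o, pairs, triples


def max_column_error(o, o_hat):
    """Largest distance from a true column to its nearest recovered column."""
    dists = np.abs(o[:, :, None] - o_hat[:, None, :]).max(axis=0)
    return dists.min(axis=1).max()


if __name__ == "__main__":
    o_true, pairs, triples = demo_moments()
    o_est = recover_topics(pairs, triples, k=3)
    print("max column error:", max_column_error(o_true, o_est))
```

In practice the moments would have to be estimated from word co-occurrence counts and corrected with the Dirichlet-dependent terms described in the lecture; the sketch skips that step and only illustrates why two small SVDs are enough once the moments have this structure.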