Discriminative Topic Modeling based on Manifold Learning
Course URL: http://videolectures.net/kdd2010_huh_dtnbml/
Lecturer: Seungil Huh
Institution: Carnegie Mellon University
Date: 2010-10-01
Language: English
Course description: Topic modeling has been widely used for data analysis in various domains, including text documents. Previous topic models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These models, however, do not take into account the manifold structure of the data, which is generally informative for the non-linear dimensionality reduction mapping. More recent models, namely Laplacian PLSI (LapPLSI) and the Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown the resulting benefits. But these approaches fall short of the full discriminating power of manifold learning, as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this paper, we propose the Discriminative Topic Model (DTM), which separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving the local consistency. We also present a novel model fitting algorithm based on generalized EM and the concept of Pareto improvement. As a result, DTM achieves higher classification performance in a semi-supervised setting by effectively exposing the manifold structure of the data. We provide empirical evidence on text corpora to demonstrate the success of DTM in terms of classification accuracy and robustness to parameters compared to state-of-the-art techniques.
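The contrast drawn in the abstract, between pulling neighboring pairs together and also pushing non-neighboring pairs apart, can be illustrated with a small sketch. This is not the DTM objective or its fitting algorithm, which this page does not specify. The functions `knn_pairs` and `manifold_terms`, the toy word counts, and the two candidate topic representations are all hypothetical, chosen only to show why a local-only penalty cannot rule out a collapsed representation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_pairs(points, k):
    """Unordered pairs (i, j), i < j, where one point is among the
    other's k nearest neighbors in the input space."""
    n = len(points)
    pairs = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def manifold_terms(topics, neighbor_pairs):
    """Return (near, far): summed squared distances between low-rank
    representations of neighboring and non-neighboring pairs."""
    near = far = 0.0
    n = len(topics)
    for i in range(n):
        for j in range(i + 1, n):
            d = sq_dist(topics[i], topics[j])
            if (i, j) in neighbor_pairs:
                near += d
            else:
                far += d
    return near, far


# Hypothetical word-count vectors: two documents per underlying theme.
docs = [(5, 1, 0, 0), (4, 2, 0, 0), (0, 0, 3, 4), (0, 0, 5, 2)]
pairs = knn_pairs(docs, k=1)

# Two hypothetical topic representations of the same four documents.
separated = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
collapsed = [(0.5, 0.5)] * 4  # every document gets the same mixture

near_sep, far_sep = manifold_terms(separated, pairs)
near_col, far_col = manifold_terms(collapsed, pairs)
print("separated:", near_sep, far_sep)
print("collapsed:", near_col, far_col)
```

In this toy setup the collapsed representation scores best on the neighbor term alone, since all its pairwise distances are zero, yet it carries no information about the two themes. Only the non-neighbor term distinguishes it from the separated representation, which is the kind of global structure the abstract says local-only regularizers leave unconstrained.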
Keywords: topic modeling; semantic analysis; nonlinear dimensionality reduction
Source: VideoLectures.NET
Last reviewed: 2019-05-11:lxf