Topic Models with Power-Law Using Pitman-Yor Process
Course URL: http://videolectures.net/kdd2010_sato_tmplupyp/
Lecturer: Issei Sato
Institution: The University of Tokyo
Date: 2010-10-01
Language: English
Abstract: One important approach for knowledge discovery and data mining is to estimate unobserved variables, because latent variables can indicate hidden specific properties of observed data. The latent factor model assumes that each item in a record has a latent factor; the co-occurrence of items can then be modeled by latent factors. In document modeling, a record is a document represented as a "bag of words," meaning that the order of words is ignored; an item is a word, and a latent factor is a topic. Latent Dirichlet allocation (LDA) is a widely used Bayesian topic model that places a Dirichlet distribution over the latent topic distribution of a document having multiple topics. LDA assumes that latent topics, i.e., discrete latent variables, are distributed according to a multinomial distribution whose parameters are generated from the Dirichlet distribution. LDA also models a word distribution by a multinomial distribution whose parameters follow the Dirichlet distribution. This Dirichlet-multinomial setting, however, cannot capture the power-law phenomenon of a word distribution, known as Zipf's law in linguistics. We therefore propose a novel topic model using the Pitman-Yor (PY) process, called the PY topic model. The PY topic model captures two properties of a document: a power-law word distribution and the presence of multiple topics. In an experiment using real data, this model outperformed LDA in document modeling in terms of perplexity.
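A note on the power-law behaviour mentioned above: the difference between the Dirichlet-multinomial assumption and the Pitman-Yor process can be seen in a short simulation. The sketch below is illustrative only and is not the lecture's implementation; the discount d, concentration theta, token count, and function name are assumptions chosen for the example. It seats tokens in a Pitman-Yor Chinese restaurant process and prints the ranked cluster sizes, which decay roughly as a power law when d > 0, the Zipf-like behaviour the PY topic model targets.

import random

def pitman_yor_counts(n_tokens, d=0.5, theta=1.0, seed=0):
    """Cluster sizes after seating n_tokens customers in a Pitman-Yor CRP(d, theta)."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of tokens assigned to cluster k
    for n in range(n_tokens):
        k_existing = len(counts)
        # Open a new cluster with probability (theta + d * K) / (theta + n).
        if rng.random() < (theta + d * k_existing) / (theta + n):
            counts.append(1)
        else:
            # Otherwise join existing cluster k with probability proportional to (counts[k] - d).
            r = rng.uniform(0.0, n - d * k_existing)
            acc = 0.0
            for k, c in enumerate(counts):
                acc += c - d
                if r <= acc:
                    counts[k] += 1
                    break
            else:
                counts[-1] += 1  # guard against floating-point rounding at the boundary
    return counts

if __name__ == "__main__":
    sizes = sorted(pitman_yor_counts(50_000), reverse=True)
    # With discount d > 0 the ranked sizes fall off roughly as a power law (Zipf-like);
    # setting d = 0 recovers the ordinary Chinese restaurant process, which does not.
    for rank in (1, 10, 100, 1000):
        if rank <= len(sizes):
            print(f"rank {rank:5d}: size {sizes[rank - 1]}")

Setting d = 0 gives the Dirichlet-process case for comparison; the heavier tail obtained with d > 0 is what motivates replacing the Dirichlet-multinomial word model in the talk's PY topic model.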
Keywords: Dirichlet allocation; Bayesian; power-law phenomenon
Source: VideoLectures.NET
Last reviewed: 2020-06-01 by 汪洁炜 (volunteer course editor)
Views: 375