Topic Significance Ranking of LDA Generative Models
Course URL: http://videolectures.net/ecmlpkdd09_alsumait_tsrlda/
Lecturer: Loulwah AlSumait
Institution: Kuwait University
Date: 2009-10-20
Language: English
Course description: Topic models, like Latent Dirichlet Allocation (LDA), have recently been used to automatically generate topics for text corpora and to subdivide the corpus words among those topics. However, not all of the estimated topics are of equal importance or correspond to genuine themes of the domain. Some topics can be collections of irrelevant words or represent insignificant themes. Current approaches to topic modeling rely on manual examination to find meaningful topics. This paper presents the first automated, unsupervised analysis of LDA models that distinguishes junk topics from legitimate ones and ranks topics by significance. Essentially, the distance between a topic distribution and three definitions of a "junk distribution" is computed using a variety of measures, from which an expressive figure of topic significance is derived using a four-phase Weighted Combination approach. Experiments on synthetic and benchmark datasets show the effectiveness of the proposed approach in ranking topic significance.
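
To illustrate the core idea, the sketch below scores each topic by its distance from one possible "junk" definition, the uniform distribution over the vocabulary, using KL divergence, and ranks topics by that distance. This is a minimal sketch under assumptions of my own (a single uniform junk distribution, KL divergence as the measure, NumPy/SciPy, and the hypothetical helper name rank_topics_by_significance); the method presented in the lecture combines several junk definitions and distance measures through a four-phase Weighted Combination, which is not reproduced here.

    # Minimal illustrative sketch (not the authors' implementation): rank LDA topics
    # by their distance from one assumed "junk" distribution, the uniform
    # distribution over the vocabulary. Topics far from uniform are treated as
    # more significant; near-uniform topics look junk-like.
    import numpy as np
    from scipy.special import rel_entr  # elementwise terms x * log(x / y) of KL divergence

    def rank_topics_by_significance(topic_word: np.ndarray) -> np.ndarray:
        """topic_word: (n_topics, vocab_size) matrix, each row a word distribution (LDA phi).
        Returns topic indices ordered from most to least significant."""
        n_topics, vocab_size = topic_word.shape
        junk_uniform = np.full(vocab_size, 1.0 / vocab_size)  # assumed junk definition
        # KL(topic || junk): a larger distance from junk means a more significant topic
        kl_to_junk = np.array([rel_entr(t, junk_uniform).sum() for t in topic_word])
        return np.argsort(-kl_to_junk)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Toy data: 3 peaked (meaningful-looking) topics and 2 near-uniform (junk-like) topics
        peaked = rng.dirichlet(alpha=np.full(50, 0.05), size=3)
        flat = rng.dirichlet(alpha=np.full(50, 50.0), size=2)
        phi = np.vstack([peaked, flat])
        print("Topics ranked by significance:", rank_topics_by_significance(phi))

In this toy run the three peaked topics should be ranked ahead of the two near-uniform ones, mirroring how distance from a junk distribution separates meaningful topics from insignificant ones.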
Keywords: automatic text generation; automated unsupervised analysis; text corpora
Source: VideoLectures.NET
Last reviewed: 2019-03-23 (lxf)
Views: 116