Symmetric Correspondence Topic Models for Multilingual Text Analysis
Course URL: http://videolectures.net/machine_fukumasu_models/
Lecturer: Kosuke Fukumasu
Institution: Kobe University
Date: 2013-04-16
Course language: English
Course description: Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models.
Keywords: topic modeling; text collections; datasets
Course source: VideoLectures.NET
Last reviewed: 2021-05-20:wangjt