Knowledge Discovery of Multiple-topic Document using Parametric Mixture Model with Dirichlet Prior
Course URL: http://videolectures.net/kdd07_sato_kdomt/
Lecturer: Issei Sato
Institution: University of Tokyo
Date: 2007-09-14
Language: English
Course description: Documents, such as those found on Wikipedia and in folksonomies, are often assigned multiple topics as metadata. It is therefore increasingly important to analyze the relationship between a document and the topics assigned to it. In this paper, we propose a novel probabilistic generative model for documents that carry multiple topics as metadata. By focusing on modeling the generation process of a document with multiple topics, we can extract properties specific to such documents. The proposed model extends an existing probabilistic generative model, the Parametric Mixture Model (PMM). PMM models documents with multiple topics by mixing the model parameters of each single topic. However, because PMM assigns the same mixture ratio to every topic, it cannot account for the bias of each topic within a document. To address this problem, we propose a model that places a Dirichlet distribution as a prior on the mixture ratio. We adopt the Variational Bayes method to infer the bias of each topic within a document. We evaluate the proposed model and PMM on the MEDLINE corpus. The F-measure, Precision, and Recall results show that the proposed model is more effective than PMM at multiple-topic classification. Moreover, we indicate the potential of the proposed model to extract topics and document-specific keywords using information about the assigned topics.
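The contrast between an equal mixture ratio and a Dirichlet-distributed one can be illustrated with a minimal sketch. This is not the authors' implementation and it performs no Variational Bayes inference; it only shows the generative side. The toy topic-word probabilities, the concentration values, and all function names are hypothetical, chosen for illustration.

```python
import random


def mix_word_distribution(topic_word_probs, ratios):
    """Mix per-topic word distributions using the given mixture ratios."""
    vocab_size = len(topic_word_probs[0])
    return [
        sum(r * probs[v] for r, probs in zip(ratios, topic_word_probs))
        for v in range(vocab_size)
    ]


def uniform_ratios(num_topics):
    """PMM-style assumption: every assigned topic gets the same ratio."""
    return [1.0 / num_topics] * num_topics


def sample_dirichlet_ratios(alpha, rng):
    """Draw mixture ratios from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical word distributions for two topics over a 3-word vocabulary.
topic_word_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

pmm_ratios = uniform_ratios(len(topic_word_probs))
dirichlet_ratios = sample_dirichlet_ratios([1.0, 1.0], rng)  # assumed alpha

pmm_doc = mix_word_distribution(topic_word_probs, pmm_ratios)
dirichlet_doc = mix_word_distribution(topic_word_probs, dirichlet_ratios)

print("equal ratios:    ", pmm_ratios, "->", pmm_doc)
print("Dirichlet ratios:", dirichlet_ratios, "->", dirichlet_doc)
```

With equal ratios, both topics always contribute the same weight to the document's word distribution. A ratio drawn from a Dirichlet prior lets one topic dominate, which is the per-document topic bias the description refers to.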
Keywords: metadata; probabilistic generative model; document analysis
Source: VideoLectures.NET
Last reviewed: 2019-05-09:lxf
Views: 35