Mining Topics in Documents: Standing on the Shoulders of Big Data
Course URL: http://videolectures.net/kdd2014_chen_mining_topics/
Lecturer: Zhiyuan (Brett) Chen
Institution: University of Illinois
Date: 2014-10-08
Language: English
Course description: Topic modeling has been widely used to mine topics from documents. A key weakness of topic modeling, however, is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics for generating coherent topics. In practice, many document collections do not have that many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model toward better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
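To make the two knowledge forms concrete, here is a minimal Python sketch of how must-links and cannot-links might be mined from the top words of previously learned topics. It is an illustration under simplifying assumptions, not the algorithm presented in the lecture: the function name `mine_links`, its thresholds, and the toy topics are all hypothetical, and the sketch does not address wrong knowledge or knowledge transitivity.

```python
from collections import Counter
from itertools import combinations


def mine_links(prior_topics, min_support=2, max_shared=0, min_word_count=2):
    """Mine candidate must-link and cannot-link word pairs (illustrative only).

    prior_topics: list of topics from past modeling runs, each a list of top words.
    A pair is a must-link if it shares at least `min_support` prior topics.
    A pair is a cannot-link if both words occur in at least `min_word_count`
    prior topics but share at most `max_shared` of them.
    """
    pair_counts = Counter()  # number of prior topics that contain both words
    word_counts = Counter()  # number of prior topics that contain the word
    for topic in prior_topics:
        words = sorted(set(topic))  # sort so each pair has one canonical order
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    must_links = {pair for pair, n in pair_counts.items() if n >= min_support}

    frequent = sorted(w for w, n in word_counts.items() if n >= min_word_count)
    cannot_links = {
        pair
        for pair in combinations(frequent, 2)
        if pair_counts[pair] <= max_shared  # Counter returns 0 for unseen pairs
    }
    return must_links, cannot_links


# Toy prior topics, standing in for results from earlier product domains.
prior_topics = [
    ["price", "cheap", "expensive"],
    ["price", "cheap", "cost"],
    ["battery", "life", "charge"],
    ["battery", "charge", "hour"],
]
must_links, cannot_links = mine_links(prior_topics)
print(sorted(must_links))
print(sorted(cannot_links))
```

In this toy setting, words that repeatedly share a topic (such as "cheap" and "price") become must-links, while words that are each common but never share a topic (such as "battery" and "price") become cannot-links. A knowledge-based topic model would then use such pairs to bias inference on a new, small document collection.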
Keywords: topic modeling; statistical information; knowledge transitivity
Source: VideoLectures.NET
Data collected: 2022-12-01 (chenjy)
Last reviewed: 2022-12-01 (chenjy)