
Information Genealogy: Uncovering the Flow of Ideas in Non-Hyperlinked Document Databases
Course URL: http://videolectures.net/kdd07_shaparenko_ig/
Lecturer: Benyah Shaparenko
Institution: Cornell University
Date: 2007-08-13
Language: English
Course description: We now have incrementally-grown databases of text documents ranging back for over a decade in areas ranging from personal email, to news-articles and conference proceedings. While accessing individual documents is easy, methods for overviewing and understanding these collections as a whole are lacking in number and in scope. In this paper, we address one such global analysis task, namely the problem of automatically uncovering how ideas spread through the collection over time. We refer to this problem as Information Genealogy. In contrast to bibliometric methods that are limited to collections with explicit citation structure, we investigate content-based methods requiring only the text and timestamps of the documents. In particular, we propose a language-modeling approach and a likelihood ratio test to detect influence between documents in a statistically well-founded way. Furthermore, we show how this method can be used to infer citation graphs and to identify the most influential documents in the collection. Experiments on the NIPS conference proceedings and the Physics ArXiv show that our method is more effective than methods based on document similarity.
Keywords: text document databases; language modeling; likelihood ratio test
Source: VideoLectures.NET
Last reviewed: 2019-05-09 (lxf)
Views: 41