Identifying the Original Contribution of a Document via Language Modeling
Course URL: http://videolectures.net/ecmlpkdd09_shaparenko_iocdlm/
Lecturer: Benyah Shaparenko
Institution: Cornell University
Date: 2009-10-20
Language: English
Course description: One major goal of text mining is to provide automatic methods to help humans grasp the key ideas in ever-increasing text corpora. To this end, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and the model is used to identify each document’s most original passages. Unlike heuristic approaches, the statistical model is extensible and open to analysis. We evaluate the approach both on synthetic data and on real data in the domains of research publications and news, showing that the passage impact model outperforms a heuristic baseline method.
Keywords: text mining; text corpora; statistical models
Source: VideoLectures.NET
Last reviewed: 2019-03-27:lxf
Views: 45