Extracting Key Terms From Noisy and Multitheme Documents
Course URL: http://videolectures.net/www09_grineva_ektnmd/
Lecturers: Dmitry Lizorkin; Maxim Grinev; Maria Grineva
Institution: Institute for System Programming, Russian Academy of Sciences
Date: 2009-03-20
Language: English
Course description: We present a novel method for key term extraction from text documents. In our method, a document is modeled as a graph of semantic relationships between the terms of that document. We exploit the following remarkable feature of the graph: the terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select the groups that contain key terms and discard the groups with unimportant terms. To weight terms and determine the semantic relatedness between them, we exploit information extracted from Wikipedia. This approach gives us two advantages. First, it allows multi-theme documents to be processed effectively. Second, it is good at filtering out noise in the document, such as navigation bars or headers in web pages. Evaluations of the method show that it outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages show that our method is substantially more effective on noisy and multi-theme documents than existing methods.
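The pipeline described above (term graph, community detection, criterion-based selection) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the relatedness scores and term weights are made-up numbers standing in for values derived from Wikipedia, connected components of a thresholded graph are used as a crude substitute for a real community detection algorithm, and the scoring function (average internal edge weight per term times mean term weight) is a hypothetical stand-in for the criterion function mentioned in the description.

```python
from collections import defaultdict


def build_graph(relatedness, min_weight=0.3):
    """Build an undirected weighted graph, keeping edges with weight >= min_weight."""
    graph = defaultdict(dict)
    for (a, b), weight in relatedness.items():
        if weight >= min_weight:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def find_communities(terms, graph):
    """Group terms by connected component (a simplification of community detection)."""
    seen, groups = set(), []
    for start in terms:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            term = stack.pop()
            group.append(term)
            for neighbor in graph.get(term, {}):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups


def score(group, graph, term_weight):
    """Hypothetical criterion: internal edge weight per term times mean term weight."""
    internal = sum(
        graph.get(a, {}).get(b, 0.0)
        for i, a in enumerate(group)
        for b in group[i + 1:]
    )
    density = internal / len(group)
    informativeness = sum(term_weight.get(t, 0.0) for t in group) / len(group)
    return density * informativeness


def extract_key_terms(terms, relatedness, term_weight, min_score=0.1):
    """Return the terms of every community whose score reaches min_score."""
    graph = build_graph(relatedness)
    selected = []
    for group in find_communities(terms, graph):
        if score(group, graph, term_weight) >= min_score:
            selected.extend(group)
    return sorted(selected)


# Made-up example: three on-topic terms plus navigation-bar noise.
terms = ["apple", "iphone", "steve jobs", "login", "home", "search"]
relatedness = {
    ("apple", "iphone"): 0.8,
    ("apple", "steve jobs"): 0.7,
    ("iphone", "steve jobs"): 0.6,
    ("login", "home"): 0.1,
}
term_weight = {
    "apple": 0.6, "iphone": 0.7, "steve jobs": 0.8,
    "login": 0.05, "home": 0.02, "search": 0.03,
}

key_terms = extract_key_terms(terms, relatedness, term_weight)
print(key_terms)
```

In this toy input the three on-topic terms form one densely connected group, while the navigation terms end up as isolated vertices with a score of zero and are discarded, which mirrors the noise-filtering behavior the description claims for the method.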
Keywords: computer science; text mining; noise; key terms
Source: VideoLectures.NET
Last reviewed: 2020-06-01 (heyf)
Views: 44