

Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach
Course URL: http://videolectures.net/iswc2011_song_linkages/
Lecturer: Dezhao Song
Institution: Lehigh University
Date: 2011-11-25
Language: English
Course description: One challenge for Linked Data is scalably establishing high-quality owl:sameAs links between instances (e.g., people, geographical locations, publications) in different data sources. Traditional approaches to this entity coreference problem do not scale because they exhaustively compare every pair of instances. In this paper, we propose a candidate selection algorithm for pruning the search space for entity coreference. We select candidate instance pairs by computing a character-level similarity on discriminating literal values that are chosen using domain-independent unsupervised learning. We index the instances on the chosen predicates' literal values to efficiently look up similar instances. We evaluate our approach on two RDF and three structured datasets. We show that the traditional metrics do not always accurately reflect the relative benefits of candidate selection, and we propose additional metrics. We show that our algorithm frequently outperforms alternatives and is able to process 1 million instances in under one hour on a single Sun Workstation. Furthermore, on the RDF datasets, we show that the entire entity coreference process scales well when our technique is applied. Surprisingly, this high-recall, low-precision filtering mechanism frequently leads to higher F-scores in the overall system.
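The indexing-and-lookup idea in the description can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes each instance has already been reduced to one literal value (the paper chooses discriminating predicates through unsupervised learning, which is omitted here), and it substitutes Jaccard similarity over character trigrams for the paper's character-level similarity. The names `char_ngrams`, `jaccard`, `select_candidates`, and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, trimmed string."""
    s = text.lower().strip()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_candidates(instances, threshold=0.5, n=3):
    """Return candidate pairs (id1, id2) whose literals look similar.

    instances: dict mapping an instance id to one literal value
    (assumption: the discriminating literal was picked beforehand).
    """
    grams = {iid: char_ngrams(value, n) for iid, value in instances.items()}

    # Inverted index: n-gram -> ids of instances whose literal contains it.
    index = defaultdict(set)
    for iid, gram_set in grams.items():
        for gram in gram_set:
            index[gram].add(iid)

    # Only instances sharing at least one n-gram are ever compared,
    # which avoids the exhaustive all-pairs comparison.
    candidates = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in candidates:
                continue
            if jaccard(grams[a], grams[b]) >= threshold:
                candidates.add((a, b))
    return candidates


if __name__ == "__main__":
    data = {
        "a": "Lehigh University",
        "b": "Lehigh Univ.",
        "c": "Stanford University",
    }
    print(select_candidates(data, threshold=0.5))
```

A filter of this kind is meant to favor recall over precision: the pairs it returns are only candidates, and a separate, more expensive coreference step would decide which of them receive an owl:sameAs link.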
Keywords: Linked Data; candidate selection algorithm; unsupervised learning
Source: VideoLectures.NET
Last reviewed: 2019-05-05 (lxf)
Views: 47