Collective Annotation of Wikipedia Entities in Web Text |
|
Course URL: | http://videolectures.net/kdd09_chakrabarti_caowe/ |
Lecturer: | Sayali Kulkarni |
Institution: | Indian Institute of Technology |
Date: | 2009-09-14 |
Language: | English |
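Summary (translated from the Chinese): | To take a first step from keyword-based search toward entity-based search, suitable token spans ("spots") in documents must be identified as references to real entities in an entity catalog. Several systems have been proposed that link spots on Web pages to Wikipedia entities. They rely largely on local compatibility between the text around a spot and the textual metadata associated with an entity. Two recent systems exploit dependencies between labels, but only in limited ways. We propose a general collective disambiguation approach. Our premise is that a coherent document refers to entities from one or a few related topics or domains. We give formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. Optimizing the overall entity assignment is NP-hard. We study practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters. In experiments involving more than a hundred manually annotated Web pages and tens of thousands of spots, our approaches significantly outperform recently proposed algorithms. |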
Abstract: | To take the first step beyond keyword-based search toward entity-based search, suitable token spans ("spots") on documents must be identified as references to real-world entities from an entity catalog. Several systems have been proposed to link spots on Web pages to entities in Wikipedia. They are largely based on local compatibility between the text around the spot and textual metadata associated with the entity. Two recent systems exploit inter-label dependencies, but in limited ways. We propose a general collective disambiguation approach. Our premise is that coherent documents refer to entities from one or a few related topics or domains. We give formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. Optimizing the overall entity assignment is NP-hard. We investigate practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters. In experiments involving over a hundred manually-annotated Web pages and tens of thousands of spots, our approaches significantly outperform recently-proposed algorithms. |
Keywords: | Computer Science; Data Mining; Social Content |
Source: | VideoLectures.NET |
Last reviewed: | 2020-06-15:heyf |
Views: | 40 |
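
The trade-off the abstract describes, local spot-to-entity compatibility against global coherence between the chosen entities, can be illustrated with a small hill-climbing sketch. This is not the authors' implementation. The additive objective, the `weight` parameter, the function names, and the toy scores below are all assumptions made for illustration only.

```python
from itertools import combinations


def objective(assign, local, coherence, weight=1.0):
    """Local compatibility of each chosen entity plus weighted pairwise coherence.

    Hypothetical objective: the record above only says such a trade-off exists.
    """
    score = sum(local[spot][entity] for spot, entity in assign.items())
    for (_, e1), (_, e2) in combinations(assign.items(), 2):
        score += weight * coherence.get(frozenset((e1, e2)), 0.0)
    return score


def hill_climb(local, coherence, weight=1.0):
    """Greedy local search: change one spot's entity while the objective improves."""
    # Start from the best purely local choice for every spot.
    assign = {spot: max(cands, key=cands.get) for spot, cands in local.items()}
    best = objective(assign, local, coherence, weight)
    improved = True
    while improved:
        improved = False
        for spot, cands in local.items():
            for entity in cands:
                if entity == assign[spot]:
                    continue
                trial = dict(assign)
                trial[spot] = entity
                value = objective(trial, local, coherence, weight)
                if value > best + 1e-12:  # accept strict improvements only
                    assign, best, improved = trial, value, True
    return assign, best


# Toy data (invented): two spots, candidate entities with local scores.
local = {
    "jaguar": {"Jaguar_(animal)": 0.6, "Jaguar_Cars": 0.5},
    "engine": {"Engine": 0.7},
}
# Symmetric coherence between entity pairs; missing pairs count as 0.
coherence = {frozenset(("Jaguar_Cars", "Engine")): 0.5}

assignment, score = hill_climb(local, coherence)
print(assignment, score)
```

On this toy input the purely local choice for "jaguar" is the animal sense, and the coherence term with "Engine" moves it to the car sense. Setting `weight=0.0` keeps the local choice. Hill-climbing of this kind only reaches a local optimum, which is consistent with the abstract's note that exact optimization is NP-hard.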