SC-Block: Supervised Contrastive Blocking within Entity Resolution Pipelines |
|
Course URL: | https://videolectures.net/eswc2024_brinkmann_resolution_pipelines... |
Lecturer: | Alexander Brinkmann |
Offered by: | ESWC 2024 |
Date: | 2024-06-18 |
Language: | English |
Abstract: | Millions of websites use the schema.org vocabulary to annotate structured data describing products, local businesses, or events within their HTML pages. Integrating schema.org data from the Semantic Web places distinct requirements on entity resolution methods: (1) the methods must scale to millions of entity descriptions and (2) the methods must be able to deal with the heterogeneity that results from a large number of data sources. To scale to numerous entity descriptions, entity resolution methods combine a blocker for candidate pair selection and a matcher for the fine-grained comparison of the pairs in the candidate set. This paper introduces SC-Block, a blocking method that uses supervised contrastive learning to cluster entity descriptions in an embedding space. The embedding enables SC-Block to generate small candidate sets even for use cases that involve a large number of unique tokens within entity descriptions. To measure the effectiveness of blocking methods for Semantic Web use cases, we present a new benchmark, WDC-Block. WDC-Block requires blocking product offers from 3,259 e-shops that use the schema.org vocabulary. The benchmark has a maximum Cartesian product of 200 billion pairs of offers and a vocabulary size of 7 million unique tokens. Our experiments using WDC-Block and other blocking benchmarks demonstrate that SC-Block produces candidate sets that are on average 50% smaller than the candidate sets generated by competing blocking methods. Entity resolution pipelines that combine SC-Block with state-of-the-art matchers finish 1.5 to 4 times faster than pipelines using other blockers, without any loss in F1 score. |
Keywords: | SC-Block; entity resolution; supervised contrastive blocking |
Source: | VideoLectures.NET |
Data collected: | 2024-08-13:liyq |
Last reviewed: | 2024-08-13:liyq |
Views: | 5 |
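
The blocker-plus-matcher pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function below is a toy stand-in (hashed character trigrams) for an encoder trained with supervised contrastive learning, and `block`, `jaccard_match`, `resolve`, and the sample offers are hypothetical names and data chosen for illustration. The sketch only shows the general idea that a nearest-neighbour search in an embedding space yields a candidate set much smaller than the Cartesian product, which a separate matcher then compares pair by pair.

```python
import math
import zlib


def embed(text, dim=256):
    """Toy encoder (assumption): hashed character-trigram counts, L2-normalised.

    A real system would use a learned encoder here instead.
    """
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def block(left, right, k=2):
    """Blocker: keep, for each left record, its k nearest right records."""
    right_vecs = [embed(r) for r in right]
    candidates = set()
    for i, record in enumerate(left):
        query = embed(record)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query, rv)), j)
            for j, rv in enumerate(right_vecs)
        ]
        scored.sort(key=lambda s: (-s[0], s[1]))
        candidates.update((i, j) for _, j in scored[:k])
    return candidates


def jaccard_match(a, b, threshold=0.5):
    """Toy matcher (assumption): token-level Jaccard similarity with a cutoff."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return bool(union) and len(tokens_a & tokens_b) / len(union) >= threshold


def resolve(left, right, k=2):
    """Pipeline: run the matcher only on pairs that survive blocking."""
    return sorted(
        (i, j) for i, j in block(left, right, k)
        if jaccard_match(left[i], right[j])
    )


# Hypothetical product offers, loosely modelled on schema.org product titles.
left_offers = [
    "Apple iPhone 14 128GB Midnight",
    "Samsung Galaxy S23 256GB Black",
]
right_offers = [
    "Samsung Galaxy S23 (256GB) black",
    "Logitech MX Master 3S mouse",
    "Apple iPhone 14 128GB Midnight",
    "Apple iPhone 14 case",
]
```

Calling `resolve(left_offers, right_offers, k=1)` runs the matcher on at most two candidate pairs rather than all eight pairs of the Cartesian product. The value of `k` trades candidate-set size against the risk of dropping a true match during blocking.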