Extending Tables with Data from over a Million Websites
Course URL: http://videolectures.net/iswc2014_bizer_extending_tables/
Lecturer: Christian Bizer
Institution: Freie Universität Berlin
Date: 2014-12-09
Language: English
Course description: This Big Data Track submission demonstrates how the BTC 2014 dataset, Microdata annotations from thousands of websites, and millions of HTML tables are used to extend local tables with additional columns. Table extension is a useful operation in a wide range of application scenarios: imagine you are an analyst with a local table describing companies, and you want to extend this table with the headquarters of each company. Or imagine you are a film enthusiast and want to extend a table describing films with attributes such as the director, genre, and release date of each film. The Mannheim Search Joins Engine automatically performs such table extension operations based on a large data corpus gathered from over a million websites that publish structured data in various formats. Given a local table, the Mannheim Search Joins Engine searches the corpus for additional data describing the entities of the input table. The discovered data are then joined with the local table, and their content is consolidated using schema matching and data fusion methods. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. Our experiments show that the Mannheim Search Joins Engine achieves a coverage close to 100% and a precision of around 90% in different application scenarios.
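The search, join, and fusion steps described above can be illustrated with a minimal Python sketch. It is not the Mannheim Search Joins Engine's implementation: the function names (`normalize`, `extend_table`), the corpus layout (dictionaries with `source` and `rows`), and the use of majority voting for data fusion are all assumptions made for illustration. Schema matching is skipped by assuming the corpus tables already share column names with the local table.

```python
from collections import Counter


def normalize(label):
    """Lowercase and collapse whitespace so entity labels can be compared."""
    return " ".join(label.lower().split())


def extend_table(local_rows, key, corpus_tables, target):
    """Add a `target` column to `local_rows` using data found in `corpus_tables`.

    Hypothetical sketch: corpus tables are dicts of the form
    {"source": <site name>, "rows": [<dict>, ...]} whose column names are
    assumed to be aligned with the local table already.
    """
    # Search: collect candidate values for every entity in the local table.
    candidates = {normalize(row[key]): [] for row in local_rows}
    for table in corpus_tables:
        for row in table["rows"]:
            entity = normalize(row.get(key, ""))
            if entity in candidates and row.get(target):
                candidates[entity].append((row[target], table["source"]))

    # Join and fuse: pick the most frequent value and keep its provenance.
    extended = []
    for row in local_rows:
        found = candidates[normalize(row[key])]
        new_row = dict(row)
        if found:
            value, _count = Counter(v for v, _ in found).most_common(1)[0]
            new_row[target] = value
            new_row["_provenance"] = sorted({s for v, s in found if v == value})
        else:
            new_row[target] = None
            new_row["_provenance"] = []
        extended.append(new_row)
    return extended
```

Keeping the contributing sources next to each fused value mirrors the abstract's point that the user can examine the provenance of the added data. Entities with no match in the corpus receive `None`, which lowers coverage rather than precision.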
Keywords: data track; website annotations; data extension
Source: VideoLectures.NET
Data collected: 2023-06-10:chenxin01
Last reviewed: 2023-06-10:chenxin01
Views: 17