Integrating Spreadsheet Data via Accurate and Low-Effort Extraction
Course URL: http://videolectures.net/kdd2014_chen_spreadsheet_data/  
Lecturer: Zhe Chen
Institution: University of Michigan
Date: 2014-10-08
Language: English
Abstract: Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments on two real-world datasets show that a human can obtain an accurate extraction with just 31% of the manual operations required by a standard classification-based technique.
Keywords: data source integration; graphical models
Source: VideoLectures.NET
Data collected: 2020-12-28:zyk
Last reviewed: 2020-12-28:zyk
Views: 39