Bootstrapping Information Extraction from Semi-structured Web Pages
Course URL: http://videolectures.net/ecmlpkdd08_carlson_bief/
Lecturers: Charles Schafer, Andrew Carlson
Institution: Carnegie Mellon University
Date: 2008-10-10
Language: English
Course description: We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method requires annotation of only a few pages from a few sites in the target domain. Thus, after a small investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2-5 pages for 4-6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.
Keywords: semi-structured web pages; structured records; automatic extraction
Source: VideoLectures.NET
Last reviewed: 2020-06-22:chenxin
Views: 64