

Robust Web Extraction, A Principled Approach
Course URL: http://videolectures.net/akbc2010_bohannon_rwepa/
Lecturer: Philip Bohannon
Institution: Yahoo! Inc.
Date: Not available.
Language: English
Course description: On script-generated web sites, many documents share a common HTML tree structure, which allows wrappers to extract the information of interest effectively. The scripts, and thus the tree structure, evolve over time, so wrappers break repeatedly and are costly to maintain. In this paper, we explore a novel approach: we use temporal snapshots of web pages to develop a tree-edit model of HTML, and use this model to improve wrapper construction. We view changes to the tree structure as the result of a series of edit operations: deleting nodes, inserting nodes, and substituting the labels of nodes. The tree structures evolve by choosing these edit operations stochastically. Our model is attractive in that the probability that a source tree has evolved into a target tree can be estimated efficiently, in time quadratic in the size of the trees, making it a potentially useful tool for a variety of tree-evolution problems. We give an algorithm to learn the probabilistic model from training examples consisting of pairs of trees, and apply this algorithm to collections of web-page snapshots to derive HTML-specific tree-edit models. Finally, we describe a novel wrapper-construction framework that takes the tree-edit model into account, and compare the quality of the resulting wrappers to that of traditional wrappers on synthetic and real HTML document examples.
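The idea of estimating an evolution probability efficiently can be illustrated with a deliberately simplified sketch. The code below is not the tree-edit model from the talk: it works on a flat sequence of tag labels (for example, a root-to-node path) rather than on a tree, and the function name `evolve_probability`, the generative process, and all probability values are assumptions made up for illustration. It sums over every combination of insertions, deletions, and relabellings with a dynamic program whose cost is proportional to the product of the two sequence lengths.

```python
from typing import List, Sequence


def evolve_probability(
    source: Sequence[str],
    target: Sequence[str],
    alphabet_size: int,
    p_ins: float = 0.05,
    p_del: float = 0.05,
    p_sub: float = 0.05,
) -> float:
    """Probability that `source` evolves into `target` under a toy edit model.

    Assumed process (hypothetical): before each source label, and once more
    at the end, new labels are inserted one at a time, each with probability
    p_ins and drawn uniformly from the alphabet. Each source label is then
    deleted with probability p_del; otherwise it is relabelled with
    probability p_sub (uniformly to another label) or kept unchanged.
    """
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    n, m = len(source), len(target)
    q_ins = 1.0 / alphabet_size   # chance an inserted label is a given one
    advance = 1.0 - p_ins         # chance of not inserting at this point

    # f[i][j]: probability that the first i source labels have produced
    # exactly the first j target labels.
    f: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if j > 0:  # target[j-1] was inserted
                total += f[i][j - 1] * p_ins * q_ins
            if i > 0:  # source[i-1] was deleted
                total += f[i - 1][j] * advance * p_del
            if i > 0 and j > 0:  # source[i-1] was kept or relabelled
                if source[i - 1] == target[j - 1]:
                    emit = 1.0 - p_sub
                else:
                    emit = p_sub / (alphabet_size - 1)
                total += f[i - 1][j - 1] * advance * (1.0 - p_del) * emit
            f[i][j] = total
    # The process must also stop inserting after the last source label.
    return f[n][m] * advance


if __name__ == "__main__":
    old_path = ["html", "body", "div", "table", "td"]
    new_path = ["html", "body", "div", "div", "table", "td"]
    print(evolve_probability(old_path, old_path, alphabet_size=10))
    print(evolve_probability(old_path, new_path, alphabet_size=10))
```

In a wrapper setting, a score of this kind could in principle be used to prefer extraction rules that are likely to survive probable edits; how the actual framework does this is described in the talk, not in this sketch.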
Keywords: web pages; principled approach
Source: VideoLectures.NET
Last reviewed: 2020-01-13 (chenxin)
Views: 41