Joint Training for Open-domain Extraction on the Web: Exploiting Overlap when Supervision is Limited
Course URL: http://videolectures.net/wsdm2011_gupta_jto/  
Lecturer: Rahul Gupta
Institution: Indian Institute of Technology
Date: 2011-08-09
Language: English
Course description: We consider the problem of jointly training structured models for extraction from multiple web sources whose records partially overlap in content. This has important applications in open-domain extraction, e.g., a user materializing a table of interest from multiple relevant unstructured sources, or a site like Freebase augmenting an incomplete relation by extracting more rows from web sources. Such applications require extraction over arbitrary domains, so one cannot use a pre-trained extractor or demand a large labeled dataset. We propose to overcome this lack of supervision by using content overlap across the related web sources. Existing methods of exploiting overlap were developed under settings that do not generalize easily to the scale and diversity of overlap seen on web sources. We present an agreement-based learning framework that jointly trains the models by biasing them to agree on the agreement regions, i.e., shared text segments. We present alternatives within our framework that trade off tractability, robustness to noise, and the extent of agreement enforced, and we propose a scheme for partitioning agreement regions that leads to efficient training while maximizing overall accuracy. Further, we present a principled scheme to discover low-noise agreement regions in unlabeled data across multiple sources. Through extensive experiments over 58 different extraction domains, we establish that our framework provides significant boosts over uncoupled training and outperforms alternatives such as collective inference, staged training, and multi-view learning.
Keywords: computer science; Web mining; structured models
Source: VideoLectures.NET
Last reviewed: 2020-06-15:heyf
Views: 32