Open Information Extraction from the Web |
|
Course URL: | http://videolectures.net/akbcwekex2012_etzioni_information_extrac... |
Lecturer: | Oren Etzioni |
Institution: | Allen Institute for Artificial Intelligence |
Date: | 2012-07-13 |
Language: | English |
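Abstract (translated from Chinese): | Information extraction (IE) has traditionally concentrated on precise, narrow, pre-specified requests over small homogeneous corpora, such as extracting the location and time of seminars from a set of announcements. Moving to a new domain requires the user to name the target relations and to write new extraction rules by hand or hand-label new training examples, and this manual effort grows linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm in which the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without any human input. It also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system in which tuples are assigned a probability and indexed to support efficient extraction and exploration through user queries. We report experiments on a corpus of more than 9,000,000 Web pages comparing TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves a 33% error reduction on a comparable set of extractions. Moreover, in the time KNOWITALL needs to extract a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts covering orders of magnitude more relations, discovered on the fly. Statistics on TEXTRUNNER’s 11,000,000 highest-probability tuples show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions. |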
Course description: | Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER’s 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions. |
Keywords: | Web; open information |
Source: | VideoLectures.NET |
Last reviewed: | 2020-06-08:heyf |
Views: | 29 |
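
The abstract describes a pipeline of three steps: a single pass over the corpus that emits relational tuples, a probability assigned to each tuple, and an index that supports user queries. The sketch below is only a toy illustration of that shape and is not TEXTRUNNER's method, which the abstract does not detail. The regular expression `TRIPLE`, the per-sighting probability `P_SINGLE`, the noisy-or style scoring, and the helper names `extract`, `build`, and `query` are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Assumed toy pattern: capitalized phrase, lowercase phrase, capitalized phrase.
TRIPLE = re.compile(
    r"([A-Z]\w*(?: [A-Z]\w*)*) "   # arg1: run of capitalized tokens
    r"([a-z]+(?: [a-z]+)*) "       # relation: run of lowercase tokens
    r"([A-Z]\w*(?: [A-Z]\w*)*)"    # arg2: run of capitalized tokens
)

# Assumed chance that one sighting of a tuple is correct (hypothetical value).
P_SINGLE = 0.5


def extract(sentence):
    """Return an (arg1, relation, arg2) tuple, or None if the pattern fails."""
    m = TRIPLE.search(sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None


def build(corpus):
    """Make one pass over the corpus; return tuple scores and a token index."""
    counts = Counter(t for t in map(extract, corpus) if t)
    # Noisy-or style score: repeated sightings raise confidence.
    scores = {t: 1 - (1 - P_SINGLE) ** n for t, n in counts.items()}
    index = defaultdict(set)
    for t in scores:
        for field in t:
            for token in field.lower().split():
                index[token].add(t)
    return scores, index


def query(term, scores, index):
    """Return tuples mentioning the term, highest score first."""
    hits = index.get(term.lower(), set())
    return sorted(hits, key=lambda t: scores[t], reverse=True)


corpus = [
    "Thomas Edison was born in Ohio.",
    "Thomas Edison was born in Ohio.",
    "Paris is the capital of France.",
    "it rained all day.",
]
scores, index = build(corpus)
for hit in query("edison", scores, index):
    print(hit, scores[hit])
```

No relation name is supplied in advance: "was born in" and "is the capital of" both come out of the text itself, which is the property the abstract calls discovering relations on the fly. A real system would need far more robust extraction and scoring than this pattern match.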