Populating the Semantic Web by Macro-Reading Internet Text
Course URL: http://videolectures.net/iswc09_mitchell_ptsw/
Lecturer: Tom Mitchell
Institution: Carnegie Mellon University
Date: 2009-11-24
Language: English
Course description: A key question for the future of the semantic web is "how will we acquire structured information to populate the semantic web on a vast scale?" One approach is to enter this information manually. A second approach is to take advantage of the great deal of structured information already present in various databases, and to develop common ontologies, publishing standards, and reward systems to make this data widely accessible. We consider here a third approach: developing software that automatically extracts structured information from unstructured text present on the web. This talk will survey attempts to extract structured knowledge from unstructured text, and will focus on an approach with three characteristics that we hypothesize make it viable. First, in contrast to the very difficult problem of reading information from a single document, we consider the much easier problem of reading hundreds of millions of documents simultaneously, so that our system can extract facts that are stated many times by combining evidence from many documents. Second, our system begins with a given ontology that defines the types of information to be extracted, enabling it to focus its effort and to ignore most of the text, which is irrelevant to the target ontology. Third, the system uses a new class of semi-supervised learning algorithms to learn how to extract information from web pages; these algorithms are designed to achieve greater accuracy when given more complex ontologies. Our experiments show that this approach can produce knowledge bases containing tens of thousands of facts to populate given ontologies with approximately 90% accuracy, starting with only a handful of labeled training examples and 200 million unlabeled web pages.
Keywords: semantic web; unstructured text; semi-supervised learning algorithms
Source: VideoLectures.NET
Last reviewed: 2019-05-05 (lxf)
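
The first two ideas in the description (combining evidence from many documents, and restricting extraction to a given ontology) can be illustrated with a toy sketch. This is not the system presented in the talk: the `ONTOLOGY` table, its hand-written patterns, and the `extract_candidates` and `macro_read` functions are all hypothetical names invented for illustration, and the semi-supervised learning of extractors is not shown at all.

```python
from collections import defaultdict

# Hypothetical toy ontology: predicate name -> surface patterns that signal it.
# In the described approach the extractors are learned; here they are hand-written.
ONTOLOGY = {
    "locatedIn": [" is located in ", " is a city in "],
    "playsFor": [" plays for "],
}

def extract_candidates(sentence):
    """Yield (subject, predicate, object) triples matched by an ontology pattern."""
    for predicate, patterns in ONTOLOGY.items():
        for pattern in patterns:
            if pattern in sentence:
                subject, _, obj = sentence.partition(pattern)
                yield subject.strip(), predicate, obj.strip(" .")

def macro_read(documents, min_support=2):
    """Keep triples that are supported by at least `min_support` distinct documents."""
    support = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Naive sentence split; text that matches no ontology pattern is ignored.
        for sentence in text.split("."):
            for triple in extract_candidates(sentence):
                support[triple].add(doc_id)
    return {t: len(docs) for t, docs in support.items() if len(docs) >= min_support}

docs = [
    "Pittsburgh is located in Pennsylvania. The weather was nice.",
    "Pittsburgh is a city in Pennsylvania. It has many bridges.",
    "Pittsburgh is located in Ohio.",  # a single noisy statement
]
print(macro_read(docs))
```

With the default `min_support=2`, the claim stated in two documents is kept and the one-off noisy claim is dropped, which is the intuition behind reading many documents at once instead of trusting any single one.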