Can We Learn a Template-Independent Wrapper for News Article Extraction from a Single Training Site?
Course URL: http://videolectures.net/kdd09_wang_cwltiwnaests/  
Lecturer: Junfeng Wang
Institution: Zhejiang University
Date: 2009-09-14
Language: English
Course description: Automatic news extraction from news pages is important in many Web applications, such as news aggregation. However, existing news extraction methods based on template-level wrapper induction have three serious limitations. First, they cannot correctly extract pages belonging to an unseen template. Second, it is costly to maintain up-to-date wrappers for a large number of news websites, because any change to a template may invalidate the corresponding wrapper. Last, they can extract only unformatted plain text, and thus are not user friendly. In this paper, we tackle the problem of template-independent Web news extraction in a user-friendly way. We formalize Web news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed, and correlations between news titles and news bodies are exploited. Our template-independent wrapper can extract news pages from different sites regardless of their templates. Moreover, our approach can extract not only text but also images and animations within the news bodies, and the extracted news articles keep the same visual style as in the original pages. In our experiments, a wrapper learned from 40 pages from a single news site achieved an accuracy of 98.1% on 3,973 news pages from 12 news sites.
Keywords: plain text; machine learning; news
Source: VideoLectures.NET
Last reviewed: 2020-06-20:zyk
Views: 67