

Towards Combining Web Classification and Web Information Extraction: A Case Study
Course URL: http://videolectures.net/kdd09_luo_tcwcwiecs/
Lecturer: Ping Luo
Institution: Chinese Academy of Sciences
Date: 2009-09-14
Language: English
Course description: Web content analysis often has two sequential and separate steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective because errors in Web classification propagate to Web information extraction and eventually accumulate to a high level. In this paper we study the mutual dependencies between these two steps and propose to combine them using a Conditional Random Field (CRF) model. The model simultaneously recognizes the target Web pages and extracts the corresponding metadata. Systematic experiments in OfCourse, our online course search project, show that the model significantly improves the F1 value for both steps. We believe that our model can be easily generalized to many Web applications.
Keywords: Web page classification; information extraction; metadata; conditional random fields
Source: VideoLectures.NET
Last reviewed: 2019-12-21:lxf
Views: 31