
Webpage Understanding: an Integrated Approach
Course URL: http://videolectures.net/kdd07_zhu_wu/
Lecturer: Jun Zhu
Institution: Tsinghua University
Date: 2007-09-14
Language: English
Course description: Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text content inside HTML elements remains an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are no longer directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand text content on webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of the page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Thus, combining page structure and text content understanding leads to an integrated solution for webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.
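Keywords: traditional natural language processing techniques; tag-tree structure; webpage understanding
Source: VideoLectures.NET
Last reviewed: 2020-05-31, 吴雨秋 (volunteer course editor)
Views: 37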