Boilerplate Detection Using Shallow Text Features
Course URL: http://videolectures.net/wsdm2010_kohlschutter_bdu/
Lecturer: Christian Kohlschütter
Institution: Leibniz University Hannover
Date: 2010-10-07
Language: English
Description: In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content, may degrade search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.
Keywords: computer science; mining; search precision
Source: VideoLectures.NET
Last reviewed: 2020-06-01:heyf
Views: 65