
Detecting Duplicate Web Documents using Clickthrough Data
Course URL: http://videolectures.net/wsdm2011_radlinski_duc/  
Lecturer: Filip Radlinski
Institution: Microsoft
Date: 2011-08-09
Language: English
Course description: The web contains many duplicate and near-duplicate documents. Given that user satisfaction is negatively affected by redundant information in search results, a significant amount of research has been devoted to developing duplicate detection algorithms. However, most such algorithms rely solely on document content to detect duplication, ignoring the fact that a primary goal of duplicate detection is to identify documents that contain redundant information with respect to a particular user query. Similarly, although query-dependent result diversification algorithms compute a query-dependent ranking, they tend to do so on the basis of a query-independent content similarity score. In this paper, we bridge the gap between query-dependent redundancy and query-independent duplication by showing how user click behavior following a query provides evidence about the relative novelty of web documents. While most previous work on interpreting user clicks on search results has assumed that they reflect just result relevance, we show that clicks also provide information about duplication between web documents since users consider search results in the context of previously seen documents. Moreover, we find that duplication explains a substantial amount of presentation bias observed in clicking behavior. We identify three distinct types of redundancy that commonly occur on the web and show how click data can be used to detect these different types.
Keywords: documents; detection algorithms; computer science
Source: VideoLectures.NET
Last reviewed: 2020-04-13:chenxin
Views: 20