Improving Quality of Training Data for Learning to Rank Using Click-Through Data
Course URL: http://videolectures.net/wsdm2010_li_iqot/
Lecturer: Hang Li
Institution: Microsoft
Date: 2010-10-12
Language: English
Course description: In information retrieval, the relevance of documents with respect to queries is usually judged by humans, and these judgments are used in the evaluation and/or learning of ranking functions. Previous work has shown that a certain level of noise in relevance judgments has little effect on evaluation, especially for comparison purposes. Recently, learning to rank has become one of the major means of creating ranking models, in which the models are learned automatically from data derived from a large number of relevance judgments. As far as we know, there has been no previous work on the quality of training data for learning to rank, and this paper studies the issue. Specifically, we address three problems. First, we show that the quality of training data labeled by humans has a critical impact on the performance of learning to rank algorithms. Second, we propose detecting relevance judgment errors using click-through data accumulated at a search engine. Two discriminative models, the sequential dependency model and the full dependency model, are proposed for the detection. Both models consider the conditional dependency of relevance labels and are thus more powerful than the conditionally independent model previously proposed for other tasks. Finally, we verify that training data in which errors have been detected and corrected by our method improves the performance of learning to rank algorithms.
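To make the idea of checking human labels against click-through data concrete, here is a minimal sketch. It is not the sequential or full dependency model from the talk: it is a simple pairwise heuristic that counts how often a document's label ordering is contradicted by click-through rates. The `Judged` record, the 0 to 4 label scale, the smoothing priors, and the `min_impressions` and `margin` thresholds are all assumptions made for illustration, and the sketch ignores position bias in clicks.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Judged:
    """One judged document for a single query (hypothetical record layout)."""
    doc_id: str
    label: int        # human relevance grade; a 0..4 scale is assumed here
    impressions: int  # times the document was shown for the query
    clicks: int       # times it was clicked


def smoothed_ctr(doc, prior_clicks=1.0, prior_impressions=10.0):
    """Click-through rate with additive smoothing (priors are assumptions)."""
    return (doc.clicks + prior_clicks) / (doc.impressions + prior_impressions)


def count_contradictions(docs, min_impressions=50, margin=0.15):
    """Count, per document, the pairs where labels and clicks disagree.

    A pair is contradictory when the document with the higher human label
    has a smoothed CTR that is lower than the other's by more than `margin`.
    Documents with too few impressions are skipped as unreliable.
    """
    reliable = [d for d in docs if d.impressions >= min_impressions]
    counts = Counter()
    for a, b in combinations(reliable, 2):
        if a.label == b.label:
            continue
        hi, lo = (a, b) if a.label > b.label else (b, a)
        if smoothed_ctr(lo) - smoothed_ctr(hi) > margin:
            counts[hi.doc_id] += 1
            counts[lo.doc_id] += 1
    return counts


# Made-up example data for one query.
docs = [
    Judged("d1", label=4, impressions=1000, clicks=20),
    Judged("d2", label=1, impressions=1000, clicks=450),
    Judged("d3", label=3, impressions=1000, clicks=400),
    Judged("d4", label=0, impressions=5, clicks=4),
]
suspects = count_contradictions(docs)
print(suspects.most_common())
```

In this made-up data, `d1` carries the highest label but receives far fewer clicks than `d2` and `d3`, so it takes part in the most contradictions and would be the first candidate for re-judging. Because each pair is scored on its own, the heuristic is closer in spirit to a conditionally independent baseline than to the dependency models described above.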
Keywords: computer science; information retrieval; models
Source: VideoLectures.NET
Last reviewed: 2020-04-01 (chenxin)