Rule-Based Active Sampling for Learning to Rank
Course URL: http://videolectures.net/ecmlpkdd2011_silva_rank/
Lecturer: Rodrigo Silva
Institution: Federal University of Minas Gerais
Date: 2011-11-30
Language: English

Course description: Learning to rank (L2R) algorithms rely on a labeled training set to generate a ranking model that can later be used to rank new query results. Producing these labeled training sets is usually very costly, as it requires human annotators to assess the relevance of, or order, the elements in the training set. Recently, active learning alternatives have been proposed to reduce the labeling effort by selectively sampling an unlabeled set. In this paper we propose a novel rule-based active sampling method for learning to rank. Our method actively samples an unlabeled set, selecting new documents to be labeled based on how many relevance inference rules they generate given the previously selected and labeled examples. The smaller the number of generated rules, the more dissimilar and more "informative" a document is with regard to the current state of the labeled set. Unlike previous solutions, our algorithm does not rely on an initial training seed and can be applied directly to an unlabeled dataset. Also in contrast to previous work, we have a clear stopping criterion and do not need to empirically discover the best configuration by running a number of iterations on the validation or test sets. These characteristics make our algorithm highly practical. We demonstrate the effectiveness of our active sampling method on several benchmark datasets, showing that a significant reduction in training-set size is possible. Our method selects as little as 1.1%, and at most 2.2%, of the original training sets, while providing competitive results compared to state-of-the-art supervised L2R algorithms that use the complete training sets.
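To make the selection loop concrete, here is a minimal Python sketch of the greedy sampling idea described above. It is an illustration under stated assumptions, not the paper's implementation: features are assumed normalized to [0, 1], the rule count is approximated by how many discretized feature values a candidate shares with already-labeled documents, and a fixed labeling budget stands in for the paper's actual stopping criterion. The names `count_rules`, `active_sample`, `n_bins`, and `budget` are all hypothetical.

```python
import numpy as np

def count_rules(candidate, labeled, n_bins=10):
    """Approximate rule count for one candidate document.

    Hypothetical proxy: a "rule" fires whenever a discretized
    feature value of the candidate also occurs, for the same
    feature, in some already-labeled document. Fewer rules means
    the candidate is more dissimilar to the labeled set, hence
    more informative. Assumes features lie in [0, 1].
    """
    if not labeled:
        return 0  # no labeled documents yet, so no rules can fire
    cand_bins = np.minimum((candidate * n_bins).astype(int), n_bins - 1)
    lab_bins = np.minimum((np.stack(labeled) * n_bins).astype(int), n_bins - 1)
    # For each feature, check whether any labeled document falls into
    # the same bin as the candidate, then count such features.
    return int((lab_bins == cand_bins).any(axis=0).sum())

def active_sample(unlabeled, budget):
    """Greedy selection loop: repeatedly pick the document that
    generates the fewest rules with respect to the documents
    selected so far. A fixed budget replaces the paper's actual
    stopping criterion for simplicity."""
    pool = [np.asarray(d, dtype=float) for d in unlabeled]
    selected = []
    while pool and len(selected) < budget:
        scores = [count_rules(d, selected) for d in pool]
        best = int(np.argmin(scores))    # fewest rules -> most informative
        selected.append(pool.pop(best))  # would be sent to an annotator
    return selected

# Toy usage: 100 random 5-feature documents, select 5 for labeling.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.random((100, 5))
    chosen = active_sample(docs, budget=5)
    print(len(chosen), "documents selected for labeling")
```

Note that the loop starts from an empty labeled set, mirroring the abstract's claim that no initial training seed is required: on the first pass every rule count is zero, so the first document is chosen arbitrarily, and subsequent picks favor documents dissimilar to those already selected.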
Keywords: learning to rank; active sampling; datasets
Course source: VideoLectures.NET
Data collected: 2021-03-20: zyk
Last reviewed: 2021-03-20: zyk
Views: 61