0


基于优化核散列的可扩展相似性搜索

Scalable Similarity Search with Optimized Kernel Hashing
课程网址: http://videolectures.net/kdd2010_he_sss/  
主讲教师: Junfeng He
开课单位: 哥伦比亚大学
开课时间: 2010-10-01
课程语种: 英语
中文简介:
可扩展的相似性搜索是许多大规模学习或数据挖掘应用程序的核心。最近,许多研究结果表明,一种有前途的方法是创建紧凑且有效的哈希码,以保持数据相似性。通过有效,我们指的是生成的代码中的低相关性(并因此低冗余)。但是,大多数现有的哈希方法仅针对矢量数据设计。在本文中,我们开发了一种新的哈希算法,用于为具有任何内核函数的一般格式的大规模数据创建有效代码,包括向量,图形,序列,集合等的内核。从类似于频谱散列的想法开始,提出了新颖的公式和解决方案,使得可以明确地表示和优化基于内核的散列函数,并且直接应用于计算用于一般格式的新样本的紧凑散列码。此外,我们采用有效的技术,如Nystrom近似,进一步减少索引和搜索的时间和空间复杂性,使我们的算法可扩展到庞大的数据集。我们方法的另一个重要优点是能够根据实际任务要求处理各种类型的相似性,包括特征相似性和语义相似性,如标签一致性。我们使用矢量和非矢量数据集以大规模高达100万个样本评估我们的方法。我们的综合结果表明,所提出的方法优于所有任务的几种最先进的方法,大多数任务都有显着的增益。
课程简介: Scalable similarity search is the core of many large scale learning or data mining applications. Recently, many research results demonstrate that one promising approach is creating compact and efficient hash codes that preserve data similarity. By efficient, we refer to the low correlation (and thus low redundancy) among generated codes. However, most existing hash methods are designed only for vector data. In this paper, we develop a new hashing algorithm to create efficient codes for large scale data of general formats with any kernel function, including kernels on vectors, graphs, sequences, sets and so on. Starting with the idea analogous to spectral hashing, novel formulations and solutions are proposed such that a kernel based hash function can be explicitly represented and optimized, and directly applied to compute compact hash codes for new samples of general formats. Moreover, we incorporate efficient techniques, such as Nystrom approximation, to further reduce time and space complexity for indexing and search, making our algorithm scalable to huge data sets. Another important advantage of our method is the ability to handle diverse types of similarities according to actual task requirements, including both feature similarities and semantic similarities like label consistency. We evaluate our method using both vector and non-vector data sets at a large scale up to 1 million samples. Our comprehensive results show the proposed method outperforms several state-of-the-art approaches for all the tasks, with a significant gain for most tasks.
关 键 词: 哈希方法; 内核函数; 紧凑散列码
课程来源: 视频讲座网
最后编审: 2019-05-11:lxf
阅读次数: 61