0


在磁盘驻留图中的快速最近邻搜索

Fast Nearest Neighbor Search in Disk-resident Graphs
课程网址: http://videolectures.net/kdd2010_sarkar_fnnsdrg/  
主讲教师: Purnamrita Sarkar
开课单位: 卡内基梅隆大学
开课时间: 2010-10-01
课程语种: 英语
中文简介:
链接预测、个性化图形搜索、欺诈检测和许多这样的图形挖掘问题都围绕着对给定查询节点最“相似”的$K$节点的计算。一类广泛使用的相似性度量是基于图上的随机行走,例如个性化的pagerank、命中和上下班时间以及simrank。这些措施有两个基本问题。首先,现有的在线算法通常检查查询节点的局部邻域,当遇到高阶节点时,查询节点的局部邻域速度会明显变慢(在现实图形中常见的现象)。我们证明了将高阶节点转化为汇点只会产生很小的近似误差,同时大大提高了运行时间。第二个问题是当图太大而不能驻留在内存中时,在查询时计算相似性。显而易见的解决方案是将图分割成节点集群,并将每个集群存储在一个磁盘页面上;理想情况下,随机行走很少会跨越集群边界,并导致页面错误。我们在这里的贡献有两方面:(a)我们提出了一种有效的确定性算法,在这种聚集图中找到任何查询节点的k个最近邻(根据个性化的pagerank);(b)我们开发了一种仅使用顺序扫描数据文件的聚集算法(rwdisk)。对DBLP、Citeseer和Live Journal($\sim 90$m edges)等大型公开图表的实证结果表明,将高度节点转化为接收器不仅可以将rwdisk的运行时间提高3$s,还可以平均提高4$s的链路预测精度。我们还表明,RWdisk返回比流行的聚类算法Metis更理想(高电导和小尺寸)的聚类,同时需要更少的内存。最后,我们用于计算最近邻的确定性算法比实际模拟随机遍历所产生的页面错误(系数$5$)要少得多。
课程简介: Link prediction, personalized graph search, fraud detection, and many such graph mining problems revolve around the computation of the most ``similar'' $k$ nodes to a given query node. One widely used class of similarity measures is based on random walks on graphs, e.g., personalized pagerank, hitting and commute times, and simrank. There are two fundamental problems associated with these measures. First, existing online algorithms typically examine the local neighborhood of the query node which can become significantly slower whenever high-degree nodes are encountered (a common phenomenon in real-world graphs). We prove that turning high degree nodes into sinks results in only a small approximation error, while greatly improving running times. The second problem is that of computing similarities at query time when the graph is too large to be memory-resident. The obvious solution is to split the graph into clusters of nodes and store each cluster on a disk page; ideally random walks will rarely cross cluster boundaries and cause page-faults. Our contributions here are twofold: (a) we present an efficient deterministic algorithm to find the k closest neighbors (in terms of personalized pagerank) of any query node in such a clustered graph, and (b) we develop a clustering algorithm (RWDISK) that uses only sequential sweeps over data files. Empirical results on several large publicly available graphs like DBLP, Citeseer and Live-Journal ($\sim 90$ M edges) demonstrate that turning high degree nodes into sinks not only improves running time of RWDISK by a factor of $3$ but also boosts link prediction accuracy by a factor of $4$ on average. We also show that RWDISK returns more desirable (high conductance and small size) clusters than the popular clustering algorithm METIS, while requiring much less memory. Finally our deterministic algorithm for computing nearest neighbors incurs far fewer page-faults (factor of $5$) than actually simulating random walks.
关 键 词: 相似性度量; 聚类算法; 随机游动; 链接预测
课程来源: 视频讲座网
最后编审: 2019-12-21:lxf
阅读次数: 56