0


一种有效的大图比较抽样方案

An Efficient Sampling Scheme For Comparison of Large Graphs
课程网址: http://videolectures.net/mlg07_borgwardt_aess/  
主讲教师: Karsten Michael Borgwardt
开课单位: 马克斯普朗克研究所
开课时间: 2007-09-05
课程语种: 英语
中文简介:
随着新图形结构化数据的生成,图形比较已成为分子生物学,电信,化学信息学和社交网络等应用领域中的一个重要且具有挑战性的问题。最近已经提出图形内核作为解决该问题的理论上合理的方法,并且已经证明可以在基准数据集上实现高精度。不同的图形内核比较输入图形中不同类型的子图形。到目前为止,要比较的子图的选择是临时的,并且通常由运行时考虑因素驱动。没有明确的迹象表明某些类型的子图比其他子图更好。另一方面,比较所有可能的子图已被证明是NP难的,因此实际上是不可行的。这些困难严重限制了图形内核的实际适用性。在本文中,我们尝试纠正这种情况,并使图形内核适用于大型图形和大型数据集上的数据挖掘。我们的出发点是矩阵重建定理,该定理指出任何大小为5或以上的矩阵都可以在给定其所有主要未成年人的情况下重建。通过将其应用于图的邻接矩阵,我们递归地定义图形核并且表明可以通过使用图的所有4个子图的分布来有效地计算它。我们认为,这种分布类似于图的充分统计,特别是当图形很大时。这些子图的详尽枚举非常昂贵,按比例缩放为O(n4)。但是,通过将分布的经验估计与真实分布的偏差联系起来,只需抽样一定数量的子图即可。顺便提一下,我们的界限比类似技术的生物信息学文献中的结果更强。在我们的实验评估中,我们的图形内核在时间和分类精度方面都优于最先进的图形内核。
课程简介: As new graph structured data is being generated, graph comparison has become an important and challenging problem in application areas such as molecular biology, telecommunications, chemoinformatics, and social networks. Graph kernels have recently been proposed as a theoretically sound approach to this problem, and have been shown to achieve high accuracies on benchmark datasets. Different graph kernels compare different types of subgraphs in the input graphs. So far, the choice of subgraphs to compare is rather ad-hoc and is often motivated by runtime considerations. There is no clear indication that certain types of subgraphs are better than the others. On the other hand, comparing all possible subgraphs has been shown to be NP-hard, thus making it practically infeasible. These difficulties seriously limit the practical applicability of graph kernels. In this article, we attempt to rectify the situation, and make graph kernels applicable for data mining on large graphs and large datasets. Our starting point is the matrix reconstruction theorem, which states that any matrix of size 5 or above can be reconstructed given all its principal minors. By applying this to the adjacency matrix of a graph, we recursively define a graph kernel and show that it can be efficiently computed by using the distribution of all size 4 subgraphs of a graph. This distribution, we argue, is similar to a sufficient statistic of the graph, especially when the graph is large. Exhaustive enumeration of these subgraphs is prohibitively expensive, scaling as O(n4). But, by bounding the deviation of the empirical estimates of the distribution from the true distribution, it suffices to sample a fixed number of subgraphs. Incidentally, our bounds are stronger than those found in the bio-informatics literature for similar techniques. In our experimental evaluation, our graph kernel outperforms state-of-the-art graph kernels both in times of time and classification accuracy.
关 键 词: 图形比较; 子图形; 矩阵重建定理
课程来源: 视频讲座网
最后编审: 2020-09-24:dingaq
阅读次数: 15