0


EmbedJoin:通过嵌入进行高效编辑相似性连接

EmbedJoin: Efficient Edit Similarity Joins via Embeddings
课程网址: http://videolectures.net/kdd2017_zhang_embedjoin/  
主讲教师: 张浩宇
开课单位: 印第安纳大学
开课时间: 2017-10-09
课程语种: 英语
中文简介:
我们研究编辑相似性连接的问题,其中给定一组字符串和阈值 K,我们希望输出编辑距离至多为 K 的所有字符串对。编辑相似性连接是数据清理/集成中的基本问题,生物信息学、协同过滤和自然语言处理,并已被确定为数据库系统的原始运算符。这个问题在文献中已经被广泛研究。然而,我们观察到所有现有算法在长字符串和大距离阈值方面都存在不足。 在本文中,我们提出了一种名为 EmbedJoin 的算法,该算法可以很好地缩放字符串长度和距离阈值。我们的算法建立在编辑距离度量嵌入的最新进展之上,与之前的所有方法都有很大不同。我们通过大量实验证明,EmbedJoin 在长字符串和大距离阈值方面明显优于之前的最佳算法。
课程简介: We study the problem of edit similarity joins, where given a set of strings and a threshold value K, we want to output all pairs of strings whose edit distances are at most K. Edit similarity join is a fundamental problem in data cleaning/integration, bioinformatics, collaborative filtering and natural language processing, and has been identified as a primitive operator for database systems. This problem has been studied extensively in the literature. However, we have observed that all the existing algorithms fall short on long strings and large distance thresholds. In this paper we propose an algorithm named EmbedJoin which scales very well with string length and distance threshold. Our algorithm is built on the recent advance of metric embeddings for edit distance, and is very different from all of the previous approaches. We demonstrate via an extensive set of experiments that EmbedJoin significantly outperforms the previous best algorithms on long strings and large distance thresholds.
关 键 词: 高效编辑; 编辑相似性连接; 原始运算符
课程来源: 视频讲座网
数据采集: 2023-12-26:wujk
最后编审: 2023-12-26:wujk
阅读次数: 22