Efficient Indexing of Repeated n-Grams
Course URL: http://videolectures.net/wsdm2011_huston_eir/
Lecturer: Samuel Huston
Institution: University of Massachusetts
Date: 2011-08-09
Language: English
Course description: The identification of repeated n-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated n-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multi-processor approaches. The experiments show that our method offers an important tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of n, and provides a uniform workload balance across the set of available processors.
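To make the term "repeated n-gram" concrete, here is a minimal in-memory sketch in Python. It is a naive counting baseline for illustration only, not the method presented in the lecture, and it does not address the scaling or distribution concerns the description raises. The function name `repeated_ngrams` and the sample sentence are assumptions introduced for this example.

```python
from collections import Counter


def repeated_ngrams(tokens, n):
    """Return {n-gram: count} for every n-gram that occurs more than once.

    Hypothetical helper for illustration: it holds all counts in memory,
    so it is only suitable for small inputs.
    """
    # Slide a window of width n over the token list and count each window.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Keep only the n-grams seen at least twice.
    return {gram: c for gram, c in counts.items() if c > 1}


tokens = "to be or not to be that is the question".split()
print(repeated_ngrams(tokens, 2))
```

In this sample sentence the only bigram that occurs twice is ("to", "be"). Because every distinct n-gram is kept in a single in-memory table, this approach does not scale to a collection of the size named in the description.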
Keywords: computer science; Web search; multiprocessors; text recognition
Source: VideoLectures.NET
Last reviewed: 2020-07-13 (yumf)
Views: 32