Content-based Document Routing and Index Partitioning for Scalable Similarity-based Searches in a Large Corpus
Course URL: http://videolectures.net/kdd07_bhagwat_cbdr/
Lecturer: Deepavali Bhagwat
Institution: University of California
Date: 2007-09-12
Language: English
Course description: We present a document routing and index partitioning scheme for scalable similarity-based search of documents in a large corpus. We consider the case in which similarity-based search is performed by finding documents that have features in common with the query document. While it is possible to store all the features of all the documents in one index, this suffers from obvious scalability problems. Our approach is to partition the feature index into multiple smaller partitions that can be hosted on separate servers, enabling scalable and parallel search execution. When a document is ingested into the repository, a small number of partitions are chosen to store the features of the document. Likewise, only a small number of partitions are queried to perform a similarity-based search. Our approach is stateless and incremental. The decision as to which partitions the features of a document should be routed to (for storage at ingestion time, and for similarity-based search at query time) is based solely on the features of the document. Our approach scales very well. We show that executing similarity-based searches over such a partitioned search space has minimal impact on the precision and recall of search results, even though every search consults less than 3% of the total number of partitions.
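The description does not say how partitions are chosen, only that the choice depends solely on the document's features. The Python sketch below illustrates one way such stateless routing could work, under assumptions that are not from the lecture: features are strings, each is hashed with SHA-1, and the few smallest hash values select the partitions (a min-hash style heuristic, so documents that share many features tend to pick the same partitions). The names `route`, `PartitionedIndex`, `NUM_PARTITIONS`, and `FANOUT` are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 64  # hypothetical number of index partitions
FANOUT = 2           # hypothetical number of partitions used per document


def feature_hash(feature: str) -> int:
    """Map a feature string to a stable 64-bit integer."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(features, fanout=FANOUT, num_partitions=NUM_PARTITIONS):
    """Pick partitions from the features alone (no shared routing state).

    Assumption: the `fanout` smallest feature hashes choose the partitions.
    """
    smallest = sorted({feature_hash(f) for f in features})[:fanout]
    return sorted({h % num_partitions for h in smallest})


class PartitionedIndex:
    """In-memory stand-in for feature indexes hosted on separate servers."""

    def __init__(self, num_partitions=NUM_PARTITIONS, fanout=FANOUT):
        self.num_partitions = num_partitions
        self.fanout = fanout
        # One inverted index per partition: feature -> set of document ids.
        self.partitions = [defaultdict(set) for _ in range(num_partitions)]

    def ingest(self, doc_id, features):
        """Store all features of the document in its few chosen partitions."""
        features = set(features)
        for p in route(features, self.fanout, self.num_partitions):
            for f in features:
                self.partitions[p][f].add(doc_id)

    def search(self, features):
        """Query only the routed partitions; rank by number of shared features."""
        features = set(features)
        shared = defaultdict(set)  # doc id -> features shared with the query
        for p in route(features, self.fanout, self.num_partitions):
            for f in features:
                for doc_id in self.partitions[p].get(f, ()):
                    shared[doc_id].add(f)
        ranked = [(doc_id, len(fs)) for doc_id, fs in shared.items()]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Because `route` is a pure function of the feature set, ingestion and querying need no central directory, and new documents can be added incrementally without moving existing ones. Whether this particular heuristic preserves precision and recall as well as the scheme presented in the lecture is not something this sketch establishes.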
Keywords: corpus; document search; document routing
Source: VideoLectures.NET
Last reviewed: 2019-05-08:lxf
Views: 20