0


在压缩文本Web上

On Compressing the Textual Web
课程网址: http://videolectures.net/wsdm2010_ferragina_octt/  
主讲教师: Paolo Ferragina
开课单位: 比萨大学
开课时间: 2010-10-07
课程语种: 英语
中文简介:
现在我们知道如何有效地压缩任何现代搜索引擎的最基本的组件,例如,由网络结构和/或它的使用产生的图表,发布列表,和术语字典。但据我们所知,没有任何研究深入探讨压缩原始网页的问题。许多Web应用程序使用简单的压缩算法;例如gzip,或基于单词的移动到前面或Huffman coders—并得出结论,即使压缩的原始数据也比倒排表占用更多的空间。在本文中,我们研究了在大型Web集合中使用数据压缩的两个典型场景。在第一个场景中,压缩的页面存储在磁盘上,我们只需要支持com- pressed集合的大部分的快速扫描(例如map-reduce范例)。在第二个场景中,我们考虑快速访问压缩集合的各个页面,这些页面分布在许多pc的ram中(例如搜索引擎和矿商)。对于第一个场景,我们提供了最先进的压缩器之间的彻底的实验比较,从而指出可用解决方案的优缺点。对于第二种方案,我们将压缩存储解决方案与压缩自索引的新技术进行比较。我们的结果表明,Web页面比预期的更容易压缩,因此,应该重新考虑这一领域的一些共同信念。我们的结果对于测试方法的大范围和数据集的大小来说是新颖的,并且提供了三方面的贡献:为设计新的压缩存储解决方案提供了一个重要的基线,为面对web页面存储的软件开发人员提供了指南,并对最近关于反列表压缩的数字提供了自然的补充。
课程简介: Nowadays we know how to effectively compress most basic components of any modern search engine, such as, the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms— e.g. gzip, or word-based Move-to-Front or Huffman coders— and conclude that, even compressed, raw data take more space than Inverted Lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the com- pressed collection (such as for map-reduce paradigms). In the second scenario, we consider the fast access to individual pages of the compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors thus indicating pros and cons of the available solutions. For the second scenario, we compare compressed-storage solutions with the new technology of compressed self-indexes. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of datasets, and provide a threefold contribution: a non- trivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on InvertedList-compression.
关 键 词: 计算机科学; 语义Web; 搜索引擎; 压缩算法
课程来源: 视频讲座网
最后编审: 2019-10-28:lxf
阅读次数: 20